Table of Contents
cs.CL
[1] Where did you get that? Towards Summarization Attribution for Analysts
Violet B,John M. Conroy,Sean Lynch,Danielle M,Neil P. Molino,Aaron Wiechmann,Julia S. Yang
Main category: cs.CL
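TL;DR: This paper explores automatic attribution methods that use hybrid summarization (an automatic paraphrase of an extractive summary) to ease attribution, and uses a custom topology to identify the proportions of different attribution-related error types.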
Details
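Motivation: Analysts need attribution for their information sources so that what they report is well founded, which calls for automated sentence-level attribution.
Method: Uses hybrid summarization (an automatic paraphrase of an extractive summary) and a custom topology to analyze the categories and proportions of attribution-related errors.
Result: The approach helps reduce the attribution burden and can identify different types of attribution errors.
Conclusion: Hybrid summarization combined with a custom error-analysis topology is a promising approach to automatic attribution and helps improve the traceability and credibility of summaries.
Abstract: Analysts require attribution, as nothing can be reported without knowing the source of the information. In this paper, we focus on automatic methods for attribution, linking each sentence in the summary to a portion of the source text, which may be in one or more documents. We explore using a hybrid summarization, i.e., an automatic paraphrase of an extractive summary, to ease attribution. We also use a custom topology to identify the proportion of different categories of attribution-related errors.
[2] GMTRouter: Personalized LLM Router over Multi-turn User Interactions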
Encheng Xie,Yihang Sun,Tao Feng,Jiaxuan You
Main category: cs.CL
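TL;DR: Proposes GMTRouter, a personalized LLM routing method based on a heterogeneous graph that models multi-turn interactions among users, LLMs, queries, and responses, enabling effective personalized routing from few-shot data.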
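The heterogeneous graph at the core of this entry can be pictured with a small data structure. The sketch below only illustrates the four node types named in the abstract (user, LLM, query, response); the class `HeteroGraph`, the helper `add_turn`, and the relation names `asks`, `answered_by`, and `generates` are hypothetical and are not taken from the GMTRouter code. The message-passing and preference-learning parts are not shown.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# The four node types named in the abstract.
NODE_TYPES = ("user", "llm", "query", "response")


@dataclass
class HeteroGraph:
    # node type -> set of node ids
    nodes: dict = field(default_factory=lambda: {t: set() for t in NODE_TYPES})
    # (src_type, relation, dst_type) -> list of (src_id, dst_id) pairs
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node_type, node_id):
        if node_type not in self.nodes:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_type].add(node_id)

    def add_edge(self, src_type, src, relation, dst_type, dst):
        self.add_node(src_type, src)
        self.add_node(dst_type, dst)
        self.edges[(src_type, relation, dst_type)].append((src, dst))


def add_turn(graph, user, llm, query, response):
    """Record one interaction turn. Relation names are assumptions for illustration."""
    graph.add_edge("user", user, "asks", "query", query)
    graph.add_edge("query", query, "answered_by", "response", response)
    graph.add_edge("llm", llm, "generates", "response", response)
```

Keeping edges keyed by a typed triple is one common way to let a later message-passing step treat each relation separately; whether GMTRouter organizes its graph this way is not stated in the abstract.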
Details
Motivation: Existing LLM routing methods fall short on personalization and struggle to capture the complex interactions between users and models; user preference data is also sparse and noisy, which limits how well personalization works.
Method: Models multi-turn user-LLM interactions as a heterogeneous graph with four node types (user, LLM, query, and response) and uses a tailored message-passing mechanism to learn user preferences within a lightweight inductive graph learning framework.
Result: Significantly outperforms strong baselines on multiple datasets, with 0.9% to 21.6% higher accuracy and 0.006 to 0.309 higher AUC, and adapts to new users and evolving preferences using only few-shot data.
Conclusion: GMTRouter effectively addresses data scarcity and complex interaction modeling in personalized LLM routing, and shows good generalization and practicality.
Abstract: Large Language Model (LLM) routing has demonstrated strong capability in balancing response quality with computational cost. As users exhibit diverse preferences, personalization has attracted increasing attention in LLM routing, since even identical queries may require different models to generate responses tailored to individual needs. However, existing approaches are not fully personalized and often fail to capture the complex interactions between specific users and LLMs. Moreover, user preference data is typically scarce, noisy, and inconsistent in format, which limits the effectiveness of methods that rely solely on user-specific data. To address these challenges, we propose GMTRouter, which represents multi-turn user-LLM interactions as a heterogeneous graph with four node types: user, LLM, query, and response, thereby preserving the rich relational structure of the interaction. Through a tailored message-passing mechanism, GMTRouter learns to capture user preferences from few-shot data within a lightweight inductive graph learning framework, enabling effective personalization. Extensive experiments demonstrate that GMTRouter consistently outperforms strong baselines, achieving 0.9 to 21.6 percent higher accuracy and 0.006 to 0.309 higher AUC across multiple datasets. More importantly, we demonstrate that GMTRouter can adapt to new users and evolving preferences using only few-shot data, without extensive fine-tuning. The code for GMTRouter is publicly available at https://github.com/ulab-uiuc/GMTRouter.
[3] The Collective Turing Test: Large Language Models Can Generate Realistic Multi-User Discussions
Azza Bouleimen,Giordano De Marzo,Taehee Kim,Nicolò Pagan,Hannah Metzler,Silvia Giordano,David Garcia
Main category: cs.CL
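TL;DR: Evaluates how well large language models (LLMs) can simulate human group conversations on social media and finds that a substantial share of generated conversations are mistaken for human-written ones, showing both their potential for social simulation and the risk of misuse.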
Details
Motivation: To test the validity and realism of using LLMs to simulate online communities and social media interactions.
Method: Collects authentic human conversations from Reddit, generates artificial conversations on the same topics with two LLMs (Llama 3 70B and GPT-4o), and asks participants to compare them and judge which are authentic.
Result: LLM-generated conversations were mistaken for human-created content 39% of the time; conversations generated by Llama 3 were correctly identified as AI-generated only 56% of the time, close to random guessing.
Conclusion: LLMs can generate social media conversations that are realistic enough to be useful for social simulation, but the result also warns that they could be used to produce inauthentic content.
Abstract: Large Language Models (LLMs) offer new avenues to simulate online communities and social media. Potential applications range from testing the design of content recommendation algorithms to estimating the effects of content policies and interventions. However, the validity of using LLMs to simulate conversations between various users remains largely untested. We evaluated whether LLMs can convincingly mimic human group conversations on social media. We collected authentic human conversations from Reddit and generated artificial conversations on the same topic with two LLMs: Llama 3 70B and GPT-4o. When presented side-by-side to study participants, LLM-generated conversations were mistaken for human-created content 39% of the time. In particular, when evaluating conversations generated by Llama 3, participants correctly identified them as AI-generated only 56% of the time, barely better than random chance. Our study demonstrates that LLMs can generate social media conversations sufficiently realistic to deceive humans when reading them, highlighting both a promising potential for social simulation and a warning message about the potential misuse of LLMs to generate new inauthentic social media content.
[4] Knowledge Graph Analysis of Legal Understanding and Violations in LLMs
Abha Jha,Abel Salinas,Fred Morstatter
Main category: cs.CL
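TL;DR: This study examines the potential and risks of large language models (LLMs) in interpreting Title 18 Section 175 of the US Code (the law governing biological weapons), and proposes combining knowledge graphs with Retrieval-Augmented Generation (RAG) to evaluate LLMs' understanding of the law, their ability to recognize intent, and their safety vulnerabilities. Experiments show that current LLMs have significant shortcomings in legal reasoning and safety, but that improved safety mechanisms and legal reasoning frameworks could make safer, more ethical legal assistance possible.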
Details
Motivation: Although LLMs show promise for legal analysis, they can generate dangerous content (such as guidance for creating biological weapons), which is a serious safety concern. A systematic method is needed to evaluate both their legal understanding and their safety controls.
Method: Builds a knowledge-graph-based representation of the law and uses Retrieval-Augmented Generation (RAG) in experiments that assess how LLMs perform at identifying legal violations, generating prohibited instructions, and judging unlawful intent (mens rea).
Result: Current LLMs show clear weaknesses in accurately understanding statutory text, judging subjective intent, and preventing dangerous outputs, and they readily generate content that violates safety norms.
Conclusion: Relying on current LLMs alone for sensitive legal questions is risky; stronger safety protocols and structured legal reasoning frameworks are needed for safe and reliable legal AI applications.
Abstract: The rise of Large Language Models (LLMs) offers transformative potential for interpreting complex legal frameworks, such as Title 18 Section 175 of the US Code, which governs biological weapons. These systems hold promise for advancing legal analysis and compliance monitoring in sensitive domains. However, this capability comes with a troubling contradiction: while LLMs can analyze and interpret laws, they also demonstrate alarming vulnerabilities in generating unsafe outputs, such as actionable steps for bioweapon creation, despite their safeguards. To address this challenge, we propose a methodology that integrates knowledge graph construction with Retrieval-Augmented Generation (RAG) to systematically evaluate LLMs' understanding of this law, their capacity to assess legal intent (mens rea), and their potential for unsafe applications. Through structured experiments, we assess their accuracy in identifying legal violations, generating prohibited instructions, and detecting unlawful intent in bioweapons-related scenarios. Our findings reveal significant limitations in LLMs' reasoning and safety mechanisms, but they also point the way forward. By combining enhanced safety protocols with more robust legal reasoning frameworks, this research lays the groundwork for developing LLMs that can ethically and securely assist in sensitive legal domains - ensuring they act as protectors of the law rather than inadvertent enablers of its violation.
[5] Diverse Preference Learning for Capabilities and Alignment
Stewart Slocum,Asher Parker-Sartori,Dylan Hadfield-Menell
Main category: cs.CL
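TL;DR: This paper proposes Soft Preference Learning, a new method that addresses the loss of output diversity caused by existing alignment algorithms such as RLHF and DPO. By decoupling the entropy and cross-entropy terms of the KL regularizer, it gives fine-grained control over generation diversity, improving semantic and lexical diversity and the representation of societal viewpoints while preserving alignment.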
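The decoupling mentioned here rests on a standard identity: for discrete distributions, KL(p || q) equals the cross-entropy H(p, q) minus the entropy H(p). The sketch below checks that identity numerically and then shows one way the two terms could be weighted separately. The function `decoupled_penalty` and the coefficient names `alpha` and `beta` are assumptions for illustration; the abstract does not give the paper's exact parameterization.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_i p_i log q_i, in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def decoupled_penalty(p, q, alpha=1.0, beta=1.0):
    """Hypothetical decoupled regularizer: alpha * H(p, q) - beta * H(p).

    With alpha == beta == 1 this reduces to KL(p || q).
    """
    return alpha * cross_entropy(p, q) - beta * entropy(p)
```

Under this assumed form, raising `beta` above `alpha` rewards higher-entropy (more diverse) policies more strongly than the plain KL penalty does, which is the kind of control the summary describes.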
Details
Motivation: Existing alignment algorithms such as RLHF and DPO use KL-divergence regularization, which pushes the model toward majority opinions at the expense of output diversity and limits the ability of LLMs to express diverse societal perspectives. A method that balances alignment and diversity is needed.
Method: Proposes Soft Preference Learning, which decouples the entropy and cross-entropy parts of the KL regularizer in preference learning so that output diversity can be tuned independently. This allows finer-grained control over generation and increases diversity without harming alignment performance.
Result: LLMs trained with Soft Preference Learning reach higher accuracy on difficult repeated-sampling tasks and produce text with greater semantic and lexical diversity; they also represent a wider range of societal viewpoints and show better logit calibration. The method is a Pareto improvement over standard temperature scaling.
Conclusion: Soft Preference Learning mitigates the diversity loss caused by conventional alignment algorithms, improving output diversity and the inclusion of societal viewpoints without sacrificing alignment, and offers a path toward more inclusive and robust language models.
Abstract: The ability of LLMs to represent diverse perspectives is critical as they increasingly impact society. However, recent studies reveal that alignment algorithms such as RLHF and DPO significantly reduce the diversity of LLM outputs. Not only do aligned LLMs generate text with repetitive structure and word choice, they also approach problems in more uniform ways, and their responses reflect a narrower range of societal perspectives. We attribute this problem to the KL divergence regularizer employed in preference learning algorithms. This causes the model to systematically overweight majority opinions and sacrifice diversity in its outputs. To address this, we propose Soft Preference Learning, which decouples the entropy and cross-entropy terms in the KL penalty - allowing for fine-grained control over LLM generation diversity. From a capabilities perspective, LLMs trained using Soft Preference Learning attain higher accuracy on difficult repeated sampling tasks and produce outputs with greater semantic and lexical diversity. From an alignment perspective, they are capable of representing a wider range of societal viewpoints and display improved logit calibration. Notably, Soft Preference Learning resembles, but is a Pareto improvement over, standard temperature scaling.
[6] Chopping Trees: Semantic Similarity Based Dynamic Pruning for Tree-of-Thought Reasoning
Joongho Kim,Xirui Huang,Zarreen Reza,Gabriel Grand,Kevin Zhu,Ryan Lagasse
Main category: cs.CL
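TL;DR: This paper proposes Semantic Similarity-Based Dynamic Pruning (SSDP), a method for making Tree-of-Thought (ToT) reasoning with large language models more efficient. SSDP merges semantically redundant reasoning paths online, substantially reducing the number of searched nodes and speeding up inference while keeping accuracy high.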
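The idea of merging semantically redundant branches can be illustrated with a greedy filter over sibling candidates. This is a minimal sketch under stated assumptions, not the SSDP implementation: the embeddings are hand-written toy vectors rather than the output of a sentence encoder, the greedy keep-first scheme stands in for whatever clustering the authors use, and the threshold value 0.9 and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_redundant(candidates, threshold=0.9):
    """Keep a candidate step only if it is not too similar to any step already kept.

    candidates: list of (step_text, embedding) pairs for sibling nodes.
    Returns the surviving (step_text, embedding) pairs in their original order.
    """
    kept = []
    for text, emb in candidates:
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((text, emb))
    return kept
```

Each pruned sibling removes its whole subtree from the search, which is where a reduction in explored nodes would come from; the accuracy cost depends on how often two steps that look alike in embedding space actually lead to different answers.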
Details
Motivation: Tree-of-Thought (ToT) improves the problem-solving ability of large language models, but it is computationally expensive because of semantic redundancy, where different branches explore equivalent reasoning paths. An efficient mechanism for detecting and removing this redundancy is needed for faster, more scalable reasoning.
Method: Proposes Semantic Similarity-Based Dynamic Pruning (SSDP), a lightweight method that the authors describe as the first to integrate online semantic merging into parallelized tree search. It computes the semantic similarity between reasoning steps in real time and clusters and prunes highly similar paths to reduce repeated exploration.
Result: On reasoning benchmarks including GSM8K and MATH500, SSDP achieves up to a 2.3x speedup over state-of-the-art tree-search baselines and explores 85-90% fewer nodes, while accuracy stays within 5% of the strongest baseline.
Conclusion: SSDP offers a practical solution for efficient, scalable LLM reasoning, substantially lowering the compute cost of ToT reasoning without a significant loss in performance.
Abstract: Tree-of-Thought (ToT) reasoning boosts the problem-solving abilities of Large Language Models (LLMs) but is computationally expensive due to semantic redundancy, where distinct branches explore equivalent reasoning paths. We introduce Semantic Similarity-Based Dynamic Pruning (SSDP), a lightweight method that, to the best of our knowledge, is the first framework to integrate online semantic merging into parallelized tree search, enabling the clustering and pruning of redundant steps in real time. Across reasoning benchmarks, including GSM8K and MATH500, SSDP achieves up to a 2.3x speedup over state-of-the-art tree-search baselines while maintaining competitive accuracy (typically within 5% of the strongest baseline) and reducing the number of explored nodes by 85-90%, demonstrating a practical approach to efficient, scalable LLM reasoning. The implementation of SSDP is publicly available at https://github.com/kimjoonghokim/SSDP.
[7] What About the Scene with the Hitler Reference? HAUNT: A Framework to Probe LLMs' Self-consistency Via Adversarial Nudge
Arka Dutta,Sujan Dutta,Rijul Magu,Soumyajit Datta,Munmun De Choudhury,Ashiqur R. KhudaBukhsh
Main category: cs.CL
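TL;DR: Proposes a framework for testing the factual fidelity of large language models under adversarial nudges, evaluating hallucination by having models generate and then verify true and false statements.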
Details
Motivation: Hallucinations hinder the deployment of large language models in high-stakes domains, so their factual accuracy under adversarial interference needs to be evaluated.
Method: Builds a three-step framework: first, the model generates true and false statements consistent with a specific closed domain; next, the same model verifies those statements; finally, the model's robustness is tested against the false statements it generated itself. Experiments cover two domains, movies and novels, and evaluate five mainstream proprietary models.
Result: Susceptibility to adversarial nudges varies widely across models: Claude shows strong resilience, GPT and Grok moderate resilience, and Gemini and DeepSeek weak resilience.
Conclusion: Current mainstream LLMs are vulnerable to varying degrees when confronted with false information they generated themselves, which poses a potential risk to users who rely on LLMs for information and deserves attention.
Abstract: Hallucinations pose a critical challenge to the real-world deployment of large language models (LLMs) in high-stakes domains. In this paper, we present a framework for stress testing factual fidelity in LLMs in the presence of adversarial nudge. Our framework consists of three steps. In the first step, we instruct the LLM to produce sets of truths and lies consistent with the closed domain in question. In the next step, we instruct the LLM to verify the same set of assertions as truths and lies consistent with the same closed domain. In the final step, we test the robustness of the LLM against the lies generated (and verified) by itself. Our extensive evaluation, conducted using five widely known proprietary LLMs across two closed domains of popular movies and novels, reveals a wide range of susceptibility to adversarial nudges: Claude exhibits strong resilience, GPT and Grok demonstrate moderate resilience, while Gemini and DeepSeek show weak resilience. Considering that a large population is increasingly using LLMs for information seeking, our findings raise alarm.
[8] Self-HarmLLM: Can Large Language Model Harm Itself?
Heehwan Kim,Sungjune Park,Daeseon Choi
Main category: cs.CL
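TL;DR: This paper proposes the Self-HarmLLM scenario, which examines whether Mitigated Harmful Queries (MHQs), ambiguous queries generated by the model itself, can serve as a new attack vector. It verifies across several models and conditions that they can lead to jailbreaks, exposing gaps in existing guardrails and in automated evaluation methods.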
Details
Motivation: Existing defenses mostly assume that attacks come from external input and overlook the risk that a model's own output could become a new attack vector. This paper explores that neglected vulnerability.
Method: Proposes the Self-HarmLLM framework, in which the same model generates an ambiguous query with hidden harmful intent (an MHQ) that is then entered into a new session to test whether it triggers a jailbreak. Experiments are run on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions, and prefix-based automated evaluation is compared with human evaluation.
Result: Zero-shot reaches up to a 52% transformation success rate and a 33% jailbreak success rate; Few-shot reaches up to 65% and 41%, respectively. Automated evaluation overestimates jailbreak success by an average of 52% relative to human evaluation, indicating that it is unreliable.
Conclusion: A model's own output can be an effective attack vehicle, current guardrails have blind spots, and more robust guardrails and evaluation methods need to be designed.
Abstract: Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always assume that an external attacker crafts the harmful query, and the possibility of a model's own output becoming a new attack vector has not been sufficiently explored. In this study, we propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. An MHQ is an ambiguous query whose original intent is preserved while its harmful nature is not directly exposed. We verified whether a jailbreak occurs when this MHQ is re-entered into a separate session of the same model. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions. The results showed up to 52% transformation success rate and up to 33% jailbreak success rate in the Zero-shot condition, and up to 65% transformation success rate and up to 41% jailbreak success rate in the Few-shot condition. By performing both prefix-based automated evaluation and human evaluation, we found that the automated evaluation consistently overestimated jailbreak success, with an average difference of 52%. This indicates that automated evaluation alone is not accurate for determining harmfulness. While this study is a toy-level study based on a limited query set and evaluators, it proves that our method can still be a valid attack scenario. These results suggest the need for a fundamental reconsideration of guardrail design and the establishment of a more robust evaluation methodology.
[9] OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking
Yanhong Li,Tianyang Xu,Kenan Tang,Karen Livescu,David McAllester,Jiawei Zhou
Main category: cs.CL
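TL;DR: Proposes OKBench, a fully automated framework for dynamic knowledge benchmarking that evaluates how well large language models answer questions about continually updated knowledge such as news.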
Details
Motivation: Existing static benchmarks do not reflect how knowledge evolves in a dynamic world, and manual curation cannot keep pace with LLM development.
Method: Builds OKBench, an agentic framework that automates the sourcing, question creation, validation, and distribution of benchmarks, focusing on the news domain to reduce overlap with pretraining data.
Result: Evaluation on a range of open-source and proprietary LLMs shows that retrieval augmentation narrows the performance gap between small and large models on new knowledge and reveals distinct model behaviors when handling new information.
Conclusion: Dynamic, on-demand knowledge benchmarks are important for evaluating LLMs comprehensively, and OKBench provides a sustainable evaluation approach for knowledge-intensive question answering.
Abstract: Knowledge-intensive question answering is central to large language models (LLMs) and is typically assessed using static benchmarks derived from sources like Wikipedia and textbooks. However, these benchmarks fail to capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with rapid LLM advancements. To address these drawbacks, we propose Open Knowledge Bench (OKBench), a fully automated framework for generating high-quality, dynamic knowledge benchmarks on demand. Focusing on the news domain where knowledge updates daily, OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. Our approach democratizes benchmark creation and facilitates thorough evaluation of retrieval-augmented methods by reducing overlap with pretraining data. We evaluate our framework on a wide range of open-source and proprietary LLMs of various sizes and configurations, both with and without retrieval over freshly generated knowledge. Our results reveal distinct model behaviors when confronted with new information and highlight how retrieval narrows the performance gap between small and large models. These findings underscore the importance of evaluating LLMs on evolving knowledge benchmarks.
[10] Retrieval-Augmented Generation of Pediatric Speech-Language Pathology vignettes: A Proof-of-Concept Study
Yilan Liu
Main category: cs.CL
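TL;DR: This study presents a system that combines retrieval-augmented generation (RAG) with curated domain knowledge bases to automatically generate pediatric speech-language pathology (SLP) clinical cases, and demonstrates the feasibility of the approach across several large language models.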
Details
Motivation: Writing SLP clinical cases by hand is slow and laborious, and general-purpose LLMs tend to hallucinate and lack domain expertise, so their output needs extensive expert revision. An automated method that produces high-quality cases consistent with professional standards is needed.
Method: Builds a multi-model RAG-based system that integrates a curated domain knowledge base with engineered prompt templates and supports five commercial and open-source LLMs; designs seven test scenarios spanning different disorder types and grade levels; and applies automated quality assessment with a multi-dimensional rubric.
Result: Demonstrates the technical feasibility of RAG-augmented generation of pediatric SLP cases. Commercial models were slightly better than open-source models, but the latter performed acceptably, suggesting potential for privacy-preserving deployment within institutions. Integrating a curated knowledge base improved alignment with professional guidelines.
Conclusion: The system offers a feasible way to generate SLP teaching cases automatically. Further expert validation and psychometric evaluation are still needed; potential applications include clinical decision support, IEP goal generation, and clinical reflection training.
Abstract: Clinical vignettes are essential educational tools in speech-language pathology (SLP), but manual creation is time-intensive. While general-purpose large language models (LLMs) can generate text, they lack domain-specific knowledge, leading to hallucinations and requiring extensive expert revision. This study presents a proof-of-concept system integrating retrieval-augmented generation (RAG) with curated knowledge bases to generate pediatric SLP case materials. A multi-model RAG-based system was prototyped integrating curated domain knowledge with engineered prompt templates, supporting five commercial (GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 Pro) and open-source (Llama 3.2, Qwen 2.5-7B) LLMs. Seven test scenarios spanning diverse disorder types and grade levels were systematically designed. Generated cases underwent automated quality assessment using a multi-dimensional rubric evaluating structural completeness, internal consistency, clinical appropriateness, and IEP goal/session note quality. This proof-of-concept demonstrates technical feasibility for RAG-augmented generation of pediatric SLP vignettes. Commercial models showed marginal quality advantages, but open-source alternatives achieved acceptable performance, suggesting potential for privacy-preserving institutional deployment. Integration of curated knowledge bases enabled content generation aligned with professional guidelines. Extensive validation through expert review, student pilot testing, and psychometric evaluation is required before educational or research implementation. Future applications may extend to clinical decision support, automated IEP goal generation, and clinical reflection training.
[11] Evaluating DisCoCirc in Translation Tasks & its Limitations: A Comparative Study Between Bengali & English
Nazmoon Falgunee Moon
Main category: cs.CL
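TL;DR: This paper extends the DisCoCirc framework to Bengali and applies it to English-Bengali translation, finding that it is limited by structural differences between the languages and still struggles even with simple sentences; it also discusses the connection between English conjunctions and Boolean logic.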
Details
Motivation: To reassess how effective DisCoCirc is at reducing language bureaucracy, and to apply it to translation between a structurally dissimilar language pair (English and Bengali).
Method: Builds a grammar framework for Bengali based on the DisCoCirc formalism of [4] and applies it to translation between English and Bengali; following [1], it also analyzes the relationship between English conjunctions and Boolean logic.
Result: DisCoCirc handles a large part of the language well, but because of structural differences between English and Bengali it still struggles with relatively simple sentences in practice, falling short of what the earlier work [5] suggested.
Conclusion: DisCoCirc has clear limitations in cross-lingual use and needs further improvement to cope with structural diversity; the paper also points to a connection between conjunctions and logical structure as a direction for future work.
Abstract: In [4], the authors present the DisCoCirc (Distributed Compositional Circuits) formalism for the English language, a grammar-based framework derived from the production rules that incorporates circuit-like representations in order to give a precise categorical theoretical structure to the language. In this paper, we extend this approach to develop a similar framework for Bengali and apply it to translation tasks between English and Bengali. A central focus of our work lies in reassessing the effectiveness of DisCoCirc in reducing language bureaucracy. Unlike the result suggested in [5], our findings indicate that although it works well for a large part of the language, it still faces limitations due to the structural variation of the two languages. We discuss the possible methods that might handle these shortcomings and show that, in practice, DisCoCirc still struggles even with relatively simple sentences. This divergence from prior claims not only highlights the framework's constraints in translation but also suggests scope for future improvement. Apart from our primary focus on English-Bengali translation, we also take a short detour to examine English conjunctions, following [1], showing a connection between conjunctions and Boolean logic.
[12] Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh for Empowering Access to Justice
Azmine Toushik Wasi,Wahid Faisal,Mst Rafia Islam
Main category: cs.CL
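TL;DR: Mina is a multilingual LLM-based legal assistant designed for low-income people in Bangladesh. Using retrieval-augmented generation (RAG) and a chain-of-tools framework, it provides legal drafts, citations, and plain-language explanations; in bar exam evaluations it performed on par with humans, improving the accessibility of legal services.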
Details
Motivation: Low-income people in Bangladesh struggle to obtain legal help because of complex legal language, opaque procedures, and high costs, and existing AI legal assistants lack Bengali-language support and local adaptation.
Method: Develops Mina, which uses multilingual embeddings and a RAG-based chain-of-tools architecture for retrieval, reasoning, translation, and document generation, and delivers context-aware legal services through an interactive chat interface.
Result: On the 2022 and 2023 Bangladesh Bar Council Exams, as evaluated by law faculty, Mina scored 75-80% on the Preliminary MCQs, the Written exam, and a simulated Viva Voce, matching or exceeding average human performance and showing clarity, contextual understanding, and sound legal reasoning.
Conclusion: Mina has potential as a low-cost, multilingual AI legal assistant that automates key legal tasks and widens access to justice, and it serves as a practical example of building domain-specific AI systems in resource-constrained settings.
Abstract: Bangladesh's low-income population faces major barriers to affordable legal advice due to complex legal language, procedural opacity, and high costs. Existing AI legal assistants lack Bengali-language support and jurisdiction-specific adaptation, limiting their effectiveness. To address this, we developed Mina, a multilingual LLM-based legal assistant tailored for the Bangladeshi context. It employs multilingual embeddings and a RAG-based chain-of-tools framework for retrieval, reasoning, translation, and document generation, delivering context-aware legal drafts, citations, and plain-language explanations via an interactive chat interface. Evaluated by law faculty from leading Bangladeshi universities across all stages of the 2022 and 2023 Bangladesh Bar Council Exams, Mina scored 75-80% in Preliminary MCQs, Written, and simulated Viva Voce exams, matching or surpassing average human performance and demonstrating clarity, contextual understanding, and sound legal reasoning. These results confirm its potential as a low-cost, multilingual AI assistant that automates key legal tasks and scales access to justice, offering a real-world case study on building domain-specific, low-resource systems and addressing challenges of multilingual adaptation, efficiency, and sustainable public-service AI deployment.
[13] A Super-Learner with Large Language Models for Medical Emergency Advising
Sergey K. Aityan,Abdolreza Mosaddegh,Rolando Herrero,Haitham Tayyar,Jiang Han,Vikram Sawant,Qi Chen,Rishabh Jain,Aruna Senthamaraikannan,Stephen Wood,Manuel Mersini,Rita Lazzaro,Mario Balzaneli,Nicola Iacovazzo,Ciro Gargiulo Isacco
Main category: cs.CL
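TL;DR: This study evaluates five mainstream large language models (LLMs) on emergency medicine diagnosis and finds accuracies between 58% and 65%, which the authors report as exceeding the accuracy of human doctors. It builds MEDAS, a super-learner that integrates five LLMs (Gemini, Llama, Grok, GPT, and Claude) through a meta-learner and raises diagnostic accuracy to 70%, while at least one of the integrated LLMs produces 85% correct diagnoses, suggesting that combining models can improve emergency diagnosis.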
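The super-learner idea, combining several models' answers through a learned aggregator, can be sketched with an accuracy-weighted vote. This is an assumed scheme for illustration only: the abstract describes the meta-learner as "quite basic" without specifying it, and the function names, model names, and labels below are hypothetical.

```python
from collections import defaultdict


def fit_weights(predictions, truths):
    """Weight each model by its accuracy on held-out cases.

    predictions: {model_name: [predicted_label, ...]}, aligned with truths.
    truths: [true_label, ...]
    """
    if not truths:
        raise ValueError("need at least one labeled case")
    return {
        model: sum(p == t for p, t in zip(preds, truths)) / len(truths)
        for model, preds in predictions.items()
    }


def predict(weights, votes):
    """Return the label with the largest total weight among the models' votes.

    votes: {model_name: predicted_label} for a single new case.
    """
    scores = defaultdict(float)
    for model, label in votes.items():
        scores[label] += weights.get(model, 0.0)
    return max(scores, key=scores.get)
```

With weights learned this way, a single reliable model can outvote two weaker ones, which a plain majority vote would not allow. A stronger meta-learner could also condition on the case itself, which this sketch does not attempt.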
Details
Motivation: Fast, accurate diagnosis is critical in emergency care, but the diagnostic accuracy of human doctors is limited. LLMs have recently shown promise for medical decision support, yet how their performance differs on acute-disease diagnosis is unclear, so a systematic evaluation and an ensemble method for improving accuracy are needed.
Method: Selects five well-known LLMs (Gemini, Llama, Grok, GPT, and Claude), tests their diagnostic responses to real emergency cases, and builds MEDAS, a meta-learning-based super-learner that combines the outputs of the LLMs to improve overall diagnostic accuracy.
Result: The five LLMs reach diagnostic accuracies between 58% and 65%, all above the reported accuracy of human doctors; the MEDAS super-learner reaches 70%, and at least one of the integrated LLMs produces 85% correct diagnoses. The meta-learning ensemble is more accurate than any individual LLM.
Conclusion: Integrating multiple LLMs through meta-learning can improve the accuracy of emergency diagnosis. MEDAS shows the potential of multi-model collaboration for medical decision support and could serve as an assistive tool for improving clinical efficiency and quality.
Abstract: Medical decision-support and advising systems are critical for emergency physicians to quickly and accurately assess patients' conditions and make diagnoses. Artificial Intelligence (AI) has emerged as a transformative force in healthcare in recent years and Large Language Models (LLMs) have been employed in various fields of medical decision-support systems. We studied responses of a group of different LLMs to real cases in emergency medicine. The results of our study on the five most renowned LLMs showed significant differences in the capabilities of Large Language Models for diagnosing acute diseases in medical emergencies, with accuracy ranging between 58% and 65%. This accuracy significantly exceeds the reported accuracy of human doctors. We built a super-learner MEDAS (Medical Emergency Diagnostic Advising System) of five major LLMs (Gemini, Llama, Grok, GPT, and Claude). The super-learner produces higher diagnostic accuracy, 70%, even with a quite basic meta-learner. However, at least one of the integrated LLMs in the same super-learner produces 85% correct diagnoses. The super-learner integrates a cluster of LLMs using a meta-learner capable of learning different capabilities of each LLM to leverage diagnostic accuracy of the model by collective capabilities of all LLMs in the cluster. The results of our study showed that aggregated diagnostic accuracy provided by a meta-learning approach exceeds that of any individual LLM, suggesting that the super-learner can take advantage of the combined knowledge of the medical datasets used to train the group of LLMs.
[14] Learn More, Forget Less: A Gradient-Aware Data Selection Approach for LLM
Yibai Liu,Shihang Wang,Zeming Liu,Zheming Song,Junzhe Wang,Jingjing Liu,Qingjie Liu,Yunhong Wang
Main category: cs.CL
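TL;DR: Proposes GrADS, a self-adaptive gradient-aware data selection method for supervised fine-tuning of large language models. It analyzes gradient information to pick an efficient training subset, surpasses full-data training with only 5% of the data, and substantially mitigates catastrophic forgetting.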
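Gradient-aware selection can be illustrated by ranking training examples by the size of their per-example gradient and keeping the top fraction. The sketch below is a deliberately simplified stand-in: it uses a one-parameter linear model with squared loss so that the gradient has a closed form, and it ranks by magnitude only. The actual GrADS criteria also use the statistical distribution of gradients and are not specified in the abstract, so the function names and the selection rule here are assumptions.

```python
def grad_magnitude(w, x, y):
    """|d/dw of 0.5 * (w*x - y)**2| = |(w*x - y) * x| for a 1-D linear model."""
    return abs((w * x - y) * x)


def select_top_fraction(examples, w, fraction):
    """Keep the given fraction of (x, y) examples with the largest gradient magnitude.

    At least one example is always kept. Ties keep their original relative
    order because Python's sort is stable.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(examples) * fraction))
    ranked = sorted(examples, key=lambda ex: grad_magnitude(w, *ex), reverse=True)
    return ranked[:k]
```

Examples the current model already fits contribute near-zero gradient and are dropped first, which is the intuition behind training on a small, informative subset. For an LLM, the per-example gradient would come from a preliminary training phase rather than a closed-form expression.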
Details
Motivation: Supervised fine-tuning is essential for domain specialization, but it is resource-intensive and prone to catastrophic forgetting, so a more efficient data selection strategy is needed.
Method: Designs self-guided criteria based on the magnitude and statistical distribution of gradients, analyzing gradients from a preliminary training phase to pick the examples that contribute most to learning for fine-tuning.
Result: Across domains such as medicine, law, and finance and across different LLMs, fine-tuning on 5% of the data selected by GrADS already surpasses full-data fine-tuning, 50% of the data yields significant improvements, and catastrophic forgetting is substantially mitigated.
Conclusion: GrADS selects key training examples efficiently, greatly reducing fine-tuning cost, improving domain adaptation, and lessening interference with general capabilities.
Abstract: Although large language models (LLMs) have achieved impressive results across numerous tasks, supervised fine-tuning (SFT) remains essential for adapting these models to specialized domains. However, SFT for domain specialization can be resource-intensive and sometimes leads to a deterioration in performance over general capabilities due to catastrophic forgetting (CF). To address these issues, we propose a self-adaptive gradient-aware data selection approach (GrADS) for supervised fine-tuning of LLMs, which identifies effective subsets of training data by analyzing gradients obtained from a preliminary training phase. Specifically, we design self-guided criteria that leverage the magnitude and statistical distribution of gradients to prioritize examples that contribute the most to the model's learning process. This approach enables the acquisition of representative samples that enhance LLMs' understanding of domain-specific tasks. Through extensive experimentation with various LLMs across diverse domains such as medicine, law, and finance, GrADS has demonstrated significant efficiency and cost-effectiveness. Remarkably, utilizing merely 5% of the selected GrADS data, LLMs already surpass the performance of those fine-tuned on the entire dataset, and increasing to 50% of the data results in significant improvements, with catastrophic forgetting substantially mitigated at the same time. We will release our code for GrADS later.
[15] Detecting Suicidal Ideation in Text with Interpretable Deep Learning: A CNN-BiGRU with Attention Mechanism
Mohaiminul Islam Bhuiyan,Nur Shazwani Kamarudin,Nur Hafieza Ismail
Main category: cs.CL
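TL;DR: Proposes a hybrid deep learning model combining a CNN and a BiGRU to identify suicidal ideation in social media data, using SHAP to improve interpretability; experiments report an accuracy of 93.97%.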
Details
Motivation: Suicide is the second leading cause of death among adolescents, people with past suicide attempts are at higher future risk, and many people express suicidal intent on social media, so effective tools are needed to detect these signals.
Method: Combines a convolutional neural network (CNN) with a bidirectional gated recurrent unit (BiGRU), adds an attention mechanism, and uses SHAP to interpret the model.
Result: On a publicly available dataset, the model reaches 93.97% accuracy, outperforming existing machine learning and deep learning methods.
Conclusion: The hybrid deep learning framework performs well at detecting suicidal ideation, combines high accuracy with interpretability, and is suited to mental health monitoring on social media.
Abstract: Worldwide, suicide is the second leading cause of death for adolescents, with past suicide attempts being an important predictor of future suicide. While some people with suicidal thoughts may try to suppress them, many signal their intentions on social media platforms. To address these issues, we propose a new type of hybrid deep learning scheme, i.e., the combination of a CNN architecture and a BiGRU technique, which can accurately identify the patterns of suicidal ideation from social network (SN) datasets. We also apply Explainable AI methods using SHapley Additive exPlanations (SHAP) to interpret the prediction results and verify the model's reliability. This integration of CNN local feature extraction, BiGRU bidirectional sequence modeling, attention mechanisms, and SHAP interpretability provides a comprehensive framework for suicide detection. Training and evaluation of the system were performed on a publicly available dataset. Several performance metrics were used for evaluating model performance. Our method achieved 93.97% accuracy in experimental results. A comparative study against different state-of-the-art machine learning and deep learning (DL) models and existing literature demonstrates the superiority of our proposed technique over all the competing methods.
[16] Structured Uncertainty guided Clarification for LLM Agents
Manan Suri,Puneet Mathur,Nedim Lipka,Franck Dernoncourt,Ryan A. Rossi,Dinesh Manocha
Main category: cs.CL
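TL;DR: This paper proposes a structured-uncertainty approach (SAGE-Agent) for improving the tool-calling accuracy of LLM agents under ambiguous instructions. It models clarification as a POMDP and selects clarification questions by optimizing EVPI, significantly increasing task coverage while asking fewer questions, and it introduces ClarifyBench, the first multi-turn tool-augmented disambiguation benchmark.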
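Selecting a clarification question by Expected Value of Perfect Information (EVPI) can be worked through on a tiny belief over candidate tool calls. The sketch below assumes a utility of 1 for executing the correct call and 0 otherwise, so the value of acting without asking is the probability of the most likely candidate, and the value after learning one argument is the sum, over its possible answers, of the best candidate consistent with each answer. The flat `cost` parameter, the function names, and the example arguments are hypothetical; the paper's aspect-based cost model and POMDP formulation are not reproduced here.

```python
def evpi(belief, aspect):
    """EVPI of learning the true value of one argument.

    belief: list of (candidate, probability), where candidate is a dict of
    argument name -> value and the probabilities sum to 1.
    """
    act_now = max(p for _, p in belief)
    best_per_answer = {}
    for candidate, p in belief:
        answer = candidate[aspect]
        best_per_answer[answer] = max(best_per_answer.get(answer, 0.0), p)
    return sum(best_per_answer.values()) - act_now


def best_question(belief, aspects, cost=0.0):
    """Return the argument worth asking about, or None if no question pays for itself."""
    net = {a: evpi(belief, a) - cost for a in aspects}
    top = max(net, key=net.get)
    return top if net[top] > 0 else None
```

For a belief of 0.5 on (Paris, Mon), 0.3 on (Paris, Tue), and 0.2 on (Rome, Mon), asking about the date is worth 0.3 and asking about the city 0.2, so the date question is chosen; with a per-question cost above 0.3 the agent would act without asking.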
Details
Motivation: Ambiguous user instructions often cause LLM agents to call tools incorrectly, lowering task success, so a systematic way to model and handle uncertainty in tool calls is needed.
Method: Models uncertainty over tool-call parameters in a structured way, formulates joint tool-argument clarification as a POMDP, uses Expected Value of Perfect Information (EVPI) as the objective for choosing the best question, and adds an aspect-based cost model to avoid redundant questions; it also uses the uncertainty signal for reinforcement learning (GRPO training).
Result: Across domains such as document editing, vehicle control, and travel booking, SAGE-Agent increases task coverage by 7-39% over strong baselines while asking 1.5-2.7 times fewer clarification questions; with uncertainty-weighted training, When2Call accuracy rises from about 36.5% to 65.2% (3B model) and to 62.9% (7B model).
Conclusion: Structured uncertainty is a principled and efficient approach that improves both task success and interaction efficiency for tool-augmented agents in realistic settings.
Abstract: LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7× compared to strong prompting and uncertainty-based baselines. We present ClarifyBench, the first multi-turn tool-augmented disambiguation benchmark with realistic LLM-based user simulation across diverse domains including document editing, vehicle control, and travel booking. Additionally, we demonstrate that structured uncertainty provides effective training signals for reinforcement learning, boosting When2Call accuracy from 36.5% to 65.2% (3B model) and 36.7% to 62.9% (7B model) through uncertainty-weighted GRPO training. These results establish structured uncertainty as a principled, efficient approach for tool-augmented agents, improving both task success and interaction efficiency in real-world scenarios.
[17] Toward Automated Cognitive Assessment in Parkinson's Disease Using Pretrained Language Models
Varada Khanna,Nilay Bhatt,Ikgyu Shin,Sule Tinaz,Yang Ren,Hua Xu,Vipina K. Keloth
Main category: cs.CL
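TL;DR: This study develops and evaluates three natural language processing models for automatically identifying categories of cognitive processes in first-person narratives from people with Parkinson's disease (PD). The fine-tuned Meta-Llama-3-8B-Instruct model performs best overall, suggesting that NLP could support long-term monitoring of cognitive function in PD.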
Details
Motivation: Analyzing how people with Parkinson's disease describe cognitive experiences in their daily lives can reveal cognitive and emotional changes, but extracting this information from unstructured text is difficult because cognitive constructs are abstract and overlapping.
Method: Compares three kinds of NLP model for extracting seven cognition-related categories: a Bio_ClinicalBERT-based span categorization model, a Meta-Llama-3-8B-Instruct model fine-tuned with QLoRA, and GPT-4o mini evaluated in zero-shot and few-shot settings.
Result: The fine-tuned Meta-Llama-3-8B-Instruct model achieved the highest F1-scores (0.74 micro-average, 0.59 macro-average) and did especially well on context-dependent categories such as thought and social interaction. Bio_ClinicalBERT had high precision but low recall; it recognized location and time well but performed poorly on categories such as emotion and thought. GPT-4o mini's performance was limited by the shot setting.
Conclusion: Although the task is hard because cognitive descriptions are abstract and overlapping, with further refinement these NLP systems could enable low-burden, long-term monitoring of cognitive function and serve as a useful complement to neuropsychological assessment in Parkinson's disease.
Abstract: Understanding how individuals with Parkinson's disease (PD) describe cognitive experiences in their daily lives can offer valuable insights into disease-related cognitive and emotional changes. However, extracting such information from unstructured patient narratives is challenging due to the subtle, overlapping nature of cognitive constructs. This study developed and evaluated natural language processing (NLP) models to automatically identify categories that reflect various cognitive processes from de-identified first-person narratives. Three model families, a Bio_ClinicalBERT-based span categorization model for nested entity recognition, a fine-tuned Meta-Llama-3-8B-Instruct model using QLoRA for instruction following, and GPT-4o mini evaluated under zero- and few-shot settings, were compared on their performance on extracting seven categories. Our findings indicated that model performance varied substantially across categories and model families. The fine-tuned Meta-Llama-3-8B-Instruct achieved the highest overall F1-scores (0.74 micro-average and 0.59 macro-average), particularly excelling in context-dependent categories such as thought and social interaction. Bio_ClinicalBERT exhibited high precision but low recall and performed comparably to Llama for some category types such as location and time but failed on other categories such as thought, emotion and social interaction. Compared to conventional information extraction tasks, this task presents a greater challenge due to the abstract and overlapping nature of narrative accounts of complex cognitive processes. Nonetheless, with continued refinement, these NLP systems hold promise for enabling low-burden, longitudinal monitoring of cognitive function and serving as a valuable complement to formal neuropsychological assessments in PD.
[18] BNLI: A Linguistically-Refined Bengali Dataset for Natural Language Inference
Farah Binta Haque,Md Yasin,Shishir Saha,Md Shoaib Akhter Rafi,Farig Sadeque
Main category: cs.CL
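TL;DR: This paper introduces BNLI, a linguistically refined Bengali natural language inference dataset that addresses the annotation errors, ambiguous sentence pairs, and limited linguistic diversity of existing datasets.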
Details
Motivation: Existing Bengali NLI datasets suffer from annotation errors, ambiguous sentence pairs, and insufficient linguistic diversity, which limits effective model training and evaluation.
Method: BNLI was built through a rigorous annotation pipeline that emphasizes semantic clarity and balance across the three classes (entailment, contradiction, neutral), and was benchmarked with multilingual and Bengali-specific Transformer models.
Result: Experiments indicate that BNLI improves both reliability and interpretability and can support research on inference tasks in Bengali and other low-resource languages.
Conclusion: BNLI provides a high-quality data foundation for Bengali NLI and helps advance natural language understanding research for low-resource languages.
Abstract: Despite the growing progress in Natural Language Inference (NLI) research, resources for the Bengali language remain extremely limited. Existing Bengali NLI datasets exhibit several inconsistencies, including annotation errors, ambiguous sentence pairs, and inadequate linguistic diversity, which hinder effective model training and evaluation. To address these limitations, we introduce BNLI, a refined and linguistically curated Bengali NLI dataset designed to support robust language understanding and inference modeling. The dataset was constructed through a rigorous annotation pipeline emphasizing semantic clarity and balance across entailment, contradiction, and neutrality classes. We benchmarked BNLI using a suite of state-of-the-art transformer-based architectures, including multilingual and Bengali-specific models, to assess their ability to capture complex semantic relations in Bengali text. The experimental findings highlight the improved reliability and interpretability achieved with BNLI, establishing it as a strong foundation for advancing research in Bengali and other low-resource language inference tasks.
[19] Beyond Task-Oriented and Chitchat Dialogues: Proactive and Transition-Aware Conversational Agents
Yejin Yoon,Yuri Son,Namyoung So,Minseo Kim,Minsoo Cho,Chanhee Park,Seungshin Lee,Taeuk Kim
Main category: cs.CL
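TL;DR: This paper presents TACT, a dataset for modeling transitions between task-oriented and chitchat dialogue modes, and introduces two new metrics, Switch and Recovery, to evaluate how models handle mode switches. Experiments show that models combined with Direct Preference Optimization (DPO) substantially outperform baselines in intent detection and mode-transition handling, reaching a 70.1% win rate against GPT-4o in human evaluation.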
Details
Motivation: Conventional dialogue systems usually treat task-oriented dialogue and open-ended chitchat separately, yet real conversations move naturally between the two, and existing work lacks the ability to model these transitions. A dataset and evaluation method that support dynamic multi-mode transitions are therefore needed.
Method: The authors build the TACT dataset, which contains a variety of user- and agent-driven mode-transition structures; propose two new evaluation metrics, Switch and Recovery; and train models with Direct Preference Optimization (DPO).
Result: With DPO applied, TACT-trained models reach 75.74% joint mode-intent accuracy and a 70.1% win rate against GPT-4o in human evaluation, with markedly better mode switching and recovery.
Conclusion: Pairing structurally diverse data with DPO effectively improves how dialogue systems handle mode transitions, advancing more proactive, transition-aware conversational agents.
Abstract: Conversational agents have traditionally been developed for either task-oriented dialogue (TOD) or open-ended chitchat, with limited progress in unifying the two. Yet, real-world conversations naturally involve fluid transitions between these modes. To address this gap, we introduce TACT (TOD-And-Chitchat Transition), a dataset designed for transition-aware dialogue modeling that incorporates structurally diverse and integrated mode flows. TACT supports both user- and agent-driven mode switches, enabling robust modeling of complex conversational dynamics. To evaluate an agent's ability to initiate and recover from mode transitions, we propose two new metrics: Switch and Recovery. Models trained on TACT outperform baselines in both intent detection and mode transition handling. Moreover, applying Direct Preference Optimization (DPO) to TACT-trained models yields additional gains, achieving 75.74% joint mode-intent accuracy and a 70.1% win rate against GPT-4o in human evaluation. These results demonstrate that pairing structurally diverse data with DPO enhances response quality and transition control, paving the way for more proactive and transition-aware conversational agents.
[20] BioVerge: A Comprehensive Benchmark and Study of Self-Evaluating Agents for Biomedical Hypothesis Generation
Fuyi Yang,Chenchen Ye,Mingyu Derek Ma,Yijia Xiao,Matthew Yang,Wei Wang
Main category: cs.CL
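TL;DR: This paper presents the BioVerge benchmark and the BioVerge Agent framework, which aim to standardize the use of large language models (LLMs) for exploring biomedical hypothesis generation, combining structured and textual data and using self-evaluation to improve the novelty and relevance of hypotheses.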
Details
Motivation: Existing approaches to biomedical hypothesis generation mostly rely on a single data type or predefined patterns, which limits the discovery of complex relationships; the lack of standardized datasets and execution environments also hinders the use of LLM agents.
Method: The authors build BioVerge, a comprehensive dataset containing historical hypotheses and PubMed literature, and design the ReAct-based BioVerge Agent framework, whose Generation and Evaluation modules support iterative hypothesis generation and self-assessment.
Result: Experiments show that different architectures affect exploration diversity and reasoning strategies; that structured and textual information sources each supply critical context; and that self-evaluation significantly improves the novelty and relevance of hypotheses.
Conclusion: BioVerge and its agent framework provide a standardized, extensible platform for biomedical hypothesis generation and confirm the value of multi-source information fusion and self-evaluation in LLM-driven discovery.
Abstract: Hypothesis generation in biomedical research has traditionally centered on uncovering hidden relationships within vast scientific literature, often using methods like Literature-Based Discovery (LBD). Despite progress, current approaches typically depend on single data types or predefined extraction patterns, which restricts the discovery of novel and complex connections. Recent advances in Large Language Model (LLM) agents show significant potential, with capabilities in information retrieval, reasoning, and generation. However, their application to biomedical hypothesis generation has been limited by the absence of standardized datasets and execution environments. To address this, we introduce BioVerge, a comprehensive benchmark, and BioVerge Agent, an LLM-based agent framework, to create a standardized environment for exploring biomedical hypothesis generation at the frontier of existing scientific knowledge. Our dataset includes structured and textual data derived from historical biomedical hypotheses and PubMed literature, organized to support exploration by LLM agents. BioVerge Agent utilizes a ReAct-based approach with distinct Generation and Evaluation modules that iteratively produce and self-assess hypothesis proposals. Through extensive experimentation, we uncover key insights: 1) different architectures of BioVerge Agent influence exploration diversity and reasoning strategies; 2) structured and textual information sources each provide unique, critical contexts that enhance hypothesis generation; and 3) self-evaluation significantly improves the novelty and relevance of proposed hypotheses.
[21] Hallucinate or Memorize? The Two Sides of Probabilistic Learning in Large Language Models
Junichiro Niimi
Main category: cs.CL
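TL;DR: This study examines hallucination when large language models (LLMs) generate bibliographic references. It finds that the more often a paper is cited, the more accurate the generated record; beyond roughly 1,000 citations, bibliographic information is memorized almost verbatim, although memory interference arises between papers with similar content.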
Details
Motivation: To address the problem of LLMs producing non-existent papers (hallucinations) in citation recommendation, and to investigate whether the underlying knowledge is generated or memorized.
Method: GPT-4.1 was used to generate 100 citations across 20 computer-science domains; the citations were verified manually, and the cosine similarity between generated and authentic metadata was computed, with citation count serving as a proxy for training-data redundancy.
Result: Citation count correlates strongly with factual accuracy; beyond roughly 1,000 citations, bibliographic information is memorized almost verbatim; and memory interference occurs when multiple highly cited papers have similar content.
Conclusion: For frequently cited papers, LLMs tend to recall information directly rather than generate it. There is a threshold at which generalization shifts into memorization, and highly cited papers are retained in the model almost verbatim.
Abstract: Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in citation recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM's ability to correctly produce bibliographic records depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., those that appear more frequently in the pretraining corpus) showing lower hallucination rates. We therefore assume citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record appears in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 citations across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) citation count is strongly correlated with factual accuracy, (ii) bibliographic information becomes almost verbatim memorized beyond roughly 1,000 citations, and (iii) memory interference occurs when multiple highly cited papers share similar content. These findings indicate a threshold where generalization shifts into memorization, with highly cited papers being nearly verbatim retained in the model.
[22] HalluClean: A Unified Framework to Combat Hallucinations in LLMs
Yaxin Zhao,Yu Zhang
Main category: cs.CL
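TL;DR: This paper proposes HalluClean, a lightweight, task-agnostic framework for detecting and correcting hallucinated content in text generated by large language models, using a reasoning-enhanced paradigm to improve factual consistency.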
Details
Motivation: Large language models perform well on many natural language processing tasks but often produce hallucinated content that undermines factual reliability, so effective methods for detection and correction are needed.
Method: The HalluClean framework adopts a reasoning-enhanced paradigm that decomposes the process into planning, execution, and revision stages, and uses minimal task-routing prompts to achieve zero-shot generalization without external knowledge or supervised detectors.
Result: Experiments on five tasks (question answering, dialogue, summarization, math word problems, and contradiction detection) show that HalluClean significantly improves factual consistency and outperforms existing baselines.
Conclusion: HalluClean effectively detects and corrects hallucinations in LLM-generated text, generalizes well, and can help make LLM outputs more trustworthy.
Abstract: Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks: question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.
[23] TiDAR: Think in Diffusion, Talk in Autoregression
Jingyu Liu,Xin Dong,Zhifan Ye,Rishabh Mehta,Yonggan Fu,Vartika Singh,Jan Kautz,Ce Zhang,Pavlo Molchanov
Main category: cs.CL
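TL;DR: TiDAR is a new hybrid architecture that combines the parallel generation of diffusion models with the high-quality output of autoregressive models, achieving efficient inference within a single forward pass and substantially higher throughput at quality comparable to autoregressive models.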
Details
Motivation: Existing methods struggle to combine the high throughput of diffusion models with the high quality of autoregressive models, so a new architecture that merges the strengths of both is needed.
Method: TiDAR uses structured attention masks so that, within a single forward pass, tokens are first drafted in parallel by diffusion (Thinking) and the final outputs are then sampled autoregressively (Talking).
Result: At the 1.5B and 8B scales, TiDAR delivers 4.71x to 5.91x the throughput of autoregressive models and outperforms speculative decoding and existing diffusion models in both efficiency and quality.
Conclusion: TiDAR is the first architecture to match the quality of autoregressive models while greatly increasing generation speed, offering a new paradigm for efficient language generation.
Abstract: Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
[24] EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI
Longfei Zuo,Barbara Plank,Siyao Peng
Main category: cs.CL
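TL;DR: This paper proposes EVADE, a framework that uses large language models to generate and validate explanations in order to detect annotation errors in natural language inference datasets, improving data quality and fine-tuning performance while reducing manual effort.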
Details
Motivation: Under human label variation (HLV), annotation errors are hard to distinguish from plausible label variation. The existing two-round manual annotation approach is costly and has limited coverage, so a more efficient way to identify data errors is needed.
Method: The EVADE framework uses large language models (LLMs) to generate and validate explanations for error detection, and its results are compared with human annotation in terms of distribution comparison, validation overlap, and impact on fine-tuning.
Result: LLM validation brings the explanation distribution closer to that of human annotations, and removing LLM-detected errors improves fine-tuning performance more than removing errors identified by human annotators.
Conclusion: LLM-based error detection can effectively improve the quality of NLP datasets and reduce manual cost, and it has the potential to scale.
Abstract: High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework VARIERR (Weber-Genzel et al., 2024) asks multiple annotators to explain their label decisions in the first round and flag errors via validity judgments in the second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields larger improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.
[25] SpiralThinker: Latent Reasoning through an Iterative Process with Text-Latent Interleaving
Shengmin Piao,Sanghyun Park
Main category: cs.CL
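TL;DR: SpiralThinker is a unified latent reasoning framework that iteratively updates latent representations, enabling extended implicit reasoning without generating additional tokens. A progressive alignment objective and structured annotations keep latent and textual reasoning consistent, and the framework outperforms existing methods on mathematical, logical, and commonsense reasoning tasks.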
Details
Motivation: Existing latent reasoning methods lack mechanisms to ensure that latent representations evolve stably and cannot systematically interleave implicit and explicit reasoning, so a framework that combines the two stably and effectively is needed.
Method: The SpiralThinker framework performs implicit reasoning by iteratively updating latent representations, and introduces a progressive alignment objective and structured annotations to keep latent reasoning consistent with textual output, supporting test-time scaling without increasing the number of generated tokens.
Result: Across multiple mathematical, logical, and commonsense reasoning benchmarks, SpiralThinker achieves the best overall performance among latent reasoning methods. Ablations show that both iteration and alignment are indispensable, that the numbers of latent tokens and iterations have dataset-specific optima, and that appropriate alignment is critical to the iterative process.
Conclusion: SpiralThinker successfully combines iterative computation with latent reasoning, showing that aligned iterative updates can reliably steer reasoning in the latent space and offering an effective, stable approach to implicit reasoning.
Abstract: Recent advances in large reasoning models have been driven by reinforcement learning and test-time scaling, accompanied by growing interest in latent rather than purely textual reasoning. However, existing latent reasoning methods lack mechanisms to ensure stable evolution of latent representations and a systematic way to interleave implicit and explicit reasoning. We introduce SpiralThinker, a unified framework that performs iterative updates over latent representations, enabling extended implicit reasoning without generating additional tokens. A progressive alignment objective combined with structured annotations maintains coherence between latent and textual reasoning. Across mathematical, logical, and commonsense reasoning tasks, SpiralThinker achieves the best overall performance among latent reasoning approaches, consistently surpassing previous methods across all benchmarks. Detailed analyses reveal that both iteration and alignment are indispensable, the numbers of latent tokens and iterations exhibit dataset-specific optima, and appropriate alignment proves critical for an effective iterative process. Overall, SpiralThinker bridges iterative computation and latent reasoning, demonstrating that aligned iterative updates can reliably steer reasoning in the latent space.
[26] Detecting Emotional Dynamic Trajectories: An Evaluation Framework for Emotional Support in Language Models
Zhouxing Tan,Ruochong Xiong,Yulong Wan,Jinlong Ma,Hanlin Xue,Qichun Deng,Haifeng Jing,Zhengtong Zhang,Depei Liu,Shiyuan Luo,Junfei Liu
Main category: cs.CL
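TL;DR: This paper proposes an evaluation framework based on user emotional trajectories for assessing large language models in long-term emotional support, introducing three temporal metrics and a large-scale benchmark of realistic emotional-shift scenarios.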
Details
Motivation: Existing evaluations of LLMs' emotional support ability mostly rely on short, static dialogues and fail to reflect how well models improve and stabilize user emotions over dynamic, long-term interactions, so an evaluation method closer to real psychological support is needed.
Method: Taking a user-centered perspective, the framework models emotional trajectories as a first-order Markov process combined with causally adjusted emotion estimation; builds a benchmark of 328 emotional contexts and 1,152 disturbance events; and constrains model outputs to validated emotion regulation strategies such as situation selection and cognitive reappraisal.
Result: The authors introduce three trajectory-level metrics, Baseline Emotional Level (BEL), Emotional Trajectory Volatility (ETV), and Emotional Centroid Position (ECP), and evaluate a wide range of LLMs, revealing significant differences in emotional support capability across models.
Conclusion: The proposed trajectory-based framework measures LLM performance in long-term emotional support more comprehensively and offers actionable directions for improvement, advancing AI systems for mental health applications.
Abstract: Emotional support is a core capability in human-AI interaction, with applications including psychological counseling, role play, and companionship. However, existing evaluations of large language models (LLMs) often rely on short, static dialogues and fail to capture the dynamic and long-term nature of emotional support. To overcome this limitation, we shift from snapshot-based evaluation to trajectory-based assessment, adopting a user-centered perspective that evaluates models based on their ability to improve and stabilize user emotional states over time. Our framework constructs a large-scale benchmark consisting of 328 emotional contexts and 1,152 disturbance events, simulating realistic emotional shifts under evolving dialogue scenarios. To encourage psychologically grounded responses, we constrain model outputs using validated emotion regulation strategies such as situation selection and cognitive reappraisal. User emotional trajectories are modeled as a first-order Markov process, and we apply causally-adjusted emotion estimation to obtain unbiased emotional state tracking. Based on this framework, we introduce three trajectory-level metrics: Baseline Emotional Level (BEL), Emotional Trajectory Volatility (ETV), and Emotional Centroid Position (ECP). These metrics collectively capture user emotional dynamics over time and support comprehensive evaluation of long-term emotional support performance of LLMs. Extensive evaluations across a diverse set of LLMs reveal significant disparities in emotional support capabilities and provide actionable insights for model development.
[27] A Neurosymbolic Approach to Natural Language Formalization and Verification
Sam Bayless,Stefano Buliani,Darion Cassel,Byron Cook,Duncan Clough,Rémi Delmas,Nafi Diallo,Ferhat Erata,Nick Feng,Dimitra Giannakopoulou,Aman Goel,Aditya Gokhale,Joe Hendrix,Marc Hudak,Dejan Jovanović,Andrew M. Kent,Benjamin Kiesl-Reiter,Jeffrey J. Kuna,Nadia Labai,Joseph Lilien,Divya Raghunathan,Zvonimir Rakamarić,Niloofar Razavi,Michael Tautschnig,Ali Torkamani,Nathaniel Weir,Michael W. Whalen,Jianan Yao
Main category: cs.CL
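TL;DR: This paper proposes a two-stage neurosymbolic framework that uses large language models with human guidance to formalize natural language policies and then verifies the logical correctness of statements through inference-time autoformalization, achieving soundness above 99%.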
Details
Motivation: The stochasticity of large language models limits their use in industries such as finance and healthcare that must follow strict policies, so their decisions need to be more interpretable and logically accurate.
Method: A two-stage neurosymbolic framework: the first stage uses LLMs with optional human guidance to formalize natural language policies; the second stage performs autoformalization at inference time and cross-checks multiple formalizations for semantic equivalence.
Result: On the authors' benchmarks the approach exceeds 99% soundness, with a near-zero false positive rate, and produces auditable logical artifacts.
Conclusion: The framework substantially improves the reliability and transparency of LLM-based policy compliance verification in high-stakes domains, and supports tracing results and improving the original text.
Abstract: Large Language Models perform well at natural language interpretation and reasoning, but their inherent stochasticity limits their adoption in regulated industries like finance and healthcare that operate under strict policies. To address this limitation, we present a two-stage neurosymbolic framework that (1) uses LLMs with optional human guidance to formalize natural language policies, allowing fine-grained control of the formalization process, and (2) uses inference-time autoformalization to validate logical correctness of natural language statements against those policies. When correctness is paramount, we perform multiple redundant formalization steps at inference time, cross checking the formalizations for semantic equivalence. Our benchmarks demonstrate that our approach exceeds 99% soundness, indicating a near-zero false positive rate in identifying logical validity. Our approach produces auditable logical artifacts that substantiate the verification outcomes and can be used to improve the original text.
[28] MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique
Gailun Zeng,Ziyang Luo,Hongzhan Lin,Yuchen Tian,Kaixin Li,Ziyang Gong,Jianxiong Guo,Jing Ma
Main category: cs.CL
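TL;DR: This paper presents MM-CRITIC, a holistic benchmark for evaluating the critique ability of large multimodal models (LMMs) across multiple dimensions, covering 8 main task types and over 500 tasks with 4471 samples. Annotation combines expert-informed answers with GPT-4o to make the evaluation more reliable, and leading LMMs are assessed comprehensively.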
Details
Motivation: Although large multimodal models (LMMs) perform well on tasks such as image captioning and visual reasoning, their critique ability remains underexplored. Making LMMs reliable AI assistants requires systematically evaluating their critique performance in multimodal settings.
Method: The authors build MM-CRITIC, a multi-dimensional benchmark covering three dimensions (basic, correction, and comparison) and 8 task types. Expert-informed ground answers are built into scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques.
Result: Experiments validate the effectiveness of MM-CRITIC, reveal a correlation between response quality and critique, and show that critique difficulty varies across evaluation dimensions. The benchmark can be used to assess the critique ability of a range of LMMs reliably.
Conclusion: MM-CRITIC offers a reliable, systematic framework for evaluating the critique ability of LMMs and helps advance multimodal AI capable of self-improvement.
Abstract: The ability of critique is vital for models to self-improve and serve as reliable AI assistants. While extensively studied in language-only settings, multimodal critique of Large Multimodal Models (LMMs) remains underexplored despite their growing capabilities in tasks like captioning and visual reasoning. In this work, we introduce MM-CRITIC, a holistic benchmark for evaluating the critique ability of LMMs across multiple dimensions: basic, correction, and comparison. Covering 8 main task types and over 500 tasks, MM-CRITIC collects responses from various LMMs with different model sizes and is composed of 4471 samples. To enhance the evaluation reliability, we integrate expert-informed ground answers into scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. Extensive experiments validate the effectiveness of MM-CRITIC and provide a comprehensive assessment of leading LMMs' critique capabilities under multiple dimensions. Further analysis reveals some key insights, including the correlation between response quality and critique, and varying critique difficulty across evaluation dimensions. Our code is available at https://github.com/MichealZeng0420/MM-Critic.
[29] Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition
Chao Wang,Yuqing Cai,Renzeng Duojie,Jin Zhang,Yutong Liu,Nyima Tashi
Main category: cs.CL
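TL;DR: This paper proposes a streaming speech recognition framework for Amdo Tibetan based on a hybrid CTC/Attention architecture with a context-aware dynamic chunking mechanism. By adaptively adjusting chunk widths and integrating a language model, it substantially lowers word error rate and reduces recognition latency.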
Details
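Motivation: To address the context truncation problem of fixed-chunk methods in streaming speech recognition and to improve recognition performance for Tibetan specifically.
Method: A hybrid CTC/Attention architecture is combined with a context-aware dynamic chunking mechanism that adaptively adjusts chunk size based on encoding states. A lexicon grounded in Tibetan orthographic principles is constructed, and an external language model is integrated during decoding to improve semantic consistency and the recognition of long sentences.
Result: The framework achieves a word error rate (WER) of 6.23% on the test set, a 48.15% relative reduction compared with the fixed-chunk baseline, while significantly reducing recognition latency and keeping performance close to global decoding.
Conclusion: The proposed dynamic chunking mechanism effectively alleviates context truncation and improves the accuracy and real-time performance of streaming Amdo Tibetan speech recognition; linguistically informed modeling further strengthens the system.
Abstract: In this work, we propose a streaming speech recognition framework for Amdo Tibetan, built upon a hybrid CTC/Attention architecture with a context-aware dynamic chunking mechanism. The proposed strategy adaptively adjusts chunk widths based on encoding states, enabling flexible receptive fields, cross-chunk information exchange, and robust adaptation to varying speaking rates, thereby alleviating the context truncation problem of fixed-chunk methods. To further capture the linguistic characteristics of Tibetan, we construct a lexicon grounded in its orthographic principles, providing linguistically motivated modeling units. During decoding, an external language model is integrated to enhance semantic consistency and improve recognition of long sentences. Experimental results show that the proposed framework achieves a word error rate (WER) of 6.23% on the test set, yielding a 48.15% relative improvement over the fixed-chunk baseline, while significantly reducing recognition latency and maintaining performance close to global decoding.
[30] Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning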
Wenda Wei,Yu-An Liu,Ruqing Zhang,Jiafeng Guo,Lixin Su,Shuaiqiang Wang,Dawei Yin,Maarten de Rijke,Xueqi Cheng
Main category: cs.CL
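TL;DR: This paper proposes Bi-RAR, a new retrieval-augmented reasoning framework that evaluates intermediate steps in both forward and backward directions, combining an information distance grounded in Kolmogorov complexity with multi-objective reinforcement learning to improve performance on complex reasoning tasks.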
Details
Motivation: Existing retrieval-augmented generation methods have limited effectiveness in complex multi-step reasoning and rely on outcome-based supervision, which gives no guidance for intermediate steps and tends to cause reward hacking and degraded response quality.
Method: The Bi-RAR framework introduces a bidirectional information distance grounded in Kolmogorov complexity to measure the information completeness of each reasoning step, and optimizes the reasoning process with multi-objective reinforcement learning using a cascading reward structure.
Result: Experiments on seven question answering benchmarks show that Bi-RAR outperforms previous methods and enables efficient interaction and reasoning with the search engine during training and inference.
Conclusion: Through bidirectional evaluation and multi-objective optimization, Bi-RAR improves the performance and stability of retrieval-augmented generation in complex multi-step reasoning scenarios.
Abstract: Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi-step reasoning scenarios. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. Most approaches rely on outcome-based supervision, offering no explicit guidance for intermediate steps. This often leads to reward hacking and degraded response quality. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity, approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi-objective reinforcement learning framework with a cascading reward structure that emphasizes early trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during training and inference.
[31] Assessing the Capabilities of LLMs in Humor: A Multi-dimensional Analysis of Oogiri Generation and Evaluation
Ritsu Sakabe,Hwichan Kim,Tosho Hirasawa,Mamoru Komachi
Main category: cs.CL
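TL;DR: Using Oogiri, a form of Japanese improvisational comedy, this paper systematically evaluates how well large language models (LLMs) understand humor along multiple dimensions. LLMs generate humorous responses at a level between low- and mid-tier human performance but show a marked lack of Empathy, and they prioritize Novelty where humans prioritize Empathy, which makes it hard for them to assess humor accurately.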
Details
Motivation: Existing evaluations of LLM humor mostly rely on a single dimension (such as whether something is "funny") and lack a deeper account of humor's many facets. The paper therefore argues that LLMs' ability to generate and judge humor should be evaluated systematically along multiple dimensions.
Method: The authors extend existing Oogiri datasets with data from new sources and LLM-generated responses, then manually annotate the expanded collection on a 5-point scale across six dimensions: Novelty, Clarity, Relevance, Intelligence, Empathy, and Overall Funniness. They use this dataset to assess state-of-the-art LLMs on two tasks, humor generation and multi-dimensional evaluation.
Result: LLM-generated responses fall between low- and mid-tier human performance but are clearly deficient in Empathy. LLMs and humans also differ fundamentally in their evaluation criteria: LLMs prioritize Novelty, whereas humans prioritize Empathy.
Conclusion: The lack of Empathy is a key reason LLMs fail to replicate human humor judgments. Building more sophisticated dialogue systems will require improving models' emotional intelligence, and the authors release their annotated corpus to support further research.
Abstract: Computational humor is a frontier for creating advanced and engaging natural language processing (NLP) applications, such as sophisticated dialogue systems. While previous studies have benchmarked the humor capabilities of Large Language Models (LLMs), they have often relied on single-dimensional evaluations, such as judging whether something is simply "funny." This paper argues that a multifaceted understanding of humor is necessary and addresses this gap by systematically evaluating LLMs through the lens of Oogiri, a form of Japanese improvisational comedy games. To achieve this, we expanded upon existing Oogiri datasets with data from new sources and then augmented the collection with Oogiri responses generated by LLMs. We then manually annotated this expanded collection with 5-point absolute ratings across six dimensions: Novelty, Clarity, Relevance, Intelligence, Empathy, and Overall Funniness. Using this dataset, we assessed the capabilities of state-of-the-art LLMs on two core tasks: their ability to generate creative Oogiri responses and their ability to evaluate the funniness of responses using a six-dimensional evaluation. Our results show that while LLMs can generate responses at a level between low- and mid-tier human performance, they exhibit a notable lack of Empathy. This deficit in Empathy helps explain their failure to replicate human humor assessment. Correlation analyses of human and model evaluation data further reveal a fundamental divergence in evaluation criteria: LLMs prioritize Novelty, whereas humans prioritize Empathy. We release our annotated corpus to the community to pave the way for the development of more emotionally intelligent and sophisticated conversational agents.
[32] One-Topic-Doesn't-Fit-All: Transcreating Reading Comprehension Test for Personalized Learning
Jieun Han,Daniel Lee,Haneul Yoo,Jinsung Yoon,Junyeong Park,Suin Kim,So-Yeon Ahn,Alice Oh
Main category: cs.CL
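TL;DR: This paper proposes a method for generating personalized English reading comprehension tests based on students' interests, using gpt-4o to transcreate content from the RACE-C dataset. Experiments show that personalized materials improve EFL learners' comprehension and motivation.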
Details
Motivation: In EFL education, student engagement and motivation are crucial to reading comprehension, yet conventional reading materials are not personalized and cannot cater to learners' differing interests.
Method: The authors build a structured content transcreation pipeline that combines topic extraction, question classification based on Bloom's taxonomy, and linguistic feature analysis, and uses gpt-4o to generate new passages and questions that are semantically aligned with students' interests while remaining linguistically similar to the originals.
Result: In a controlled experiment with EFL learners in South Korea, students who used personalized reading materials outperformed those who used non-personalized materials in both reading comprehension and motivation retention.
Conclusion: Interest-aligned, personalized reading materials effectively improve EFL learners' reading comprehension and motivation, supporting the potential of generative AI in personalized education.
Abstract: Personalized learning has gained attention in English as a Foreign Language (EFL) education, where engagement and motivation play crucial roles in reading comprehension. We propose a novel approach to generating personalized English reading comprehension tests tailored to students' interests. We develop a structured content transcreation pipeline using OpenAI's gpt-4o, where we start with the RACE-C dataset, and generate new passages and multiple-choice reading comprehension questions that are linguistically similar to the original passages but semantically aligned with individual learners' interests. Our methodology integrates topic extraction, question classification based on Bloom's taxonomy, linguistic feature analysis, and content transcreation to enhance student engagement. We conduct a controlled experiment with EFL learners in South Korea to examine the impact of interest-aligned reading materials on comprehension and motivation. Our results show students learning with personalized reading passages demonstrate improved comprehension and motivation retention compared to those learning with non-personalized materials.
[33] DoPE: Denoising Rotary Position Embedding
Jing Xiong,Liyang Fan,Hui Shen,Zunhai Su,Min Yang,Lingpeng Kong,Ngai Wong
Main category: cs.CL
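TL;DR: This paper proposes DoPE, a training-free denoising method for positional encoding that uses truncated matrix entropy to detect outlier frequency bands in the feature map and reparameterizes the map with a Gaussian distribution for robust context extrapolation, effectively mitigating attention sinks.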
Details
Motivation: Rotary Position Embedding (RoPE) has limitations in long-sequence extrapolation that lead to attention sinks and unbalanced attention patterns, weakening a model's reasoning over long contexts.
Method: The attention map with positional encoding is reinterpreted as a noisy feature map; outlier frequency components are identified using truncated matrix entropy, and the feature map is reparameterized with a parameter-free Gaussian distribution, optimizing positional encoding without any training.
Result: On needle-in-a-haystack and many-shot in-context learning tasks, DoPE significantly improves retrieval accuracy and reasoning stability across contexts of up to 64K tokens and restores balanced attention patterns.
Conclusion: From a denoising perspective, DoPE reveals the cause of the attention sink phenomenon and its connection to truncated matrix entropy, offering a simple and effective approach to length extrapolation for positional encoding.
Abstract: Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Our project page is https://The-physical-picture-of-LLMs.github.io
[34] LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls
Kangning Zhang,Wenxiang Jiao,Kounianhua Du,Yuan Lu,Weiwen Liu,Weinan Zhang,Lei Zhang,Yong Yu
Main category: cs.CL
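TL;DR: This paper presents LoopTool, a fully automated, model-aware data evolution framework that improves the tool-use capabilities of large language models by integrating data generation and model training in a closed loop.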
Details
Motivation: Existing tool learning is limited by static synthetic data pipelines that cannot adapt to a model's specific weaknesses and that let label noise persist, hurting training efficiency. Method: LoopTool comprises three synergistic modules: Greedy Capability Probing (GCP) diagnoses the capabilities the model has mastered and those it fails at; Judgement-Guided Label Verification (JGLV) uses an open-source judge model to find and correct annotation errors; and Error-Driven Data Expansion (EDDE) generates new, challenging samples from the identified failures. The process runs within a low-cost, open-source ecosystem. Result: Experiments show that an 8B model trained with LoopTool significantly surpasses its 32B data generator and sets new state-of-the-art results for its scale on the BFCL-v3 and ACEBench benchmarks. Conclusion: Closed-loop, self-refining data pipelines can markedly enhance the tool-use capabilities of large language models without relying on expensive closed-source APIs. Abstract: Augmenting Large Language Models (LLMs) with external tools enables them to execute complex, multi-step tasks. However, tool learning is hampered by the static synthetic data pipelines where data generation and model training are executed as two separate, non-interactive processes. This approach fails to adaptively focus on a model's specific weaknesses and allows noisy labels to persist, degrading training efficiency. We introduce LoopTool, a fully automated, model-aware data evolution framework that closes this loop by tightly integrating data synthesis and model training. LoopTool iteratively refines both the data and the model through three synergistic modules: (1) Greedy Capability Probing (GCP) diagnoses the model's mastered and failed capabilities; (2) Judgement-Guided Label Verification (JGLV) uses an open-source judge model to find and correct annotation errors, progressively purifying the dataset; and (3) Error-Driven Data Expansion (EDDE) generates new, challenging samples based on identified failures. This closed-loop process operates within a cost-effective, open-source ecosystem, eliminating dependence on expensive closed-source APIs. Experiments show that our 8B model trained with LoopTool significantly surpasses its 32B data generator and achieves new state-of-the-art results on the BFCL-v3 and ACEBench benchmarks for its scale. Our work demonstrates that closed-loop, self-refining data pipelines can dramatically enhance the tool-use capabilities of LLMs.
[35] A Hybrid Search for Complex Table Question Answering in Securities Report
Daiki Shirafuji,Koji Tanaka,Tatsuhiko Saito
Main category: cs.CL
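TL;DR: Proposes a cell extraction method that requires no manual identification: a hybrid retrieval mechanism combining a language model and TF-IDF computes the similarity between the question and each cell to estimate the table headers. It reaches 74.6% accuracy on the NTCIR-18 TQA dataset, outperforming GPT-4o mini.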
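The hybrid scoring idea summarized above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the equal weighting `alpha=0.5`, and the toy table are assumptions, and a character-trigram overlap stands in for the trained language-model similarity, which the summary does not specify.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a real system would use a proper tokenizer."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (a dict) per text, using smoothed IDF."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def trigram_similarity(a, b):
    """Hypothetical stand-in for a language-model score: Jaccard overlap of character trigrams."""
    def grams(s):
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0


def best_header(question, headers, alpha=0.5):
    """Return the index of the header with the highest blended score."""
    vectors = tfidf_vectors(headers + [question])
    q_vec = vectors[-1]
    scores = [
        alpha * trigram_similarity(question, h) + (1 - alpha) * cosine(q_vec, v)
        for h, v in zip(headers, vectors)
    ]
    return max(range(len(headers)), key=scores.__getitem__)


def answer_cell(question, row_headers, col_headers, cells, alpha=0.5):
    """Pick the cell at the intersection of the best-matching row and column headers."""
    row = best_header(question, row_headers, alpha)
    col = best_header(question, col_headers, alpha)
    return cells[row][col]


# Toy table (invented for illustration only).
row_headers = ["net sales", "operating income", "total assets"]
col_headers = ["fiscal year 2022", "fiscal year 2023"]
cells = [[100, 120], [10, 15], [500, 550]]
question = "What was the operating income in fiscal year 2023?"
print(answer_cell(question, row_headers, col_headers, cells))
```

On the toy table the sketch selects the cell under "operating income" and "fiscal year 2023". Blending a lexical score with a semantic one is a common way to keep exact term matches (years, account names) from being washed out by embedding similarity alone.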
Details
Motivation: Most large language models struggle to understand complex table structures, and feeding an entire table directly as input often leads to wrong answers, so a method is needed that parses tables effectively and answers questions accurately. Method: Compute the similarity between the question and each cell, estimate the table's row and column headers with a hybrid retrieval mechanism that combines a language model and TF-IDF, and select the cell at the intersection of the most relevant row and column as the answer; the language model is trained with contrastive learning on a small dataset of question-header pairs to improve performance. Result: The method reaches 74.6% accuracy on the TQA dataset from the U4 shared task at NTCIR-18, outperforming GPT-4o mini (63.9%). Conclusion: The method handles complex header structures effectively and improves table question answering accuracy without manual annotation; future work plans to incorporate more efficient text-search models to further improve performance. Abstract: Recently, Large Language Models (LLMs) are gaining increased attention in the domain of Table Question Answering (TQA), particularly for extracting information from tables in documents. However, directly entering entire tables as long text into LLMs often leads to incorrect answers because most LLMs cannot inherently capture complex table structures. In this paper, we propose a cell extraction method for TQA without manual identification, even for complex table headers. Our approach estimates table headers by computing similarities between a given question and individual cells via a hybrid retrieval mechanism that integrates a language model and TF-IDF. We then select as the answer the cells at the intersection of the most relevant row and column. Furthermore, the language model is trained using contrastive learning on a small dataset of question-header pairs to enhance performance. We evaluated our approach on the TQA dataset from the U4 shared task at NTCIR-18. The experimental results show that our pipeline achieves an accuracy of 74.6%, outperforming existing LLMs such as GPT-4o mini (63.9%). In the future, although we used traditional encoder models for retrieval in this study, we plan to incorporate more efficient text-search models to improve performance and narrow the gap with human evaluation results.
[36] Context is Enough: Empirical Validation of $\textit{Sequentiality}$ on Essays
Amal Sunny,Advay Gupta,Vishnu Sreekumar
Main category: cs.CL
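TL;DR: This paper validates that the context-based sequentiality measure is better than the topic-based version at assessing essay coherence and organization, and aligns more closely with human scores. Combined with standard linguistic features, it also outperforms zero-shot LLM predictions, showing the value of explicitly modeling sentence-to-sentence flow.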
Details
Motivation: The original sequentiality measure is confounded by how topics are selected and has not been validated against ground-truth measures of flow, so a more sound and interpretable alternative is needed. Method: Using ASAP++ and ELLIPSE, two essay datasets with human-annotated trait scores, compare context-based sequentiality with the original formulation and with zero-shot LLMs in predicting discourse-level traits such as Organization and Cohesion. Result: Context-based sequentiality correlates more strongly with human scores; although zero-shot LLMs predict more accurately than the contextual measure alone, adding contextual sequentiality to standard linguistic features substantially improves prediction and surpasses the zero-shot LLMs. Conclusion: Context-based sequentiality is a valid, interpretable, and complementary metric, supporting its use in automated essay scoring and related NLP tasks. Abstract: Recent work has proposed using Large Language Models (LLMs) to quantify narrative flow through a measure called sequentiality, which combines topic and contextual terms. A recent critique argued that the original results were confounded by how topics were selected for the topic-based component, and noted that the metric had not been validated against ground-truth measures of flow. That work proposed using only the contextual term as a more conceptually valid and interpretable alternative. In this paper, we empirically validate that proposal. Using two essay datasets with human-annotated trait scores, ASAP++ and ELLIPSE, we show that the contextual version of sequentiality aligns more closely with human assessments of discourse-level traits such as Organization and Cohesion. While zero-shot prompted LLMs predict trait scores more accurately than the contextual measure alone, the contextual measure adds more predictive value than both the topic-only and original sequentiality formulations when combined with standard linguistic features. Notably, this combination also outperforms the zero-shot LLM predictions, highlighting the value of explicitly modeling sentence-to-sentence flow. Our findings support the use of context-based sequentiality as a validated, interpretable, and complementary feature for automated essay scoring and related NLP tasks.
[37] The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
Francois Meyer,Jan Buys
Main category: cs.CL
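TL;DR: This paper studies how learnable subword segmentation changes during pretraining and finetuning. It extends the subword segmental language model (SSLM) framework, analyzes subword evolution in three typologically different languages, and finds that learnable subwords help text generation and cross-lingual transfer for low-resource, morphologically complex languages.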
Details
Motivation: Explore how subword segmentation is learned when a language model can optimize tokenization dynamically during training, understand how subword boundaries evolve across language types, and improve modeling of low-resource languages. Method: Extend the subword segmental language model (SSLM) to support pretraining and finetuning, run experiments on three languages representing different morphological types (isiXhosa, Setswana, and English), and analyze subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. Result: Four stages of subword learning are identified, with the morphologically complex isiXhosa showing greater instability; during finetuning, subword boundaries become finer-grained; learnable subwords improve text generation and cross-lingual transfer for low-resource, morphologically complex languages. Conclusion: Learnable subword segmentation follows systematic dynamics during training and finetuning, is especially beneficial for morphologically complex languages, and is a promising approach to low-resource language modeling. Abstract: Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isiXhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isiXhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.
[38] Pretraining Finnish ModernBERTs
Akseli Reunamo,Laura-Maria Peltonen,Hans Moen,Sampo Pyysalo
Main category: cs.CL
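TL;DR: This paper reports on pretraining ModernBERT encoder models in six sizes, with a focus on limited multilingualism that emphasizes languages relevant to Finland. The models outperform monolingual models on tasks requiring a context longer than 512 tokens and are competitive with, or superior to, existing multilingual models. Empirical results on using different data in the final training stage are also presented, and the code and models are publicly released.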
Details
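Motivation: Improve support for languages relevant to Finland and explore performance on long-context tasks. Method: Pretrain six ModernBERT models of different sizes (from 51M to 475M parameters), evaluate them on multilingual and long-context tasks, and test how different training data in the final stage affects performance. Result: The models match or exceed existing multilingual models and outperform monolingual models on tasks requiring a context longer than 512 tokens; empirical results on using different data in the final training stage are reported. Conclusion: ModernBERT models are competitive in a limited-multilingual setting, are particularly suited to Finland-relevant languages and long-text processing, and are openly released to the community. Abstract: This paper reports on pretraining ModernBERT encoder models in six different sizes, ranging from 51M to 475M parameters, with a focus on limited multilingualism, emphasizing languages relevant to Finland. Our models are competitive with, or superior to, existing multilingual models. They outperform monolingual models on tasks that require a context longer than 512 tokens. We present empirical results on using different data in the final stage of training. The code and models are publicly released.
[39] Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning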
Jiarui Liu,Kaustubh Dhole,Yingheng Wang,Haoyang Wen,Sarah Zhang,Haitao Mao,Gaotang Li,Neeraj Varshney,Jingguo Liu,Xiaoman Pan
Main category: cs.CL
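TL;DR: Proposes Anchor, a reinforcement learning method that injects ground-truth trajectories during training to stabilize learning of deductive reasoning in language models, with particular gains on honesty alignment tasks.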
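One way to see why injecting a ground-truth trajectory can help is to look at group-relative advantages of the kind GRPO uses, where each rollout's reward is normalized by the mean and standard deviation of its group. The sketch below is only an illustration under assumptions: the reward values (-1 for failure, +1 for success), the helper names, and the idea of simply appending the anchor's reward to the group are all hypothetical and are not taken from the paper.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def with_anchor(rewards, anchor_reward=1.0):
    """Append the reward of one injected ground-truth trajectory (assumed correct)."""
    return rewards + [anchor_reward]


# Early in training every sampled rollout may fail (assumed reward of -1 each).
failed_group = [-1.0, -1.0, -1.0, -1.0]

# All rewards are identical, so every advantage is zero and there is no learning signal.
print(group_advantages(failed_group))

# With one anchored trajectory, the group has variance and the anchor gets a positive advantage.
print(group_advantages(with_anchor(failed_group)))
```

In the first case the group carries no contrast between rollouts; in the second, the anchored trajectory is pushed up and the failed rollouts are pushed down, which is one plausible reading of how ground-truth injection could prevent early collapse.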
Details
Motivation: Existing reward-based reinforcement learning methods tend to collapse early in training when negative rewards dominate, especially in honesty alignment tasks that require judging whether a conclusion follows from the given premises, so methods that stabilize training and improve reasoning alignment are needed. Method: Construct two multi-step deductive reasoning datasets from graph structures (linear algebra and logical inference) that include unanswerable cases; propose Anchor, which injects ground-truth trajectories into rollouts to prevent early training collapse, and compare it experimentally with curriculum learning. Result: Experiments show that existing methods such as GRPO struggle on these tasks; curriculum learning helps somewhat but depends on carefully designed data distributions; Anchor significantly improves training stability and reasoning performance. Conclusion: Training dynamics are critical for reliable deductive reasoning in language models, and Anchor mitigates early training collapse by injecting ground-truth trajectories, offering an effective approach to honesty alignment. Abstract: Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a promising framework for aligning language models with complex reasoning objectives. However, most existing methods optimize only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. This challenge is especially pronounced in honesty alignment, where models must not only solve answerable queries but also identify when conclusions cannot be drawn from the given premises. Deductive reasoning provides an ideal testbed because it isolates reasoning capability from reliance on external factual knowledge. To investigate honesty alignment, we curate two multi-step deductive reasoning datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that GRPO, with or without supervised fine tuning initialization, struggles on these tasks. Through extensive experiments across three models, we evaluate stabilization strategies and show that curriculum learning provides some benefit but requires carefully designed in distribution datasets with controllable difficulty. To address these limitations, we propose Anchor, a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling reliable deductive reasoning in aligned language models.
[40] POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation
Xuanchen Li,Chenrui Cui,Tianrui Wang,Meng Ge,Zikang Huang,Jin Li,Yizhou Peng,Longbiao Wang,Jianwu Dang,Nyima Tashi
Main category: cs.CL
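TL;DR: Proposes POTSA, an optimal-transport-based framework that uses parallel speech pairs to align speech representations across languages, improving multilingual speech-to-text translation, especially for low-resource and zero-shot languages.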
Details
Motivation: Existing speech large language models overlook semantic commonalities across source languages in multilingual speech-to-text translation, leading to biased translation performance. Method: Introduce a Bias Compensation module for coarse alignment, impose token-level optimal transport constraints on a Q-Former using parallel speech pairs, and apply a layer scheduling strategy that focuses the constraints on the most semantically beneficial layers. Result: On the FLEURS dataset, using only 10 hours of parallel speech per language, the method gains 0.93 BLEU on average over five common languages and 5.05 BLEU on zero-shot languages. Conclusion: POTSA narrows the translation gap between high- and low-resource languages, achieving fine-grained cross-lingual alignment through optimal transport and significantly improving multilingual S2TT performance. Abstract: Speech Large Language Models (SpeechLLMs) have achieved breakthroughs in multilingual speech-to-text translation (S2TT). However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose POTSA (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport (OT), designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations across languages. Second, we impose token-level OT constraints on a Q-Former using parallel speech pairs to establish fine-grained consistency of representations. Then, we apply a layer scheduling strategy to focus OT constraints on the most semantically beneficial layers. Experiments on the FLEURS dataset show that our method achieves SOTA performance, with +0.93 BLEU on average over five common languages and +5.05 BLEU on zero-shot languages, using only 10 hours of parallel speech per source language.
[41] C$^3$TG: Conflict-aware, Composite, and Collaborative Controlled Text Generation
Yu Li,Zhe Yang,Yi Huang,Xin Liu,Guilin Qi
Main category: cs.CL
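TL;DR: Proposes the C$^3$TG framework for fine-grained, multi-dimensional control of text attributes without model modification, using a two-phase approach to resolve attribute conflicts and improve generation quality.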
Details
Motivation: Existing methods struggle to achieve precise multi-attribute control without modifying the model architecture or fine-tuning extensively, and they lack mechanisms for handling attribute conflicts and iterative optimization. Method: A two-phase framework: the generation phase adjusts token probabilities using weighted KL-divergence together with multiple attribute classifiers; the optimization phase uses an energy function that combines classifier scores and penalty terms to resolve attribute conflicts through iterative feedback. Result: Experiments show that C$^3$TG significantly outperforms baselines on attribute accuracy, linguistic fluency, and output diversity, while also reducing toxicity. Conclusion: C$^3$TG is an efficient and flexible solution for multi-dimensional text attribute control that achieves precise controllable generation without costly model modifications. Abstract: Recent advancements in large language models (LLMs) have demonstrated remarkable text generation capabilities. However, controlling specific attributes of generated text remains challenging without architectural modifications or extensive fine-tuning. Current methods typically toggle a single, basic attribute but struggle with precise multi-attribute control. In scenarios where attribute requirements conflict, existing methods lack coordination mechanisms, causing interference between desired attributes. Furthermore, these methods fail to incorporate iterative optimization processes in the controlled generation pipeline. To address these limitations, we propose Conflict-aware, Composite, and Collaborative Controlled Text Generation (C$^3$TG), a two-phase framework for fine-grained, multi-dimensional text attribute control. During generation, C$^3$TG selectively pairs the LLM with the required attribute classifiers from the 17 available dimensions and employs weighted KL-divergence to adjust token probabilities. The optimization phase then leverages an energy function combining classifier scores and penalty terms to resolve attribute conflicts through iterative feedback, enabling precise control over multiple dimensions simultaneously while preserving natural text flow. Experiments show that C$^3$TG significantly outperforms baselines across multiple metrics including attribute accuracy, linguistic fluency, and output diversity, while simultaneously reducing toxicity. These results establish C$^3$TG as an effective and flexible solution for multi-dimensional text attribute control that requires no costly model modifications.
[42] LiteraryTaste: A Preference Dataset for Creative Writing Personalization
John Joon Young Chung,Vishakh Padmakumar,Melissa Roemmele,Yi Wang,Yuqian Sun,Tiffany Wang,Shm Garanganao Almeda,Brett A. Halperin,Yuwen Lu,Max Kreminski
Main category: cs.CL
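TL;DR: This paper introduces LiteraryTaste, a dataset for research on personalizing creative writing LLMs, and finds that writing preferences differ markedly between individuals and that stated preferences have limited utility for modeling revealed preferences.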
Details
Motivation: Improve how well creative writing LLMs adapt to individual user preferences, addressing the problem that existing datasets treat varied personal tastes as a monolith. Method: Build LiteraryTaste, a dataset containing the reading habits of 60 people and their pairwise text preference annotations, model preferences with a transformer encoder, and analyze them with an LLM-driven interpretability pipeline. Result: Individuals differ markedly in their creative writing preferences; a finetuned transformer encoder reaches 75.8% and 67.7% accuracy when modeling personal and collective preferences, respectively; stated preferences predict revealed preferences only weakly. Conclusion: LiteraryTaste provides a foundation for personalized creative writing technologies and highlights the importance of modeling individual preferences. Abstract: People have different creative writing preferences, and large language models (LLMs) for these tasks can benefit from adapting to each user's preferences. However, these models are often trained over a dataset that considers varying personal tastes as a monolith. To facilitate developing personalized creative writing LLMs, we introduce LiteraryTaste, a dataset of reading preferences from 60 people, where each person: 1) self-reported their reading habits and tastes (stated preference), and 2) annotated their preferences over 100 pairs of short creative writing texts (revealed preference). With our dataset, we found that: 1) people diverge on creative writing preferences, 2) finetuning a transformer encoder could achieve 75.8% and 67.7% accuracy when modeling personal and collective revealed preferences, and 3) stated preferences had limited utility in modeling revealed preferences. With an LLM-driven interpretability pipeline, we analyzed how people's preferences vary. We hope our work serves as a cornerstone for personalizing creative writing technologies.
[43] Towards Explainable Khmer Polarity Classification
Marry Kong,Rina Buoy,Sovisal Chenda,Nguonly Taing
Main category: cs.CL
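TL;DR: This paper proposes an explainable Khmer polarity classifier built by fine-tuning an instruction-based reasoning Qwen-3 model, which predicts polarity labels accurately while generating self-explanations, and releases a new Khmer polarity dataset of casual, romanized, and mixed-code expressions.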
Details
Motivation: Existing Khmer polarity classification models typically cannot explain their predictions, and this lack of transparency and explainability limits their trustworthiness and usability in practice. Method: Fine-tune an instruction-based reasoning Qwen-3 model so that, when predicting polarity, it generates a rationale containing keywords or phrases, giving a self-explanation; also construct and release a new Khmer polarity dataset. Result: Experiments show that the fine-tuned model not only predicts labels accurately but also gives reasonable justifications by identifying polarity-related keywords or phrases; the new dataset has been publicly released. Conclusion: The approach improves the explainability and practicality of Khmer polarity classification, and the released models and dataset are a valuable resource for future research. Abstract: Khmer polarity classification is a fundamental natural language processing task that assigns a positive, negative, or neutral label to a given Khmer text input. Existing Khmer models typically predict the label without explaining the rationale behind the prediction. This paper proposes an explainable Khmer polarity classifier by fine-tuning an instruction-based reasoning Qwen-3 model. The notion of explainability in this paper is limited to self-explanations, which the model uses to rationalize its predictions. Experimental results show that the fine-tuned model not only predicts labels accurately but also provides reasoning by identifying polarity-related keywords or phrases to support its predictions. In addition, we contribute a new Khmer polarity dataset consisting of short- to medium-length casual, romanized, and mixed-code Khmer expressions. This dataset was constructed using both heuristic rules and human curation and is publicly available through a gated Hugging Face repository (rinabuoy/khmerpolarity_nonreasoning). The fine-tuned Qwen-3 models are also made available in the same Hugging Face account.
[44] AdaptDel: Adaptable Deletion Rate Randomized Smoothing for Certified Robustness
Zhuoqun Huang,Neil G. Marchant,Olga Ohrimenko,Benjamin I. P. Rubinstein
Main category: cs.CL
TL;DR: Proposes AdaptDel, which uses adaptable deletion rates to improve certified robustness for sequence classification.
Details
Motivation: Existing fixed-rate deletion methods perform poorly on variable-length inputs and struggle to certify robustness against edit distance perturbations. Method: Introduce AdaptDel, a randomized smoothing framework with adaptable deletion rates that adjust dynamically based on input properties. Result: On natural language tasks, the median cardinality of the certified region improves by up to 30 orders of magnitude over state-of-the-art certifications. Conclusion: Through variable deletion rates, AdaptDel strengthens certified robustness of sequence classifiers against edit distance perturbations. Abstract: We consider the problem of certified robustness for sequence classification against edit distance perturbations. Naturally occurring inputs of varying lengths (e.g., sentences in natural language processing tasks) present a challenge to current methods that employ fixed-rate deletion mechanisms and lead to suboptimal performance. To this end, we introduce AdaptDel methods with adaptable deletion rates that dynamically adjust based on input properties. We extend the theoretical framework of randomized smoothing to variable-rate deletion, ensuring sound certification with respect to edit distance. We achieve strong empirical results in natural language tasks, observing up to 30 orders of magnitude improvement to median cardinality of the certified region, over state-of-the-art certifications.
[45] mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models
Arka Mukherjee,Shreya Ghosh
Main category: cs.CL
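TL;DR: This paper presents mmJEE-Eval, a bilingual multimodal benchmark for evaluating the scientific reasoning of vision-language models, and reveals the limitations of current models under high reasoning load.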
Details
Motivation: Existing vision-language models perform well on multimodal reasoning benchmarks, but those benchmarks do not distinguish true scientific reasoning from pattern matching, so a more challenging benchmark is needed. Method: Build mmJEE-Eval, a benchmark of 1,460 bilingual (English and Hindi), multi-subject questions from India's JEE Advanced examination, and run a systematic evaluation and ablation analysis of 17 state-of-the-art models. Result: Frontier closed-source models reach 77-84% accuracy on held-out 2025 questions, while open-source models reach only 37-45%; when the reasoning load is increased meta-cognitively, GPT-5 fixes just 5.2% of its errors, indicating limited reasoning ability. Conclusion: mmJEE-Eval effectively separates stronger training and reasoning methodologies from weaker ones, highlights the shortcomings of current models in deep scientific reasoning, and underscores the need to improve genuine reasoning rather than rely on pattern matching. Abstract: Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU, MathVista). Yet, these results fail to sufficiently distinguish true scientific reasoning articulation capabilities from pattern-matching. To address this gap, we introduce mmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025) spanning pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. While closed frontier models from Google and OpenAI show high problem-solving accuracies (up to 100% pass@3 scores), they fully collapse when the reasoning load is increased meta-cognitively (GPT-5 fixes just 5.2% errors). Systematic ablations show mmJEE-Eval's difficulty stems from complexity and reasoning depth rather than memorization. Effectively, our benchmark segregates superior training and reasoning methodologies where alternatives fail. We publicly release our code and data: https://mmjee-eval.github.io
[46] Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling
Shiyu Ji,Yixuan Wang,Yijun Liu,Qingfu Zhu,Wanxiang Che
Main category: cs.CL
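TL;DR: This paper proposes SeerSC, a dynamic self-consistency framework that combines System 1 and System 2 reasoning, significantly reducing token consumption and latency while preserving the reasoning performance of large models.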
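The idea of estimating a sampling budget in advance from answer entropy can be sketched in a few lines. This is a minimal illustration, not the SeerSC algorithm: the linear mapping from entropy to budget, the function names, and the bounds `min_samples` and `max_samples` are assumptions made for the example.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical distribution over quick answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sample_budget(answers, min_samples=1, max_samples=16):
    """Hypothetical rule: scale the slow-reasoning sample budget linearly with normalized entropy."""
    max_entropy = math.log2(len(answers)) if len(answers) > 1 else 0.0
    ratio = answer_entropy(answers) / max_entropy if max_entropy > 0 else 0.0
    return min_samples + round(ratio * (max_samples - min_samples))


# Quick answers agree: low entropy, so a small budget is enough.
print(sample_budget(["42", "42", "42", "42"]))

# Quick answers disagree completely: high entropy, so the full budget is allocated.
print(sample_budget(["7", "12", "42", "9"]))
```

Because the budget is fixed before the expensive samples are drawn, all of them can be requested in parallel, which is where the latency saving over sequential early-stopping schemes would come from.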
Details
Motivation: Test-time scaling improves the reasoning performance of large language models but incurs high computational cost and latency, and existing methods remain limited in efficiency and latency. Method: Use the fast System 1 to compute the answer entropy of a query, use it to assess how much a sample would benefit from scaling, and on that basis run dynamic self-consistency under System 2 with parallel generation to reduce latency. Result: Compared with existing methods, SeerSC reduces token consumption by up to 47% and inference latency by up to 43% without significant performance loss. Conclusion: SeerSC balances performance, efficiency, and latency in test-time scaling, offering a more efficient dynamic self-consistency scheme for large model inference. Abstract: Test-time scaling improves the inference performance of Large Language Models (LLMs) but also incurs substantial computational costs. Although recent studies have reduced token consumption through dynamic self-consistency, they remain constrained by the high latency of sequential requests. In this paper, we propose SeerSC, a dynamic self-consistency framework that simultaneously improves token efficiency and latency by integrating System 1 and System 2 reasoning. Specifically, we utilize the rapid System 1 to compute the answer entropy for given queries. This score is then used to evaluate the potential of samples for scaling, enabling dynamic self-consistency under System 2. Benefiting from the advance and accurate estimation provided by System 1, the proposed method can reduce token usage while simultaneously achieving a significant decrease in latency through parallel generation. It outperforms existing methods, achieving up to a 47% reduction in token consumption and a 43% reduction in inference latency without significant performance loss.
[47] Spider4SSC & S2CLite: A text-to-multi-query-language dataset using lightweight ontology-agnostic SPARQL to Cypher parser
Martin Vejvar,Yasutaka Fujimoto
Main category: cs.CL
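TL;DR: This paper presents the Spider4SSC dataset and the S2CLite parsing tool. S2CLite is a lightweight, ontology-agnostic, rule-based SPARQL-to-Cypher translator that markedly improves query parsing and execution accuracy; both the tool and the dataset are open-sourced.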
Details
Motivation: Existing SPARQL-to-Cypher translation tools depend on an RDF graph or external components and have high error rates, so a more efficient, lightweight parser that does not depend on graph data is needed. Method: Design S2CLite, a purely rule-based parser inspired by traditional compilers that translates SPARQL to Cypher without an RDF graph or external tools, and use it to generate Spider4SSC, a unified multi-language Text-to-Query dataset. Result: In experiments on the BSBM42 and Spider4SPARQL datasets, S2CLite reaches 77.8% parsing accuracy on Spider4SPARQL (versus 44.2% for S2CTrans) and 96.6% execution accuracy on the subset of queries parsed by both tools, 7.3% higher than S2CTrans. The Spider4SSC dataset, containing equivalent SQL, SPARQL, and Cypher queries, was generated successfully. Conclusion: S2CLite is an efficient and accurate SPARQL-to-Cypher parser that does not depend on graph data, and the Spider4SSC dataset it produces is a valuable resource for multi-language text-to-query tasks, with good extensibility and application prospects. Abstract: We present the Spider4SSC dataset and the S2CLite parsing tool. S2CLite is a lightweight, ontology-agnostic parser that translates SPARQL queries into Cypher queries, enabling both in-situ and large-scale SPARQL to Cypher translation. Unlike existing solutions, S2CLite is purely rule-based (inspired by traditional programming language compilers) and operates without requiring an RDF graph or external tools. Experiments conducted on the BSBM42 and Spider4SPARQL datasets show that S2CLite significantly reduces query parsing errors, achieving a total parsing accuracy of 77.8% on Spider4SPARQL compared to 44.2% by the state-of-the-art S2CTrans. Furthermore, S2CLite achieved a 96.6% execution accuracy on the intersecting subset of queries parsed by both parsers, outperforming S2CTrans by 7.3%. We further use S2CLite to parse Spider4SPARQL queries to Cypher and generate Spider4SSC, a unified Text-to-Query language (SQL, SPARQL, Cypher) dataset with 4525 unique questions and 3 equivalent sets of 2581 matching queries (SQL, SPARQL and Cypher). We open-source S2CLite for further development on GitHub (github.com/vejvarm/S2CLite) and provide the clean Spider4SSC dataset for download.
[48] MTQ-Eval: Multilingual Text Quality Evaluation for Language Models
Rhitabrat Pokharel,Ameeta Agrawal
Main category: cs.CL
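TL;DR: This paper proposes MTQ-Eval, a multilingual text quality evaluation framework that improves the evaluation ability of large language models by learning from examples of high- and low-quality text, and validates it on 115 languages.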
Details
Motivation: Large language models perform well in task-specific evaluation, but it is unclear whether they can assess text quality in general across languages, so a more general and scalable multilingual evaluation method is needed. Method: Build the MTQ-Eval framework by first automatically generating text quality preference data and then using it to fine-tune open-source base LLMs so that they distinguish high- from low-quality text and adjust their internal representations. Result: Experiments on 115 languages show that MTQ-Eval significantly improves text quality evaluation, and this capability also yields improvements on downstream tasks. Conclusion: MTQ-Eval is an effective and scalable method for multilingual text quality evaluation, demonstrating the potential of example-based learning for cross-lingual quality assessment with large models. Abstract: The use of large language models (LLMs) for evaluating outputs is becoming an increasingly effective and scalable approach. However, it remains uncertain whether this capability extends beyond task-specific evaluations to more general assessments of text quality, particularly in multilingual contexts. In this study, we introduce MTQ-Eval, a novel framework for multilingual text quality evaluation that learns from examples of both high- and low-quality texts, adjusting its internal representations. To develop MTQ-Eval, we first automatically generate text quality preference data and then use it to train open-source base LLMs to align with ratings of high- and low-quality text. Our comprehensive evaluation across 115 languages demonstrates the improved performance of the proposed model. Upon further analysis, we find that this enhanced evaluation capability also leads to notable improvements in downstream tasks.
[49] Self-Correcting Large Language Models: Generation vs. Multiple Choice
Hossein A. Rahmani,Satyapriya Krishna,Xi Wang,Mohammadmehdi Naghiaei,Emine Yilmaz
Main category: cs.CL
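TL;DR: This paper systematically compares how self-correction in large language models behaves in open-ended generation versus multiple-choice selection, and finds that the interaction between task structure and output space affects how well errors are corrected.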
Details
Motivation: Explore how the self-correction mechanism of large language models differs across task paradigms (open-ended generation vs. multiple-choice selection) and derive design implications for decision-oriented agentic applications. Method: Run experiments on a range of natural language understanding and reasoning tasks with language models of different scales and families, analyzing performance trends and error-correction behavior under the two paradigms. Result: Open-ended generation benefits from re-interpretation and compositional refinement, whereas multiple-choice selection has clearer solution boundaries but is limited by the provided options; the two paradigms show distinct patterns of improvement and failure modes. Conclusion: The design of self-correction mechanisms should account for the interaction between task structure and output space, with implications for knowledge-intensive reasoning and decision-oriented LLM applications. Abstract: Large language models have recently demonstrated remarkable abilities to self-correct their responses through iterative refinement, often referred to as self-consistency or self-reflection. However, the dynamics of this self-correction mechanism may differ substantially depending on whether the model is tasked with open-ended text generation or with selecting the most appropriate response from multiple predefined options. In this paper, we conduct a systematic investigation of these two paradigms by comparing performance trends and error-correction behaviors across various natural language understanding and reasoning tasks, covering language models of different scales and families. Our experimental results reveal distinct patterns of improvement and failure modes: while open-ended generation often benefits from the flexibility of re-interpretation and compositional refinement, multiple-choice selection can leverage clearer solution boundaries but may be limited by the provided options. This contrast also reflects the dual demands faced by emerging agentic LLM applications: effective agents must not only generate and refine open-ended plans or explanations, but also make reliable discrete choices when operating within constrained action spaces. Our findings, therefore, highlight that the design of self-correction mechanisms should take into account the interaction between task structure and output space, with implications for both knowledge-intensive reasoning and decision-oriented applications of LLMs.
[50] AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment
Ruibo Deng,Duanyu Feng,Wenqiang Lei
Main category: cs.CL
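TL;DR: This paper proposes AMaPO, a new offline preference optimization algorithm that uses an instance-wise adaptive margin to resolve the overfitting-underfitting dilemma of existing methods, improving ranking accuracy and alignment performance.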
Details
Motivation: Existing offline preference optimization methods are limited by ranking accuracy; the root cause is that current margin designs overfit correctly ranked samples and underfit misranked ones, so a more dynamic way of allocating gradients is needed. Method: Propose AMaPO, which uses an instance-wise adaptive margin refined by Z-normalization and exponential scaling to reallocate learning effort dynamically, amplifying the corrective signal for misranked samples and suppressing redundant updates for correctly ranked ones. Result: Experiments on several widely used benchmarks show that AMaPO outperforms existing methods in both ranking accuracy and downstream alignment performance, and analysis confirms that it mitigates the overfitting and underfitting issues. Conclusion: Through its adaptive margin mechanism, AMaPO addresses the imbalance in gradient allocation in preference optimization, offering a simple and effective approach to offline alignment training. Abstract: Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. However, its effectiveness is critically dependent on ranking accuracy, a metric where further gains are highly impactful. This limitation arises from a fundamental problem that we identify and formalize as the Overfitting-Underfitting Dilemma: current margin designs cause models to apply excessive, wasteful gradients to correctly ranked samples (overfitting) while providing insufficient corrective signals for misranked ones (underfitting). To resolve this dilemma, we propose Adaptive Margin-attached Preference Optimization (AMaPO), a simple yet principled algorithm. AMaPO employs an instance-wise adaptive margin, refined by Z-normalization and exponential scaling, which dynamically reallocates learning effort by amplifying gradients for misranked samples and suppressing them for correct ones. Extensive experiments on widely used benchmarks demonstrate that AMaPO not only achieves better ranking accuracy and superior downstream alignment performance, but targeted analysis also confirms that it successfully mitigates the core overfitting and underfitting issues.
[51] Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque
Lukas Arana,Julen Etxaniz,Ander Salaberria,Gorka Azkune
Main category: cs.CL
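TL;DR: This paper aims to develop a strong multimodal large language model (MLLM) for a low-resource language, Basque. It builds training and evaluation datasets, explores different data mixtures, and finds that about 20% Basque multimodal data is already enough for solid performance and that a Basque instruction-tuned LLM backbone is not required.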
Details
Motivation: The open science community lags behind in multimodal large language models for low-resource languages; commercial MLLMs deliver acceptable performance but are not open, so it is important to study how to build effective MLLMs for such languages. Method: Build image-text training and evaluation datasets for Basque, use two LLMs as backbones (Llama-3.1-Instruct and the Basque-adapted variant Latxa), train with several data mixtures, and evaluate on Basque benchmarks. Result: Experiments show that (1) about 20% Basque multimodal data is already enough to obtain solid results, and (2) a Basque instruction-tuned backbone LLM is not required to build a strong Basque MLLM. Conclusion: The study offers an effective path for developing MLLMs for low-resource languages and supports open science by openly releasing its resources. Abstract: Current Multimodal Large Language Models exhibit very strong performance for several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque instructed backbone LLM is not required to obtain a strong MLLM in Basque. Our results pave the way to develop MLLMs for other low-resource languages by openly releasing our resources.
[52] CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling
Bichen Wang,Yixin Sun,Junzhe Wang,Hao Yang,Xing Fu,Yanyan Zhao,Si Wei,Shijin Wang,Bing Qin
Main category: cs.CL
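TL;DR: This paper presents CARE-Bench, a dynamic, interactive automated benchmark built on real cases and expert guidance for comprehensively evaluating large language models in psychological counseling.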
Details
Motivation: The mismatch between growing demand for psychological counseling and limited service resources has led researchers to explore LLMs in this domain, but existing evaluation methods suffer from insufficient professionalism, static evaluation formats, and single-dimensional metrics. Method: The authors construct client profiles based on diverse real counseling cases and simulated according to expert guidelines, design CARE-Bench as a multidimensional, dynamically interactive benchmark, and evaluate several general-purpose and specialized LLMs using psychological scales. Result: Evaluation with CARE-Bench shows that existing models have clear limitations when dealing with different types of clients; an analysis conducted with psychologists identifies the reasons for failure. Conclusion: CARE-Bench can effectively evaluate the counseling ability of LLMs, and the findings point to directions for developing more comprehensive, universal, and effective counseling models. Abstract: The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model's comprehensive ability to handle diverse and complex clients. To address this gap, we introduce CARE-Bench, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs' failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models.
[53] GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning
Wolfgang Otto,Lu Gan,Sharmila Upadhyaya,Saurav Karmakar,Stefan Dietze
Main category: cs.CL
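TL;DR: This paper introduces GSAP-ERE, a fine-grained annotated dataset for information extraction from machine learning research, with 10 entity types and 18 relation types drawn from the full text of 100 ML papers. It supports knowledge graph construction and monitoring the reproducibility of AI research, and is used to evaluate large language models on information extraction.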
Details
Motivation: Improving the understanding and reproducibility of machine learning research requires extracting fine-grained information (such as method training and data usage) from large numbers of scientific publications, but existing automatic information extraction methods lack high-quality annotated data. Method: The authors build GSAP-ERE, a manually annotated fine-grained dataset with 63K entities and 35K relations covering 10 entity types and 18 semantic relation types; they train fine-grained information extraction models on it and use it as a test suite to evaluate LLMs on named entity recognition (NER) and relation extraction (RE). Result: The fine-tuned model reaches 80.6% on NER and 54.0% on RE, far above state-of-the-art LLM prompting methods (NER: 44.4%, RE: 10.1%). Conclusion: High-quality manually annotated datasets such as GSAP-ERE are essential for advancing scholarly information extraction. LLMs under few-shot or zero-shot prompting still fall well short of supervised fine-tuned models, which suggests that more domain-specific data is needed to improve automated extraction. Abstract: Research in Machine Learning (ML) and AI evolves rapidly. Information Extraction (IE) from scientific publications makes it possible to identify information about research concepts and resources on a large scale and is therefore a pathway to improve understanding and reproducibility of ML-related research. To extract and connect fine-grained information in ML-related research, e.g. method training and data usage, we introduce GSAP-ERE. It is a manually curated fine-grained dataset with 10 entity types and 18 semantically categorized relation types, containing mentions of 63K entities and 35K relations from the full text of 100 ML publications. We show that our dataset enables fine-tuned models to automatically extract information relevant for downstream tasks ranging from knowledge graph (KG) construction to monitoring the computational reproducibility of AI research at scale. Additionally, we use our dataset as a test suite to explore prompting strategies for IE using Large Language Models (LLM). We observe that the performance of state-of-the-art LLM prompting methods is largely outperformed by our best fine-tuned baseline model (NER: 80.6%, RE: 54.0% for the fine-tuned model vs. NER: 44.4%, RE: 10.1% for the LLM). This disparity of performance between supervised models and unsupervised usage of LLMs suggests datasets like GSAP-ERE are needed to advance research in the domain of scholarly information extraction.
[54] BIG5-TPoT: Predicting BIG Five Personality Traits, Facets, and Items Through Targeted Preselection of Texts
Triet M. Le,Arjun Chandra,C. Anton Rytting,Valerie P. Karuzis,Vladimir Rife,William A. Simpson
Main category: cs.CL
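TL;DR: Proposes a new strategy called TPoT that semantically filters texts relevant to personality traits, improving the accuracy and efficiency of large language models when predicting Big Five personality traits.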
Details
Motivation: Predicting an individual's personality from large amounts of generated text is challenging, especially when the text volume is too large for existing methods to handle effectively. Method: A semantic filtering method (TPoT) selects content that is semantically relevant to a specific personality trait, facet, or item as input to a deep learning model, yielding the BIG5-TPoT model. Result: On the Stream of Consciousness Essays dataset, the method eases the input-length limits of large language models, lowers the mean absolute error of predictions, and improves accuracy. Conclusion: By preselecting semantically relevant text in a targeted way, TPoT significantly improves personality prediction and is a simple yet effective text selection method. Abstract: Predicting an individual's personality from their generated texts is a challenging task, especially when the text volume is large. In this paper, we introduce a straightforward yet effective novel strategy called targeted preselection of texts (TPoT). This method semantically filters the texts as input to a deep learning model, specifically designed to predict a Big Five personality trait, facet, or item, referred to as the BIG5-TPoT model. By selecting texts that are semantically relevant to a particular trait, facet, or item, this strategy not only addresses the issue of input text limits in large language models but also improves the Mean Absolute Error and accuracy metrics in predictions for the Stream of Consciousness Essays dataset.
[55] Readability Measures and Automatic Text Simplification: In the Search of a Construct
Rémi Cardon,A. Seza Doğruöz
Main category: cs.CL
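TL;DR: This paper examines the role of readability measures in English automatic text simplification (ATS). It finds that readability measures generally correlate poorly with automatic evaluation metrics and human judgment, indicating that the evaluation construct in ATS needs a clear definition.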
Details
Motivation: Existing studies focus on the correlation between automatic evaluation metrics and human judgment but rarely examine how these relate to common readability measures; this paper aims to fill that gap. Method: The authors analyze correlations between readability measures and human judgment, and between readability measures and ATS automatic evaluation metrics, and discuss them in light of existing research. Result: Overall, readability measures correlate weakly with automatic evaluation metrics and human judgment. Conclusion: Because the three angles for assessing simplification (readability measures, automatic metrics, human judgment) correlate poorly with one another, the evaluation construct in ATS needs a clear definition. Abstract: Readability is a key concept in the current era of abundant written information. To help make texts more readable and make information more accessible to everyone, a line of research aims at making texts accessible for their target audience: automatic text simplification (ATS). Lately, there have been studies on the correlations between automatic evaluation metrics in ATS and human judgment. However, the correlations between those two aspects and commonly available readability measures (such as readability formulas or linguistic features) have not been the focus of as much attention. In this work, we investigate the place of readability measures in ATS by complementing the existing studies on evaluation metrics and human judgment, on English. We first discuss the relationship between ATS and research in readability, then we report a study on correlations between readability measures and human judgment, and between readability measures and ATS evaluation metrics. We identify that in general, readability measures do not correlate well with automatic metrics and human judgment. We argue that as the three different angles from which simplification can be assessed tend to exhibit rather low correlations with one another, there is a need for a clear definition of the construct in ATS.
[56] SynClaimEval: A Framework for Evaluating the Utility of Synthetic Data in Long-Context Claim Verification
Mohamed Elaraby,Jyoti Prakash Maheswari
Main category: cs.CL
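TL;DR: This paper proposes SynClaimEval, a framework for evaluating the utility of synthetic data in long-context claim verification. Experiments show that synthetic data can improve verification performance and the quality of model explanations.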
Details
Motivation: Building annotated resources is costly, while large language models need long-context reasoning ability, so a scalable synthetic data approach is needed to train and evaluate models effectively. Method: SynClaimEval evaluates synthetic data for long-context claim verification along three dimensions: input characteristics, synthesis logic, and explanation quality. Result: Experiments show that synthetic data can improve the verification performance of base instruction-tuned models, especially when augmenting human-annotated datasets; it also improves the consistency of model explanations even when verification accuracy does not improve. Conclusion: Synthetic data helps improve long-context claim verification and also strengthens model explainability, with potential for broad use in fact-checking and hallucination detection. Abstract: Large Language Models (LLMs) with extended context windows promise direct reasoning over long documents, reducing the need for chunking or retrieval. Constructing annotated resources for training and evaluation, however, remains costly. Synthetic data offers a scalable alternative, and we introduce SynClaimEval, a framework for evaluating synthetic data utility in long-context claim verification -- a task central to hallucination detection and fact-checking. Our framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error type variation; and (iii) explanation quality, measuring the degree to which model explanations provide evidence consistent with predictions. Experiments across benchmarks show that long-context synthesis can improve verification in base instruction-tuned models, particularly when augmenting existing human-written datasets. Moreover, synthesis enhances explanation quality, even when verification scores do not improve, underscoring its potential to strengthen both performance and explainability.
cs.CV [Back]
[57] Case Study: Transformer-Based Solution for the Automatic Digitization of Gas Plants
I. Bailo,F. Buonora,G. Ciarfaglia,L. T. Consoli,A. Evangelista,M. Gabusi,M. Ghiani,C. Petracca Ciavarella,F. Picariello,F. Sarcina,F. Tuosto,V. Zullo,L. Airoldi,G. Bruno,D. D. Gobbo,S. Pezzenati,G. A. Tona
Main category: cs.CV
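TL;DR: This paper presents a generative AI solution for automating the digitization of gas plant structures for the Italian energy company SNAM. Combining OCR, vision LLMs, object detection, and relational reasoning, it achieves 91% accuracy in extracting textual information and 93% accuracy in component identification.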
Details
Motivation: To address the challenge of digitizing infrastructure during the energy transition, overcome the difficulty of extracting information from non-standardized documents, and make gas plant digitization more efficient. Method: The solution uses OCR, a vision large language model (Vision LLM), object detection, relational reasoning, and optimization algorithms. It takes P&ID drawings in PDF format as input and outputs a structured overview of the design data and the equipment hierarchy; an improved Scene Graph Generation Transformer architecture is proposed to deepen the analysis of complex relations between components. Result: Text extraction of design data reaches 91% accuracy, 93% of equipment components are correctly identified, and the hierarchical structure is extracted with about 80% accuracy, overcoming obstacles caused by data variety. Conclusion: The combined AI approach can efficiently automate gas plant digitization, substantially reduce manual work, and has potential for wide adoption in energy infrastructure. Abstract: The energy transition is a key theme of the last decades to determine a future of eco-sustainability, and an area of such importance cannot disregard digitization, innovation and the new technological tools available. This is the context in which the Generative Artificial Intelligence models described in this paper are positioned, developed by Engineering Ingegneria Informatica SpA in order to automate the plant structures acquisition of SNAM energy infrastructure, a leading gas transportation company in Italy and Europe. The digitization of a gas plant consists in registering all its relevant information through the interpretation of the related documentation. The aim of this work is therefore to design an effective solution based on Artificial Intelligence techniques to automate the extraction of the information necessary for the digitization of a plant, in order to streamline the daily work of MGM users. The solution receives the P&IDs of the plant as input, each one in PDF format, and uses OCR, Vision LLM, Object Detection, Relational Reasoning and optimization algorithms to return an output consisting of two sets of information: a structured overview of the relevant design data and the hierarchical framework of the plant. To achieve convincing results, we extend a state-of-the-art model for Scene Graph Generation, introducing a brand new Transformer architecture with the aim of deepening the analysis of the complex relations between the plant's components.
The synergistic use of the listed AI-based technologies made it possible to overcome many obstacles arising from the high variety of data, due to the lack of standardization. An accuracy of 91% has been achieved in the extraction of textual information relating to design data. Regarding the plant topology, 93% of components are correctly identified and the hierarchical structure is extracted with an accuracy around 80%.
[58] Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework
Dogucan Yaman,Fevziye Irem Eyiokur,Hazım Kemal Ekenel,Alexander Waibel
Main category: cs.CV
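TL;DR: Proposes a model-agnostic, systematic evaluation methodology for analyzing and quantifying lip leakage in audio-driven talking face generation, including three test setups and new evaluation metrics.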
Details
Motivation: Existing methods that use a reference image to keep speaker identity consistent are prone to lip leakage, which is hard to detect with conventional methods. Method: Three complementary test setups are designed (silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis), and new metrics are introduced, including lip-sync discrepancy and silent-audio-based lip-sync scores. Result: The framework identifies and quantifies lip leakage across different models and reveals how the choice of identity reference affects the degree of leakage. Conclusion: The proposed methodology provides a more reliable benchmark for talking face generation and supports future evaluation of model robustness and realism. Abstract: Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leaking, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setups. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.
[59] A Multi-Drone Multi-View Dataset and Deep Learning Framework for Pedestrian Detection and Tracking
Kosta Dakic,Kanchana Thilakarathna,Rodrigo N. Calheiros,Teng Joon Lim
Main category: cs.CV
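TL;DR: This paper presents MATRIX, a multi-view pedestrian tracking dataset captured synchronously by eight drones in a complex urban environment, together with a deep learning framework for dynamic multi-view detection and tracking. Using real-time camera calibration, feature matching, and bird's-eye-view feature fusion, the method achieves about 90% detection and tracking accuracy under occlusion and moving camera positions, and shows good robustness and generalization.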
Details
Motivation: Existing pedestrian tracking systems perform well with static cameras or limited drone coverage, but degrade with dynamic camera positions and complex occlusions. Datasets and algorithms closer to real, complex environments are needed to advance multi-drone surveillance. Method: A deep learning framework combining real-time camera calibration, feature-based image registration, and multi-view feature fusion in bird's-eye view (BEV) is proposed and validated on MATRIX, a dataset built from synchronized video captured by eight drones. Result: In the simplified environment, static-camera methods perform well (over 90% precision) but degrade significantly in the complex environment; the proposed method maintains about 90% detection and tracking accuracy there, successfully tracks about 80% of trajectories, and shows good transfer learning ability and fault tolerance (camera dropout experiments show graceful degradation). Conclusion: The MATRIX dataset and the proposed framework provide an important benchmark for dynamic multi-view surveillance, demonstrate effectiveness and robustness in complex real-world scenes, and help advance cooperative multi-drone surveillance. Abstract: Multi-drone surveillance systems offer enhanced coverage and robustness for pedestrian tracking, yet existing approaches struggle with dynamic camera positions and complex occlusions. This paper introduces MATRIX (Multi-Aerial TRacking In compleX environments), a comprehensive dataset featuring synchronized footage from eight drones with continuously changing positions, and a novel deep learning framework for multi-view detection and tracking. Unlike existing datasets that rely on static cameras or limited drone coverage, MATRIX provides a challenging scenario with 40 pedestrians and a significant architectural obstruction in an urban environment. Our framework addresses the unique challenges of dynamic drone-based surveillance through real-time camera calibration, feature-based image registration, and multi-view feature fusion in bird's-eye-view (BEV) representation. Experimental results demonstrate that while static camera methods maintain over 90% detection and tracking precision and accuracy metrics in a simplified MATRIX environment without an obstruction, 10 pedestrians and a much smaller observational area, their performance significantly degrades in the complex environment. Our proposed approach maintains robust performance with about 90% detection and tracking accuracy, and successfully tracks about 80% of trajectories under challenging conditions. Transfer learning experiments reveal strong generalization capabilities, with the pretrained model achieving much higher detection and tracking accuracy performance compared to training the model from scratch.
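To make the idea of fusing several moving views in a shared bird's-eye-view frame more concrete, here is a minimal NumPy sketch. It is not the MATRIX framework: the paper fuses learned features, whereas this stand-in only votes detected foot points into a ground-plane grid. The function names, the grid layout, and the assumption that each view already has a 3x3 image-to-ground homography (for example from the calibration and registration steps) are all hypothetical.

```python
import numpy as np

def to_ground_plane(points_px, homography):
    """Map Nx2 pixel coordinates to ground-plane coordinates with a 3x3 homography."""
    pts = np.hstack([points_px, np.ones((len(points_px), 1))])  # homogeneous coords
    mapped = pts @ homography.T
    return mapped[:, :2] / mapped[:, 2:3]  # divide by the projective scale

def fuse_bev(detections_per_view, homographies, grid_shape, cell_size):
    """Accumulate per-view foot-point detections into one shared BEV vote grid.

    Hypothetical detection-level fusion: each view casts one vote per detection
    into the ground cell it projects to; cells seen by several views score higher.
    """
    grid = np.zeros(grid_shape, dtype=int)
    for points_px, h in zip(detections_per_view, homographies):
        pts = np.asarray(points_px, dtype=float).reshape(-1, 2)
        ground = to_ground_plane(pts, h)
        cells = np.floor(ground / cell_size).astype(int)
        for cx, cy in cells:
            # Ignore projections that fall outside the monitored area.
            if 0 <= cy < grid_shape[0] and 0 <= cx < grid_shape[1]:
                grid[cy, cx] += 1
    return grid
```

With moving drones, the homographies would have to be re-estimated per frame; a dropped camera simply contributes no votes, which is one intuition for why multi-view fusion can degrade gracefully.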
Additionally, systematic camera dropout experiments reveal graceful performance degradation, demonstrating practical robustness for real-world deployments where camera failures may occur. The MATRIX dataset and framework provide essential benchmarks for advancing dynamic multi-view surveillance systems.
[60] Learning Topology-Driven Multi-Subspace Fusion for Grassmannian Deep Network
Xuan Yu,Tianyang Xu
Main category: cs.CV
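TL;DR: Proposes a topology-driven multi-subspace fusion framework that adaptively selects and weights task-relevant subspaces and fuses them through Fréchet mean optimization on the manifold, achieving state-of-the-art performance on 3D action recognition, EEG classification, and graph tasks.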
Details
Motivation: Existing methods mostly rely on static single-subspace representations and struggle to capture the dynamic interplay among multiple subspaces in complex geometric structures. Method: Building on the Kolmogorov-Arnold representation theorem, an adaptive multi-subspace modelling mechanism dynamically selects and weights subspaces via topological convergence analysis; a multi-subspace interaction block fuses heterogeneous geometric representations through Fréchet mean optimization on the Grassmann manifold, with Riemannian batch normalization and mutual information regularization added. Result: The method is validated on datasets such as HDM05 and FPHA, outperforming existing methods on 3D action recognition, EEG classification, and graph tasks, with good convergence and optimization stability. Conclusion: The work extends the mature multi-channel interaction idea of Euclidean networks to non-Euclidean domains, advancing geometric deep learning and improving discriminability and interpretability. Abstract: The Grassmannian manifold offers a powerful carrier for geometric representation learning by modelling high-dimensional data as low-dimensional subspaces. However, existing approaches predominantly rely on static single-subspace representations, neglecting the dynamic interplay between multiple subspaces critical for capturing complex geometric structures. To address this limitation, we propose a topology-driven multi-subspace fusion framework that enables adaptive subspace collaboration on the Grassmannian. Our solution introduces two key innovations: (1) Inspired by the Kolmogorov-Arnold representation theorem, an adaptive multi-subspace modelling mechanism is proposed that dynamically selects and weights task-relevant subspaces via topological convergence analysis, and (2) a multi-subspace interaction block that fuses heterogeneous geometric representations through Fréchet mean optimisation on the manifold. Theoretically, we establish the convergence guarantees of adaptive subspaces under a projection metric topology, ensuring stable gradient-based optimisation. Practically, we integrate Riemannian batch normalisation and mutual information regularisation to enhance discriminability and robustness. Extensive experiments on 3D action recognition (HDM05, FPHA), EEG classification (MAMEM-SSVEPII), and graph tasks demonstrate state-of-the-art performance.
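As a rough illustration of a Fréchet mean under the projection metric, the NumPy sketch below may help. It is not the authors' interaction block: the function names are hypothetical, and the `weights` argument merely stands in for the adaptive subspace weighting, whose actual form the abstract does not give. It relies on a standard fact: minimizing the weighted sum of squared projection distances to subspaces with orthonormal bases Y_i is equivalent to maximizing tr(X^T (sum_i w_i Y_i Y_i^T) X), so the minimizer is spanned by the top-k eigenvectors of the averaged projection matrices.

```python
import numpy as np

def orthonormal_basis(a):
    """Return an orthonormal basis (n x k) for the column space of a."""
    q, _ = np.linalg.qr(a)
    return q

def projection_distance(x, y):
    """Projection-metric distance between subspaces with orthonormal bases x, y."""
    # d(X, Y) = ||X X^T - Y Y^T||_F / sqrt(2)
    return np.linalg.norm(x @ x.T - y @ y.T, ord="fro") / np.sqrt(2.0)

def projection_frechet_mean(bases, weights=None):
    """Weighted mean subspace under the projection metric (closed form)."""
    k = bases[0].shape[1]
    if weights is None:
        weights = np.ones(len(bases))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Average the projection matrices, then keep the k dominant eigenvectors.
    avg_proj = sum(w * (b @ b.T) for w, b in zip(weights, bases))
    _, eigvecs = np.linalg.eigh(avg_proj)  # eigenvalues in ascending order
    return eigvecs[:, -k:]
```

Because this mean has a closed form, it needs no iterative optimization; a mean under the geodesic distance would instead require an iterative scheme on the manifold.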
Our work not only advances geometric deep learning but also successfully adapts the proven multi-channel interaction philosophy of Euclidean networks to non-Euclidean domains, achieving superior discriminability and interpretability.
[61] Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
Assaf Singer,Noam Rotstein,Amir Mann,Ron Kimmel,Or Litany
Main category: cs.CV
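TL;DR: This paper proposes Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled generation from crude reference animations in image-to-video diffusion models.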
Details
Motivation: Existing diffusion-based video generation methods lack precise motion control and depend on model-specific fine-tuning, which is computationally expensive and restrictive. A general, efficient method for fine-grained motion and appearance control is needed. Method: Crude reference animations produced by user-friendly manipulations (such as cut-and-drag or depth-based reprojection) serve as motion cues. Borrowing from SDEdit, the method introduces a dual-clock denoising strategy that keeps strong alignment in motion-specified regions while leaving generative freedom elsewhere, balancing user intent with natural dynamics. Result: On object and camera motion benchmarks, TTM matches or exceeds training-based baselines in realism and motion control, and supports pixel-level appearance control beyond the limits of text-only prompting. Conclusion: TTM is a lightweight, training-free video generation framework compatible with any backbone; it achieves precise motion and appearance control and extends what diffusion models can do in controllable video generation. Abstract: Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit's use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting.
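One way to picture a region-dependent noising schedule is the SDEdit-style sketch below, written with NumPy. It is only an illustration under stated assumptions, not TTM's sampler: the abstract does not say how the two clocks are implemented, so the linear DDPM-style beta schedule, the two timesteps `t_motion` and `t_free`, and both function names are hypothetical. The idea shown is that a crude reference frame is noised less inside the motion mask (so denoising stays close to it) and more outside (so the model has more freedom).

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal fraction for a linear beta schedule (DDPM-style)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def region_dependent_noising(reference, motion_mask, alpha_bar, t_motion, t_free, rng):
    """Noise a reference frame to different levels inside and outside the mask.

    Assumed convention: a lower timestep keeps more of the reference signal,
    so t_motion < t_free gives stronger alignment in motion-specified regions.
    """
    # Per-pixel timestep map chosen by the boolean mask.
    t_map = np.where(motion_mask, t_motion, t_free)
    a = alpha_bar[t_map]
    noise = rng.standard_normal(reference.shape)
    # Forward diffusion: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps
    noisy = np.sqrt(a) * reference + np.sqrt(1.0 - a) * noise
    return noisy, t_map
```

A full sampler would then denoise from these per-region starting points with the pretrained image-to-video model; that part depends on the backbone and is omitted here.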
Visit our project page for video examples and code: https://time-to-move.github.io/.
[62] CADIC: Continual Anomaly Detection Based on Incremental Coreset
Gen Yang,Zhipeng Deng,Junfeng Man
Main category: cs.CV
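TL;DR: Proposes a new continual anomaly detection (CAD) framework that shares a unified memory bank and incrementally updates embeddings within a fixed-size coreset, removing the flexibility and scalability limits of existing methods and achieving state-of-the-art detection accuracy.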
Details
Motivation: Existing embedding-based CAD methods must build class-specific sub-memory banks for each task, which limits flexibility and scalability and makes them prone to catastrophic forgetting. Method: All tasks share one unified memory bank; during training, embeddings are updated incrementally within a fixed-size coreset; at inference, anomaly scores are computed with a nearest-neighbor matching mechanism. Result: Experiments on the MVTec AD and Visa datasets yield average image-level AUROC scores of 0.972 and 0.891, respectively; on a real-world electronic paper dataset the method reaches 100% anomaly detection accuracy. Conclusion: The method shows strong performance and practical robustness in continual anomaly detection, with good scalability and flexibility; the code will be open-sourced. Abstract: The primary objective of Continual Anomaly Detection (CAD) is to learn the normal patterns of new tasks under dynamic data distribution assumptions while mitigating catastrophic forgetting. Existing embedding-based CAD approaches continuously update a memory bank with new embeddings to adapt to sequential tasks. However, these methods require constructing class-specific sub-memory banks for each task, which restricts their flexibility and scalability. To address this limitation, we propose a novel CAD framework where all tasks share a unified memory bank. During training, the method incrementally updates embeddings within a fixed-size coreset, enabling continuous knowledge acquisition from sequential tasks without task-specific memory fragmentation. In the inference phase, anomaly scores are computed via a nearest-neighbor matching mechanism, achieving state-of-the-art detection accuracy. We validate the method through comprehensive experiments on MVTec AD and Visa datasets. Results show that our approach outperforms existing baselines, achieving average image-level AUROC scores of 0.972 (MVTec AD) and 0.891 (Visa). Notably, on a real-world electronic paper dataset, it demonstrates 100% accuracy in anomaly sample detection, confirming its robustness in practical scenarios. The implementation will be open-sourced on GitHub.
[63] Predict and Resist: Long-Term Accident Anticipation under Sensor Noise
Xingcheng Liu,Bin Rao,Yanchen Guan,Chengyue Wang,Haicheng Liao,Jiaxun Zhang,Chengyu Lin,Meixin Zhu,Zhenning Li
Main category: cs.CV
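TL;DR: Proposes a unified framework combining diffusion-based denoising with a time-aware actor-critic model to improve the robustness and timeliness of accident anticipation in autonomous driving.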
Details
Motivation: Accident anticipation in autonomous driving must cope with noisy sensor inputs and balance prediction timeliness against reliability. Method: A diffusion model denoises and reconstructs image and object features, combined with an actor-critic model that uses long-horizon temporal reasoning and time-weighted rewards to optimize alert timing. Result: The framework achieves state-of-the-art accuracy on three datasets (DAD, CCD, and A3D), significantly improves mean time-to-accident, and stays robust under Gaussian and impulse noise. Conclusion: The framework delivers earlier, more stable, and more human-aligned accident warnings, with potential for real safety-critical applications. Abstract: Accident anticipation is essential for proactive and safe autonomous driving, where even a brief advance warning can enable critical evasive actions. However, two key challenges hinder real-world deployment: (1) noisy or degraded sensory inputs from weather, motion blur, or hardware limitations, and (2) the need to issue timely yet reliable predictions that balance early alerts with false-alarm suppression. We propose a unified framework that integrates diffusion-based denoising with a time-aware actor-critic model to address these challenges. The diffusion module reconstructs noise-resilient image and object features through iterative refinement, preserving critical motion and interaction cues under sensor degradation. In parallel, the actor-critic architecture leverages long-horizon temporal reasoning and time-weighted rewards to determine the optimal moment to raise an alert, aligning early detection with reliability. Experiments on three benchmark datasets (DAD, CCD, A3D) demonstrate state-of-the-art accuracy and significant gains in mean time-to-accident, while maintaining robust performance under Gaussian and impulse noise. Qualitative analyses further show that our model produces earlier, more stable, and human-aligned predictions in both routine and highly complex traffic scenarios, highlighting its potential for real-world, safety-critical deployment.
[64] RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation
Hae-Won Jo,Yeong-Jun Cho
Main category: cs.CV
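TL;DR: Proposes the Relation Scoring Network (RS-Net), a modular framework that improves relation prediction in video dynamic scene graph generation (DSGG) by scoring the importance of object pairs using spatial and temporal context, with notable gains in recall and precision across diverse baselines on the Action Genome dataset.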
Details
Motivation: Existing DSGG methods are trained only on annotated related object pairs and lack guidance for unrelated pairs, which makes it hard to identify meaningful relations at inference time, especially under a long-tailed relation distribution. Method: RS-Net comprises a spatial context encoder (with learnable context tokens) and a temporal encoder (aggregating video-level information). It produces relation scores that are integrated into a unified triplet scoring mechanism to strengthen relation prediction, and it can be integrated into existing DSGG models without architectural changes. Result: Experiments on the Action Genome dataset show that RS-Net improves Recall and Precision across several baselines, with notable gains in mean Recall, easing the long-tail problem while keeping competitive computational efficiency. Conclusion: By introducing contextual importance scoring, RS-Net significantly improves the accuracy and robustness of relation prediction in DSGG and serves as an efficient, plug-and-play improvement. Abstract: Dynamic Scene Graph Generation (DSGG) models how object relations evolve over time in videos. However, existing methods are trained only on annotated object pairs and lack guidance for non-related pairs, making it difficult to identify meaningful relations during inference. In this paper, we propose Relation Scoring Network (RS-Net), a modular framework that scores the contextual importance of object pairs using both spatial interactions and long-range temporal context. RS-Net consists of a spatial context encoder with learnable context tokens and a temporal encoder that aggregates video-level information. The resulting relation scores are integrated into a unified triplet scoring mechanism to enhance relation prediction. RS-Net can be easily integrated into existing DSGG models without architectural changes. Experiments on the Action Genome dataset show that RS-Net consistently improves both Recall and Precision across diverse baselines, with notable gains in mean Recall, highlighting its ability to address the long-tailed distribution of relations. Despite the increased number of parameters, RS-Net maintains competitive efficiency, achieving superior performance over state-of-the-art methods.
[65] Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding
Joseph Fioresi,Ishan Rajendrakumar Dave,Mubarak Shah
Main category: cs.CV
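TL;DR: Proposes a new approach to visual privacy preservation for video foundation models that operates in the latent space. A lightweight Anonymizing Adapter Module (AAM) removes private information from video features while retaining task utility, reducing privacy leakage by 35% with near-baseline downstream performance.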
Details
Motivation: Existing privacy preservation methods mostly anonymize at the pixel level, require retraining the whole model, and apply only to specific tasks, making them a poor fit for modern video foundation models. Extracted spatio-temporal features can also leak sensitive personal information, so a pluggable solution that needs no retraining and balances privacy with utility is required. Method: A lightweight Anonymizing Adapter Module (AAM) is applied in a plug-and-play fashion to frozen video encoders to remove private information in the latent space. Three new training objectives are designed: a self-supervised privacy objective based on static clips, a co-training objective to retain utility on seen tasks, and a latent consistency loss to improve generalization to unseen tasks. Result: Experiments show a 35% reduction in privacy leakage across several downstream tasks (action recognition, temporal action detection, anomaly detection) while keeping near-baseline performance. The method also mitigates gender bias in action recognition, and a new gender bias evaluation protocol is proposed. Conclusion: The AAM framework offers an efficient, flexible, and general privacy preservation scheme for video foundation models that balances privacy and utility without retraining, supporting fairer and safer video understanding. Abstract: We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted visual features for downstream tasks inadvertently reveals sensitive personal information like skin color, gender, or clothing. Current privacy preservation methods focus on input-pixel-level anonymization, which requires retraining the entire utility video model and results in task-specific anonymization, making them unsuitable for recent video foundation models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug-and-play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective to reduce mutual information between static clips, (2) a co-training objective to retain utility across seen tasks, and (3) a latent consistency loss for generalization on unseen tasks.
Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis on anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding.
[66] Rethinking generative image pretraining: How far are we from scaling up next-pixel prediction?
Xinchen Yan,Chen Liang,Lijun Yu,Adams Wei Yu,Yifeng Lu,Quoc V. Le
Main category: cs.CV
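TL;DR: This paper studies the scaling properties of autoregressive next-pixel prediction for unified vision models. It finds that the optimal scaling strategy differs by task (such as image classification versus generation) and that, as resolution increases, model size must grow faster than data size. Compute is the main bottleneck, and pixel-by-pixel image modeling is forecast to become feasible within the next five years.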
Details
Motivation: To explore the scalability of a simple, end-to-end autoregressive next-pixel prediction framework for unified vision models and fill a gap in this line of research. Method: Using IsoFlops profiles, a family of Transformers is trained on 32x32 images under compute budgets of up to 7e19 FLOPs and evaluated on three metrics: the next-pixel prediction objective, ImageNet classification accuracy, and generation quality (Fréchet Distance); scaling trends in model, data, and compute are then analyzed. Result: (1) The optimal scaling strategies for image classification and generation differ, with generation requiring data size to grow three to five times faster than classification; (2) as resolution increases, model size must grow much faster than data size; (3) compute, not the amount of training data, is currently the main bottleneck. Conclusion: Autoregressive next-pixel prediction scales well, and continued growth in compute should make it feasible for high-resolution image modeling, potentially enabling end-to-end unified vision models. Abstract: This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at resolutions of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation quality measured by Fréchet Distance. First, optimal scaling strategy is critically task-dependent. At a fixed 32x32 resolution alone, the optimal scaling properties for image classification and image generation diverge, where the generation-optimal setup requires the data size to grow three to five times faster than the classification-optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.
[67] Harnessing Diffusion-Generated Synthetic Images for Fair Image Classification
Abhipsa Basu,Aviral Gupta,Abhijnya Bhat,R. Venkatesh Babu
Main category: cs.CV
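TL;DR: This study explores diffusion fine-tuning techniques such as LoRA and DreamBooth to generate images that represent each group more accurately and thereby reduce data bias in image classification. Images within each group are clustered and a DreamBooth model is trained per cluster; the generated balanced data is used for pretraining, followed by fine-tuning on real data. Experiments show the approach surpasses existing debiasing techniques in high-bias settings.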
Details
Motivation: Uneven group representation in image classification datasets causes models to inherit biases, such as the stereotypical association between blond hair and gender in hair color classification. Existing data balancing based on Stable Diffusion struggles to preserve the original data distribution, so a more precise generation strategy is needed. Method: Diffusion models are fine-tuned with LoRA and DreamBooth to learn directly from each group's samples; to avoid overloading a single model, images are clustered within each group and a separate DreamBooth model is trained per cluster; the generated balanced data is used for pretraining, followed by fine-tuning on real data. Result: Across several benchmarks, the studied fine-tuning approaches outperform vanilla Stable Diffusion on average, reach performance comparable to SOTA debiasing methods such as Group-DRO, and do better as dataset bias becomes more severe. Conclusion: Cluster-based DreamBooth fine-tuning, together with LoRA, models group characteristics more accurately and generates high-quality balanced data, with clear advantages for mitigating image classification bias, especially on highly skewed datasets. Abstract: Image classification systems often inherit biases from uneven group representation in training data. For example, in face datasets for hair color classification, blond hair may be disproportionately associated with females, reinforcing stereotypes. A recent approach leverages the Stable Diffusion model to generate balanced training data, but these models often struggle to preserve the original data distribution. In this work, we explore multiple diffusion-finetuning techniques, e.g., LoRA and DreamBooth, to generate images that more accurately represent each training group by learning directly from their samples. Additionally, in order to prevent a single DreamBooth model from being overwhelmed by excessive intra-group variations, we explore a technique of clustering images within each group and train a DreamBooth model per cluster. These models are then used to generate group-balanced data for pretraining, followed by fine-tuning on real data. Experiments on multiple benchmarks demonstrate that the studied finetuning approaches outperform vanilla Stable Diffusion on average and achieve results comparable to SOTA debiasing techniques like Group-DRO, while surpassing them as the dataset bias severity increases.
[68] WiCV at CVPR 2025: The Women in Computer Vision Workshop
Estefania Talavera,Deblina Bhattacharjee,Himangi Mittal,Mengwei Ren,Karen Sanchez,Carla Muntean,JungEun Kim,Mona Jalal
Main category: cs.CV
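TL;DR: WiCV@CVPR 2025 is the 16th edition of the Women in Computer Vision workshop, which aims to increase the visibility, inclusion, and professional growth of women and underrepresented minorities in the computer vision community. The workshop accepted 14 full papers and 36 extended abstracts, ran a mentoring program, and drew over 100 onsite participants, continuing to advance its diversity and inclusion goals with sponsor support.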
Details
Motivation: To increase the participation, visibility, and professional growth of women and underrepresented minorities in computer vision, and to advance diversity, equity, and inclusion in the field.
Method: An annual workshop that accepts and presents research papers and extended abstracts, runs a mentoring program spanning academia and industry, and provides travel grants and diversity awards to support emerging researchers.
Result: In 2025, 14 of 32 full-paper submissions were accepted (5 as oral presentations), and 36 extended abstracts were accepted as posters. The mentoring program matched 80 mentees with 37 mentors, more than 100 people attended onsite, and the workshop was supported by 10 sponsors and approximately $44,000 USD in funding.
Conclusion: WiCV continues to promote diversity and inclusion in computer vision effectively and offers a reference model for future editions and similar initiatives.
Abstract: The Women in Computer Vision Workshop (WiCV@CVPR 2025) was held in conjunction with the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025) in Nashville, Tennessee, United States. This report presents an overview of the workshop program, participation statistics, mentorship outcomes, and historical trends from previous WiCV editions. The goal is to document the impact and evolution of WiCV as a reference for future editions and for other initiatives aimed at advancing diversity, equity, and inclusion within the AI and computer vision communities. WiCV@CVPR 2025 marked the 16th edition of this long-standing event dedicated to increasing the visibility, inclusion, and professional growth of women and underrepresented minorities in the computer vision community. This year's workshop featured 14 accepted papers in the CVPR Workshop Proceedings out of 32 full-paper submissions. Five of these were selected for oral presentations, while all 14 were also presented as posters, along with 36 extended abstract posters accepted from 62 short-paper submissions, which are not included in the proceedings. The mentoring program matched 80 mentees with 37 mentors from both academia and industry. The 2025 edition attracted over 100 onsite participants, fostering rich technical and networking interactions across all career stages. Supported by 10 sponsors and approximately $44,000 USD in travel grants and diversity awards, WiCV continued its mission to empower emerging researchers and amplify diverse voices in computer vision.
[69] Adaptive graph Kolmogorov-Arnold network for 3D human pose estimation
Abu Taib Mohammed Shahjahan,A. Ben Hamza
Main category: cs.CV
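TL;DR: This paper proposes PoseKAN, an adaptive graph Kolmogorov-Arnold network for 2D-to-3D human pose estimation, which uses learnable edge functions and multi-hop feature aggregation to overcome the limitations of conventional GCNs in modeling long-range dependencies and high-frequency details.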
Details
Motivation: Existing GCN methods are limited by local receptive fields and spectral bias, which restricts their handling of occlusions and depth ambiguities and makes it hard to capture long-range dependencies and high-frequency details.
Method: The PoseKAN framework replaces fixed activation functions with learnable functions on graph edges and combines multi-hop feature aggregation, residual PoseKAN blocks, and global response normalization to improve expressiveness and spatial awareness.
Result: Experiments on multiple benchmark datasets show performance competitive with state-of-the-art methods on 2D-to-3D pose lifting.
Conclusion: Through learnable graph edge functions and multi-hop aggregation, PoseKAN improves the accuracy and robustness of 3D human pose estimation, with stronger modeling capacity for complex pose variations.
Abstract: Graph convolutional network (GCN)-based methods have shown strong performance in 3D human pose estimation by leveraging the natural graph structure of the human skeleton. However, their local receptive field limits their ability to capture long-range dependencies essential for handling occlusions and depth ambiguities. They also exhibit spectral bias, which prioritizes low-frequency components while struggling to model high-frequency details. In this paper, we introduce PoseKAN, an adaptive graph Kolmogorov-Arnold Network (KAN) framework that extends KANs to graph-based learning for 2D-to-3D pose lifting from a single image. Unlike GCNs that use fixed activation functions, KANs employ learnable functions on graph edges, allowing data-driven, adaptive feature transformations. This enhances the model's adaptability and expressiveness in learning complex pose variations. Our model employs multi-hop feature aggregation, ensuring the body joints can leverage information from both local and distant neighbors, leading to improved spatial awareness. It also incorporates residual PoseKAN blocks for deeper feature refinement, and a global response normalization for improved feature selectivity and contrast. Extensive experiments on benchmark datasets demonstrate the competitive performance of our model against state-of-the-art methods.
[70] SIFT-Graph: Benchmarking Multimodal Defense Against Image Adversarial Attacks With Robust Feature Graph
Jingjie He,Weijie Liang,Zihan Shan,Matthew Caesar
Main category: cs.CV
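TL;DR: Proposes SIFT-Graph, a multimodal defense framework combining handcrafted and learned features, which integrates SIFT keypoints with a Graph Attention Network to improve the robustness of vision models against adversarial attacks.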
Details
Motivation: Existing defenses mostly operate in the fragile pixel space and lack mechanisms to exploit inherently robust visual features, leaving models vulnerable to adversarial perturbations.
Method: Scale- and rotation-invariant features extracted from SIFT keypoints are combined with a Graph Attention Network to build structured feature representations, which are fused with traditional models such as Vision Transformers and CNNs into a unified, structure-aware defense model.
Result: Robustness under gradient-based white-box attacks improves markedly, while clean accuracy stays high with only a marginal drop.
Conclusion: By fusing structured multimodal features, SIFT-Graph strengthens deep vision models against adversarial attacks and supports the potential of defenses that combine classical features with deep learning.
Abstract: Adversarial attacks expose a fundamental vulnerability in modern deep vision models by exploiting their dependence on dense, pixel-level representations that are highly sensitive to imperceptible perturbations. Traditional defense strategies typically operate within this fragile pixel domain, lacking mechanisms to incorporate inherently robust visual features. In this work, we introduce SIFT-Graph, a multimodal defense framework that enhances the robustness of traditional vision models by aggregating structurally meaningful features extracted from raw images using both handcrafted and learned modalities. Specifically, we integrate Scale-Invariant Feature Transform keypoints with a Graph Attention Network to capture scale- and rotation-invariant local structures that are resilient to perturbations. These robust feature embeddings are then fused with traditional vision models, such as Vision Transformers and Convolutional Neural Networks, to form a unified, structure-aware and perturbation-defensive model. Preliminary results demonstrate that our method effectively improves the visual model robustness against gradient-based white-box adversarial attacks, while incurring only a marginal drop in clean accuracy.
[71] DT-NVS: Diffusion Transformers for Novel View Synthesis
Wonbong Jang,Jonathan Tremblay,Lourdes Agapito
Main category: cs.CV
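TL;DR: This paper proposes DT-NVS, a 3D-aware diffusion model for generalized novel view synthesis that can be trained with image-only losses on real-world, multi-category, unaligned video datasets of everyday scenes. It generates novel views from a single input image and outperforms existing methods in diversity and performance.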
Details
Motivation: Existing diffusion-based methods focus on small camera movements or are limited to unnatural object-centric scenes, and are hard to apply to complex real-world scenes. A novel view synthesis method is needed that handles large viewpoint changes in real environments and works across diverse scenes.
Method: DT-NVS is a transformer-based 3D-aware diffusion model. It adapts transformer and self-attention architectures to translate images into 3D representations, designs new camera conditioning strategies for unaligned real-world data, and introduces a training paradigm that swaps the role of reference frame between the conditioning image and the noisy input, training with image-only losses.
Result: On generalized novel view synthesis, DT-NVS outperforms state-of-the-art 3D-aware diffusion models and deterministic methods on multiple real-scene datasets, generating more diverse, high-quality novel views.
Conclusion: Through its network design, camera conditioning strategy, and training paradigm, DT-NVS achieves single-image novel view synthesis on large-scale unaligned real-world data, advancing the use of diffusion models in complex natural scenes.
Abstract: Generating novel views of a natural scene, e.g., everyday scenes both indoors and outdoors, from a single view is an under-explored problem, even though it is an organic extension to the object-centric novel view synthesis. Existing diffusion-based approaches focus rather on small camera movements in real scenes or only consider unnatural object-centric scenes, limiting their potential applications in real-world settings. In this paper we move away from these constrained regimes and propose a 3D diffusion model trained with image-only losses on a large-scale dataset of real-world, multi-category, unaligned, and casually acquired videos of everyday scenes. We propose DT-NVS, a 3D-aware diffusion model for generalized novel view synthesis that exploits a transformer-based architecture backbone. We make significant contributions to transformer and self-attention architectures to translate images to 3D representations, and novel camera conditioning strategies to allow training on real-world unaligned datasets. In addition, we introduce a novel training paradigm swapping the role of reference frame between the conditioning image and the sampled noisy input. We evaluate our approach on the 3D task of generalized novel view synthesis from a single input image and show improvements over state-of-the-art 3D-aware diffusion models and deterministic approaches, while generating diverse outputs.
[72] Enhancing Rotation-Invariant 3D Learning with Global Pose Awareness and Attention Mechanisms
Jiaxun Guo,Manar Amayri,Nizar Bouguila,Xin Liu,Wentao Fan
Main category: cs.CV
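TL;DR: Proposes a rotation-invariant learning method for 3D point clouds based on the Shadow-informed Pose Feature (SiPF) and Rotation-invariant Attention Convolution (RIAttnConv). By introducing a globally consistent reference point (the shadow), it addresses the difficulty existing methods have in distinguishing symmetric components due to restricted receptive fields, strengthening global pose awareness while preserving rotation invariance.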
Details
Motivation: Existing rotation-invariant learning methods lose global pose information because they rely on local invariant features, so they cannot distinguish structures that are geometrically similar but spatially distinct, in particular symmetric components such as the left and right wings of an airplane. This paper aims to resolve that feature collapse.
Method: The Shadow-informed Pose Feature (SiPF) uses a learned shared rotation to generate a global reference point (the shadow), adding global pose awareness to local invariant descriptors. The Rotation-invariant Attention Convolution (RIAttnConv) operator integrates SiPFs into feature aggregation. A task-adaptive shadow locating module based on the Bingham distribution dynamically learns the optimal global rotation.
Result: The method substantially outperforms existing rotation-invariant methods on 3D classification and part segmentation, especially on tasks that require fine-grained spatial discrimination, confirming its robustness and discriminative power under arbitrary rotations.
Conclusion: The proposed method resolves Wing-tip feature collapse in rotation-invariant learning, restoring global pose awareness while preserving rotation invariance, and offers a robust and discriminative framework for 3D point cloud analysis.
Abstract: Recent advances in rotation-invariant (RI) learning for 3D point clouds typically replace raw coordinates with handcrafted RI features to ensure robustness under arbitrary rotations. However, these approaches often suffer from the loss of global pose information, making them incapable of distinguishing geometrically similar but spatially distinct structures. We identify that this limitation stems from the restricted receptive field in existing RI methods, leading to Wing-tip feature collapse, a failure to differentiate symmetric components (e.g., left and right airplane wings) due to indistinguishable local geometries. To overcome this challenge, we introduce the Shadow-informed Pose Feature (SiPF), which augments local RI descriptors with a globally consistent reference point (referred to as the 'shadow') derived from a learned shared rotation. This mechanism enables the model to preserve global pose awareness while maintaining rotation invariance. We further propose Rotation-invariant Attention Convolution (RIAttnConv), an attention-based operator that integrates SiPFs into the feature aggregation process, thereby enhancing the model's capacity to distinguish structurally similar components. Additionally, we design a task-adaptive shadow locating module based on the Bingham distribution over unit quaternions, which dynamically learns the optimal global rotation for constructing consistent shadows.
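The Wing-tip feature collapse described above can be illustrated with a toy sketch. It is not the authors' SiPF construction: the four-point "airplane", the `local_descriptor` helper, and the hand-placed `shadow` point are all hypothetical, and the shadow is simply assumed to move rigidly with the object, whereas the abstract derives it from a learned shared rotation. The sketch only shows that distance-only descriptors are identical for mirror-symmetric points, while the distance to an off-plane reference point separates them and stays unchanged under rotation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def rotate_z(p: Point, angle: float) -> Point:
    """Rotate a point about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)


def local_descriptor(index: int, cloud: Sequence[Point]) -> List[float]:
    """Sorted distances from one point to all others (rotation invariant)."""
    return sorted(
        round(math.dist(cloud[index], q), 6)
        for j, q in enumerate(cloud)
        if j != index
    )


# Toy "airplane": nose, tail, left wing tip, right wing tip (mirror-symmetric in x).
cloud = [(0.0, 2.0, 0.0), (0.0, -2.0, 0.0), (-3.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
LEFT, RIGHT = 2, 3
shadow = (4.0, 1.0, 2.0)  # hypothetical reference point off the symmetry plane

# Distance-only descriptors cannot tell the two wing tips apart.
collapsed = local_descriptor(LEFT, cloud) == local_descriptor(RIGHT, cloud)

# The distance to the shadow separates them.
left_to_shadow = math.dist(cloud[LEFT], shadow)
right_to_shadow = math.dist(cloud[RIGHT], shadow)

# Rotating cloud and shadow together leaves the shadow distance unchanged.
angle = 1.234
rot_cloud = [rotate_z(p, angle) for p in cloud]
rot_shadow = rotate_z(shadow, angle)
rot_left_to_shadow = math.dist(rot_cloud[LEFT], rot_shadow)
```

In this toy setting the extra feature is only useful because the reference point sits off the symmetry plane; a point on the plane would be equidistant from both wing tips.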
Extensive experiments on 3D classification and part segmentation benchmarks demonstrate that our approach substantially outperforms existing RI methods, particularly in tasks requiring fine-grained spatial discrimination under arbitrary rotations.
[73] SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation
Hu Cui,Wenqiang Hua,Renjing Huang,Shurui Jia,Tessai Hayama
Main category: cs.CV
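TL;DR: Proposes a skeleton structure-aware stride state space model (SAS-SSM) for 3D human pose estimation that preserves spatial structure and models spatiotemporal dependencies.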
Details
Motivation: Existing SSM-based methods rely on manually designed scan operations that disrupt the inherent spatial structure of human poses, making complex pose dependencies hard to capture.
Method: SAS-SSM combines a structure-aware spatiotemporal convolution with a stride-based scan strategy to dynamically capture local interactions between joints and build multi-scale global structural representations.
Result: The approach models both local and global pose information flexibly while keeping linear computational complexity, and the resulting SasMamba model achieves competitive performance with fewer parameters.
Conclusion: SAS-SSM addresses the disruption of spatial structure and the entanglement of spatial and temporal features, providing an efficient and lightweight solution for 3D human pose estimation.
Abstract: Recently, the Mamba architecture based on State Space Models (SSMs) has gained attention in 3D human pose estimation due to its linear complexity and strong global modeling capability. However, existing SSM-based methods typically apply manually designed scan operations to flatten detected 2D pose sequences into purely temporal sequences, either locally or globally. This approach disrupts the inherent spatial structure of human poses and entangles spatial and temporal features, making it difficult to capture complex pose dependencies. To address these limitations, we propose the Skeleton Structure-Aware Stride SSM (SAS-SSM), which first employs a structure-aware spatiotemporal convolution to dynamically capture essential local interactions between joints, and then applies a stride-based scan strategy to construct multi-scale global structural representations. This enables flexible modeling of both local and global pose information while maintaining linear computational complexity. Built upon SAS-SSM, our model SasMamba achieves competitive 3D pose estimation performance with significantly fewer parameters compared to existing hybrid models. The source code is available at https://hucui2022.github.io/sasmamba_proj/.
[74] Improve Contrastive Clustering Performance by Multiple Fusing-Augmenting ViT Blocks
Cheng Wang,Shuisheng Zhou,Fengjiao Peng,Jin Sheng,Feng Ye,Yinli Dong
Main category: cs.CV
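TL;DR: Proposes multiple fusing-augmenting ViT blocks (MFAVBs) based on Vision Transformers, which explicitly fuse the features of positive pairs to improve image clustering, outperforming existing methods on seven datasets.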
Details
Motivation: The implicit interaction in existing contrastive learning networks limits how fully the complementarity and similarity of positive pairs can be exploited, making it hard to extract clustering features.
Method: In the MFAVBs design, shared-weight ViTs process the augmented positive samples and their output features are fused and fed into a larger ViT. The features are then split and passed to subsequent FAVBs for repeated fusion and augmentation. The method also uses features pre-extracted by CLIP and jointly optimizes cross-entropy losses in instance-level and clustering-level spaces.
Result: Experiments on seven public datasets show clustering performance above current state-of-the-art techniques.
Conclusion: MFAVBs explicitly fuse positive-pair features and clearly improve contrastive clustering, and adding CLIP features further strengthens the model's ability to distinguish similar images.
Abstract: In the field of image clustering, the widely used contrastive learning networks improve clustering performance by maximizing the similarity between positive pairs and the dissimilarity of negative pairs of the inputs. Extant contrastive learning networks, whose two encoders often implicitly interact with each other by parameter sharing or momentum updating, may not fully exploit the complementarity and similarity of the positive pairs to extract clustering features from input data. To explicitly fuse the learned features of positive pairs, we design novel multiple fusing-augmenting ViT blocks (MFAVBs) based on the excellent feature learning ability of Vision Transformers (ViT). Firstly, two preprocessed augmentations as positive pairs are separately fed into two shared-weight ViTs, then their output features are fused to input into a larger ViT. Secondly, the learned features are split into a pair of new augmented positive samples and passed to the next FAVBs, enabling multiple fusion and augmentation through MFAVBs operations. Finally, the learned features are projected into both instance-level and clustering-level spaces to calculate the cross-entropy loss, followed by parameter updates by backpropagation to finalize the training process. To further enhance the ability of the model to distinguish between similar images, the input to our network consists of preprocessed augmentations with features extracted from the pretrained CLIP model.
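The instance-level cross-entropy objective mentioned above can be sketched as follows. The abstract does not give the exact loss, so this assumes the NT-Xent form commonly used in contrastive clustering; the function names, the temperature value, and the toy embeddings are hypothetical, and the clustering-level loss and the ViT blocks themselves are not modeled.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(
    view_a: List[Sequence[float]],
    view_b: List[Sequence[float]],
    temperature: float = 0.5,
) -> float:
    """Instance-level contrastive cross-entropy over N positive pairs.

    view_a[i] and view_b[i] are embeddings of two augmentations of sample i.
    Every other embedding in the batch acts as a negative for that anchor.
    """
    n = len(view_a)
    z = list(view_a) + list(view_b)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the paired augmentation
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp over all candidates except the anchor.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


# Toy embeddings (hypothetical): aligned pairs versus deliberately swapped pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = nt_xent(anchors, aligned)
loss_swapped = nt_xent(anchors, swapped)
```

With these toy inputs the aligned pairs give the lower loss, which is the behavior the objective rewards: each embedding should be closest to its own augmentation.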
Our experiments on seven public datasets demonstrate that MFAVBs, serving as the backbone for contrastive clustering, outperform the state-of-the-art techniques in terms of clustering performance.
[75] Classifying Histopathologic Glioblastoma Sub-regions with EfficientNet
Sanyukta Adap,Ujjwal Baid,Spyridon Bakas
Main category: cs.CV
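TL;DR: This study proposes a four-step deep learning approach based on EfficientNet to automatically classify six histopathological regions in glioblastoma (GBM), evaluated on the BraTS-Path 2024 dataset. The model performs very well in 5-fold cross-validation on the training set (F1 = 0.98) but drops sharply on the validation and test sets (F1 of 0.546 and 0.517, respectively), highlighting the challenge of generalization.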
Details
Motivation: Glioblastoma (GBM) has a poor prognosis, and existing diagnostic tools have not substantially improved patient survival. Automatically identifying distinct histological sub-regions in GBM could support a deeper understanding of its morphology at scale and improve diagnosis and research.
Method: A four-step deep learning framework is applied to the H&E-stained whole-slide image data of the BraTS-Path 2024 challenge, using several EfficientNet variants (B0 to B4) to classify six pathological regions, with performance assessed by 5-fold cross-validation.
Result: EfficientNet-B1 and B4 reach an F1 score of 0.98 on the training set, but F1 falls to 0.546 on the hold-out validation set and 0.517 on the hidden test set, indicating clear overfitting.
Conclusion: Although the model performs very well on training data, its generalization to new data is limited, so robustness and generalization need further work before clinical use.
Abstract: Glioblastoma (GBM) is the most common aggressive, fast-growing brain tumor, with a grim prognosis. Despite clinical diagnostic advancements, there have not been any substantial improvements to patient prognosis. Histopathological assessment of excised tumors is the first line of clinical diagnostic routine. We hypothesize that automated, robust, and accurate identification of distinct histological sub-regions within GBM could contribute to morphologically understanding this disease at scale. In this study, we designed a four-step deep learning approach to classify six (6) histopathological regions and quantitatively evaluated it on the BraTS-Path 2024 challenge dataset, which includes digitized Hematoxylin & Eosin (H&E) stained GBM tissue sections annotated for six distinct regions. We used the challenge's publicly available training dataset to develop and evaluate the effectiveness of several variants of EfficientNet architectures (i.e., B0, B1, B2, B3, B4). EfficientNet-B1 and EfficientNet-B4 achieved the best performance, achieving an F1 score of 0.98 in a 5-fold cross-validation configuration using the BraTS-Path training set. The quantitative performance evaluation of our proposed approach with EfficientNet-B1 on the BraTS-Path hold-out validation data and the final hidden testing data yielded F1 scores of 0.546 and 0.517, respectively, for the associated 6-class classification task. The difference in the performance on training, validation, and testing data highlights the challenge of developing models that generalize well to new data, which is crucial for clinical applications.
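For readers unfamiliar with how a multi-class F1 score is obtained, here is a minimal sketch. The abstract does not say which averaging the challenge uses, so macro averaging (the unweighted mean of per-class F1) is an assumption; the function names and the three-class toy labels are hypothetical and unrelated to the BraTS-Path data.

```python
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[int, float]:
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    scores: Dict[int, float] = {}
    for label in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels (hypothetical), three classes with two samples each.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
class_scores = per_class_f1(y_true, y_pred)
score = macro_f1(y_true, y_pred)
```

Because macro averaging weights every class equally, a few poorly recognized regions can pull the overall score down sharply even when the common classes are predicted well.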
The source code of the proposed approach can be found at the GitHub repository of Indiana University Division of Computational Pathology: https://github.com/IUCompPath/brats-path-2024-enet.
[76] Improving VisNet for Object Recognition
Mehdi Fatan Serj,C. Alejandro Parraga,Xavier Otazu
Main category: cs.CV
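TL;DR: This study examines VisNet and several enhanced variants for object recognition and symmetry classification. Introducing RBF neurons, Mahalanobis distance-based learning, and retina-like preprocessing, together with Hebbian learning and the temporal continuity principle, improves recognition accuracy and feature invariance.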
Details
Motivation: Inspired by the efficient object recognition of biological visual systems, the work aims to improve object recognition in artificial systems under complex transformations and to make the models more biologically plausible and interpretable.
Method: VisNet and its improved versions are combined with radial basis function (RBF) neurons, a Mahalanobis distance-based learning mechanism, and retina-like preprocessing, using Hebbian learning and the temporal continuity principle to build invariant feature representations.
Result: On MNIST, CIFAR10, and custom symmetric object datasets, the improved VisNet variants clearly outperform the baseline model, with higher recognition accuracy and stronger transformation invariance.
Conclusion: The enhanced VisNet architectures improve object recognition and show the potential of biologically inspired models at the intersection of artificial intelligence and neuroscience, offering an interpretable and efficient framework for visual recognition.
Abstract: Object recognition plays a fundamental role in how biological organisms perceive and interact with their environment. While the human visual system performs this task with remarkable efficiency, reproducing similar capabilities in artificial systems remains challenging. This study investigates VisNet, a biologically inspired neural network model, and several enhanced variants incorporating radial basis function neurons, Mahalanobis distance-based learning, and retina-like preprocessing for both general object recognition and symmetry classification. By leveraging the principles of Hebbian learning and temporal continuity, which associate temporally adjacent views to build invariant representations, VisNet and its extensions capture robust and transformation-invariant features. Experimental results across multiple datasets, including MNIST, CIFAR10, and custom symmetric object sets, show that these enhanced VisNet variants substantially improve recognition accuracy compared with the baseline model. These findings underscore the adaptability and biological relevance of VisNet-inspired architectures, offering a powerful and interpretable framework for visual recognition in both neuroscience and artificial intelligence.
Keywords: VisNet, Object Recognition, Symmetry Detection, Hebbian Learning, RBF Neurons, Mahalanobis Distance, Biologically Inspired Models, Invariant Representations
[77] Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency
Riling Wei,Kelu Yao,Chuanguang Yang,Jin Wang,Zhuoyan Gao,Chao Li
Main category: cs.CV
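TL;DR: This paper proposes Asymmetric Cross-modal Knowledge Distillation (ACKD), a new approach to cross-modal knowledge distillation under weak semantic consistency, and designs the SemBridge framework to address high knowledge transmission costs, achieving leading performance on remote sensing scene classification.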
Details
Motivation: Paired modality data is scarce in real-world scenarios, which limits conventional Symmetric Cross-modal Knowledge Distillation (SCKD). A general knowledge distillation method that works under weak semantic consistency is therefore needed.
Method: The SemBridge framework contains a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former uses self-supervised learning to dynamically select relevant teacher samples and give each student sample personalized guidance. The latter searches for the optimal transport path using Lagrangian optimization.
Result: The method is validated on multiple datasets and 6 model architectures and outperforms 7 existing approaches.
Conclusion: SemBridge reduces the knowledge transmission cost under weak semantic consistency, advances asymmetric cross-modal knowledge distillation, and performs strongly on remote sensing image classification.
Abstract: Cross-modal Knowledge Distillation has demonstrated promising performance on paired modalities with strong semantic connections, referred to as Symmetric Cross-modal Knowledge Distillation (SCKD). However, implementing SCKD becomes exceedingly constrained in real-world scenarios due to the limited availability of paired modalities. To this end, we investigate a general and effective knowledge learning concept under weak semantic consistency, dubbed Asymmetric Cross-modal Knowledge Distillation (ACKD), aiming to bridge modalities with limited semantic overlap. Nevertheless, the shift from strong to weak semantic consistency improves flexibility but exacerbates challenges in knowledge transmission costs, which we rigorously verify based on optimal transport theory. To mitigate the issue, we further propose a framework, namely SemBridge, integrating a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former leverages self-supervised learning to acquire semantic-based knowledge and provide personalized instruction for each student sample by dynamically selecting the relevant teacher samples. The latter seeks the optimal transport path by employing Lagrangian optimization. To facilitate the research, we curate a benchmark dataset derived from two modalities, namely Multi-Spectral (MS) and asymmetric RGB images, tailored for remote sensing scene classification. Comprehensive experiments show that our framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets.
[78] LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis
Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
Main category: cs.CV
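TL;DR: Proposes a semi-supervised document layout detection framework that fuses visual predictions with structural priors from text-pretrained large language models (LLMs). Inverse-variance fusion produces high-quality pseudo-labels, substantially improving performance with few labeled samples.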
Details
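Motivation: Document layout understanding still depends on large amounts of labeled data, and progress in semi-supervised learning has brought only limited gains. The work uses the structural priors of LLMs to strengthen semi-supervised detection and reduce the dependence on labels.
Method: An OCR-LLM pipeline infers hierarchical regions of a document. These structural priors are combined with the teacher detector's outputs through inverse-variance weighted fusion to produce more accurate pseudo-labels, and an instance-adaptive gating mechanism adjusts the weights dynamically.
Result: On PubLayNet with only 5% of the labels, a lightweight SwiftFormer reaches 88.2 AP and LayoutLMv3 with the framework reaches 89.7 AP, surpassing standard semi-supervised learning and matching UDOP, which requires large-scale pretraining. LLMs provide semantic disambiguation (+3.8 AP) and allow privacy-preserving deployment at a manageable system cost.
Conclusion: LLM structural priors complement visual models across architectures of different scales, clearly improving document layout understanding in low-label settings while remaining practical and scalable.
Abstract: Document layout understanding remains data-intensive despite advances in semi-supervised learning. We present a framework that enhances semi-supervised detection by fusing visual predictions with structural priors from text-pretrained LLMs via principled probabilistic weighting. Given unlabeled documents, an OCR-LLM pipeline infers hierarchical regions which are combined with teacher detector outputs through inverse-variance fusion to generate refined pseudo-labels. Our method demonstrates consistent gains across model scales. With a lightweight SwiftFormer backbone (26M params), we achieve 88.2 ± 0.3 AP using only 5% labels on PubLayNet. When applied to document-pretrained LayoutLMv3 (133M params), our fusion framework reaches 89.7 ± 0.4 AP, surpassing LayoutLMv3 with standard semi-supervised learning (89.1 ± 0.4 AP, p=0.02) and matching UDOP (89.8 AP), which requires 100M+ pages of multimodal pretraining. This demonstrates that LLM structural priors are complementary to both lightweight and pretrained architectures.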
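The inverse-variance fusion step mentioned above can be illustrated with a minimal sketch. It assumes each source (the teacher detector and the OCR-LLM pipeline) supplies a box estimate together with a single variance describing its uncertainty; the function name `fuse_inverse_variance`, the box format, and all numbers are hypothetical and not taken from the paper, which also adds learned instance-adaptive gating that is not modeled here.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


def fuse_inverse_variance(
    estimates: Sequence[Box], variances: Sequence[float]
) -> Tuple[Box, float]:
    """Fuse independent box estimates by inverse-variance weighting.

    Each estimate is weighted by 1 / variance, so the more certain source
    dominates. Returns the fused box and the variance of the fused estimate.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need exactly one variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = tuple(
        sum(w * box[i] for w, box in zip(weights, estimates)) / total
        for i in range(4)
    )
    return fused, 1.0 / total


# Hypothetical inputs: a confident detector box and a noisier LLM-derived box.
detector_box = (10.0, 20.0, 110.0, 220.0)
llm_box = (14.0, 24.0, 114.0, 224.0)
fused_box, fused_var = fuse_inverse_variance([detector_box, llm_box], [1.0, 3.0])
```

In this example the fused box lies three times closer to the detector's estimate than to the LLM's, and the fused variance (0.75) is smaller than either input variance, which is the usual motivation for this weighting when the two sources make independent errors.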
Key findings include: (1) learned instance-adaptive gating improves over fixed weights by +0.9 AP with data-dependent PAC bounds correctly predicting convergence; (2) open-source LLMs enable privacy-preserving deployment with minimal loss (Llama-3-70B: 87.1 AP lightweight, 89.4 AP with LayoutLMv3); (3) LLMs provide targeted semantic disambiguation (18.7% of cases, +3.8 AP gain) beyond simple text heuristics. Total system cost includes $12 for GPT-4o-mini API or 17 GPU-hours for local Llama-3-70B per 50K pages, amortized across training runs.
[79] Consistency Change Detection Framework for Unsupervised Remote Sensing Change Detection
Yating Liu,Yan Lu
Main category: cs.CV
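TL;DR: Proposes a Consistency Change Detection Framework (CCDF) that uses cycle consistency and semantic consistency modules to address generator overfitting and improve unsupervised remote sensing change detection.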
Details
Motivation: Existing unsupervised methods perform poorly on remote sensing change detection because the generator overfits.
Method: A Cycle Consistency (CC) module reduces generator overfitting, and a Semantic Consistency (SC) module enables detail reconstruction.
Result: Experiments show the method outperforms current state-of-the-art unsupervised change detection methods on multiple datasets.
Conclusion: CCDF mitigates generator overfitting and clearly improves the accuracy and robustness of unsupervised remote sensing change detection.
Abstract: Unsupervised remote sensing change detection aims to monitor and analyze changes from multi-temporal remote sensing images in the same geometric region at different times, without the need for labeled training data. Previous unsupervised methods attempt to achieve style transfer across multi-temporal remote sensing images through reconstruction by a generator network, and then capture the unreconstructable areas as the changed regions. However, it often leads to poor performance due to generator overfitting. In this paper, we propose a novel Consistency Change Detection Framework (CCDF) to address this challenge. Specifically, we introduce a Cycle Consistency (CC) module to reduce the overfitting issues in the generator-based reconstruction. Additionally, we propose a Semantic Consistency (SC) module to enable detail reconstruction. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches.
[80] HitoMi-Cam: A Shape-Agnostic Person Detection Method Using the Spectral Characteristics of Clothing
Shuji Ono
Main category: cs.CV
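TL;DR: HitoMi-Cam is a lightweight, shape-agnostic person detection method based on spectral reflectance properties. It runs in real time on resource-constrained edge devices and outperforms conventional CNN models in scenarios with unpredictable shapes, such as search and rescue.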
Details
Motivation: Conventional CNN-based object detection is sensitive to postures and shapes missing from the training data, which degrades performance in real applications such as search and rescue. A detection method that does not depend on shape is therefore needed.
Method: HitoMi-Cam detects people using the spectral reflectance properties of clothing, and it is implemented and evaluated on an edge device without a GPU.
Result: The system runs at 23.2 fps and reaches an average precision of 93.5% in a simulated search and rescue scenario, far above the 53.8% of the best CNN model, while keeping false positives very low.
Conclusion: The spectral-based HitoMi-Cam method is practical and efficient in real environments where shapes vary, and can serve as a useful complement to CNN methods under specific conditions.
Abstract: While convolutional neural network (CNN)-based object detection is widely used, it exhibits a shape dependency that degrades performance for postures not included in the training data. Building upon our previous simulation study published in this journal, this study implements and evaluates the spectral-based approach on physical hardware to address this limitation. Specifically, this paper introduces HitoMi-Cam, a lightweight and shape-agnostic person detection method that uses the spectral reflectance properties of clothing. The author implemented the system on a resource-constrained edge device without a GPU to assess its practical viability. The results indicate that a processing speed of 23.2 frames per second (fps) (253x190 pixels) is achievable, suggesting that the method can be used for real-time applications. In a simulated search and rescue scenario where the performance of CNNs declines, HitoMi-Cam achieved an average precision (AP) of 93.5%, surpassing that of the compared CNN models (best AP of 53.8%). Throughout all evaluation scenarios, the occurrence of false positives remained minimal. This study positions the HitoMi-Cam method not as a replacement for CNN-based detectors but as a complementary tool under specific conditions. The results indicate that spectral-based person detection can be a viable option for real-time operation on edge devices in real-world environments where shapes are unpredictable, such as disaster rescue.
[81] Negative Entity Suppression for Zero-Shot Captioning with Synthetic Images
Zimao Lu,Hui Xu,Bing Liu,Ke Wang
Main category: cs.CV
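TL;DR: This paper proposes Negative Entity Suppression (NES) to address poor cross-domain generalization and hallucination in zero-shot image captioning. NES reduces hallucination and improves cross-domain performance by using synthetic images, filtering negative entities from retrieved content, and applying attention-level suppression.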
Details
Motivation: Existing text-only training methods perform poorly across domains and tend to hallucinate content, while retrieval-based methods can introduce irrelevant entities and make hallucination worse. A method that identifies and suppresses negative entities is needed.
Method: The Negative Entity Suppression (NES) framework has three stages: using synthetic images to keep image-to-text retrieval consistent, filtering negative entities from retrieved content, and suppressing the influence of negative entities at the attention level.
Result: Across multiple benchmarks, NES stays competitive in-domain while clearly improving cross-domain transfer and lowering hallucination rates, reaching state-of-the-art performance.
Conclusion: NES mitigates the cross-domain generalization and hallucination problems of zero-shot image captioning and suggests a new optimization direction for retrieval-augmented generation systems.
Abstract: Text-only training provides an attractive approach to address data scarcity challenges in zero-shot image captioning (ZIC), avoiding the expense of collecting paired image-text annotations. However, although these approaches perform well within training domains, they suffer from poor cross-domain generalization, often producing hallucinated content when encountering novel visual environments. Retrieval-based methods attempt to mitigate this limitation by leveraging external knowledge, but they can paradoxically exacerbate hallucination when retrieved captions contain entities irrelevant to the inputs. We introduce the concept of negative entities (objects that appear in the generated caption but are absent from the input) and propose Negative Entity Suppression (NES) to tackle this challenge. NES seamlessly integrates three stages: (1) it employs synthetic images to ensure consistent image-to-text retrieval across both training and inference; (2) it filters negative entities from retrieved content to enhance accuracy; and (3) it applies attention-level suppression using identified negative entities to further minimize the impact of hallucination-prone features. Evaluation across multiple benchmarks demonstrates that NES maintains competitive in-domain performance while improving cross-domain transfer and reducing hallucination rates, achieving new state-of-the-art results in ZIC. Our code is available at https://github.com/nidongpinyinme/NESCap.
[82] SPEED-Q: Staged Processing with Enhanced Distillation towards Efficient Low-bit On-device VLM Quantization
Tianyu Guo,Shanwei Zhao,Shiai Zhu,Chenguang Ma
Main category: cs.CV
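TL;DR: This paper proposes SPEED-Q, a low-bit weight-only quantization framework for small billion-parameter vision-language models (VLMs). It addresses the differing quantization sensitivity across modalities and the instability of low-precision training, and clearly outperforms existing methods under 2-bit and 4-bit settings.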
Details
Motivation: Deploying VLMs on edge devices is constrained by limited resources. Existing quantization work has paid little attention to VLMs in the 1B to 2B parameter range and offers little effective support for low-bit quantization, so an efficient and stable quantization scheme suited to multimodal models is needed.
Method: The SPEED-Q framework includes a staged sensitivity adaptive mechanism that reconciles the quantization differences between the vision and language components, combined with a distillation-enhanced strategy that stabilizes training and reduces data dependence, enabling weight quantization of the full model.
Result: Across multiple benchmarks, SPEED-Q achieves up to 6x higher accuracy than existing methods under 2-bit settings and outperforms prior on-device VLM approaches under both 2-bit and 4-bit settings.
Conclusion: SPEED-Q is the first framework to support low-bit quantization of entire small billion-parameter VLMs, delivering accurate, stable, and data-efficient quantization and advancing the practical deployment of complex VLMs on edge devices.
Abstract: Deploying Vision-Language Models (VLMs) on edge devices (e.g., smartphones and robots) is crucial for enabling low-latency and privacy-preserving intelligent applications. Given the resource constraints of these devices, quantization offers a promising solution by improving memory efficiency and reducing bandwidth requirements, thereby facilitating the deployment of VLMs. However, existing research has rarely explored aggressive quantization on VLMs, particularly for the models ranging from 1B to 2B parameters, which are more suitable for resource-constrained edge devices. In this paper, we propose SPEED-Q, a novel Staged Processing with Enhanced Distillation framework for VLM low-bit weight-only quantization that systematically addresses the following two critical obstacles: (1) significant discrepancies in quantization sensitivity between vision (ViT) and language (LLM) components in VLMs; (2) training instability arising from the reduced numerical precision inherent in low-bit quantization. In SPEED-Q, a staged sensitivity adaptive mechanism is introduced to effectively harmonize performance across different modalities. We further propose a distillation-enhanced quantization strategy to stabilize the training process and reduce data dependence. Together, SPEED-Q enables accurate, stable, and data-efficient quantization of complex VLMs. SPEED-Q is the first framework tailored for quantizing entire small-scale billion-parameter VLMs to low bits.
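To make "low-bit weight-only quantization" concrete, the following sketch shows plain min-max uniform quantization of one weight group. It is a generic baseline, not SPEED-Q itself: the staged sensitivity adaptation and distillation described above are not modeled, and the function names, the group of weights, and the choice of asymmetric min-max scaling are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def quantize_group(weights: Sequence[float], bits: int) -> Tuple[List[int], float, float]:
    """Asymmetric min-max uniform quantization of one weight group.

    Returns integer codes in [0, 2**bits - 1] plus the (scale, offset)
    needed to map the codes back to floating point.
    """
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: Sequence[int], scale: float, offset: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + offset for c in codes]


def max_abs_error(weights: Sequence[float], bits: int) -> float:
    """Largest reconstruction error after a quantize/dequantize round trip."""
    codes, scale, offset = quantize_group(weights, bits)
    restored = dequantize_group(codes, scale, offset)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical weight group: with 2 bits only four distinct values survive.
group = [-0.9, -0.3, 0.3, 0.9]
codes_2bit, scale_2bit, offset_2bit = quantize_group(group, bits=2)

# A less regular group shows the coarser 2-bit grid versus the 4-bit grid.
irregular = [0.05, -0.42, 0.31, 0.77, -0.64, 0.12, -0.08, 0.5]
error_2bit = max_abs_error(irregular, bits=2)
error_4bit = max_abs_error(irregular, bits=4)
```

The round-trip error of this scheme is bounded by half the step size, and the step size grows roughly fivefold when going from 4 bits (15 intervals) to 2 bits (3 intervals), which gives a sense of why aggressive settings are harder to train stably.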
Extensive experiments across multiple benchmarks demonstrate that SPEED-Q achieves up to 6x higher accuracy than existing quantization methods under 2-bit settings and consistently outperforms prior on-device VLMs under both 2-bit and 4-bit settings. Our code and models are available at https://github.com/antgroup/SPEED-Q.
[83] Machines Serve Human: A Novel Variable Human-machine Collaborative Compression Framework
Zifu Zhang,Shengxi Li,Xiancheng Sun,Mai Xu,Zhengyuan Liu,Jingyuan Xia
Main category: cs.CV
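TL;DR: This paper proposes Diff-FCHM, a new human-machine collaborative image/video compression method oriented toward machine vision. Taking machine vision as the basis, it progressively aggregates semantic information and uses a diffusion prior to restore the high-fidelity details needed for human vision, achieving superior compression performance on both machine and human vision tasks.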
Details
Motivation: Existing collaborative compression methods built on the human-vision compression pipeline suffer from complexity and bit-rate redundancy when handling machine vision tasks. Machine vision attends only to the core regions of an image and needs less information, so a new compression paradigm oriented toward machine vision is needed.
Method: Diff-FCHM adopts a machine-vision-oriented compression framework, takes machine vision as the basis, and includes a plug-and-play variable bit-rate strategy. It progressively aggregates the semantics from machine-vision compression and uses a diffusion prior to reconstruct high-fidelity details suited to human vision.
Result: Experiments show that Diff-FCHM clearly outperforms existing methods on machine vision tasks (such as object detection and semantic segmentation) and on human visual quality (such as PSNR and LPIPS), combining low bit-rate with high performance.
Conclusion: The paper confirms the feasibility and advantages of human-machine collaborative compression built on machine vision and offers a new direction for efficient joint compression systems.
Abstract: Human-machine collaborative compression has been receiving increasing research efforts for reducing image/video data, serving as the basis for both human perception and machine intelligence. Existing collaborative methods are dominantly built upon the de facto human-vision compression pipeline, and fall short in complexity and bit-rate when incorporating machine-vision compression. Indeed, machine vision solely focuses on the core regions within the image/video, requiring much less information compared with the compressed information for human vision. In this paper, we thus make the first successful attempt at a novel collaborative compression method based on machine-vision-oriented compression, instead of the human-vision pipeline. In other words, machine vision serves as the basis for human vision within collaborative compression. A plug-and-play variable bit-rate strategy is also developed for machine vision tasks. Then, we propose to progressively aggregate the semantics from the machine-vision compression, whilst seamlessly tailing the diffusion prior to restore high-fidelity details for human vision, hence the name diffusion-prior based feature compression for human and machine visions (Diff-FCHM). Experimental results verify the consistently superior performance of our Diff-FCHM on both machine-vision and human-vision compression with remarkable margins. Our code will be released upon acceptance.
[84] From Structure to Detail: Hierarchical Distillation for Efficient Diffusion Model
Hanbo Cheng,Peng Wang,Kaixiang Lei,Qi Li,Zhen Zou,Pengfei Hu,Jun Du
Main category: cs.CV
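TL;DR: Proposes a Hierarchical Distillation (HD) framework that combines the strengths of trajectory distillation and distribution distillation and introduces an Adaptive Weighted Discriminator (AWD), yielding a high-quality single-step diffusion model that reaches state-of-the-art results on FID and other metrics.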
Details
Motivation: Existing distillation methods for diffusion models trade generation quality against training stability: trajectory distillation preserves structure but loses detail, while distribution distillation is prone to mode collapse. A method is needed that keeps both structural integrity and high-fidelity detail.
Method: The Hierarchical Distillation (HD) framework first uses trajectory distillation to produce a structural "sketch" as initialization and then refines it with distribution distillation. An Adaptive Weighted Discriminator (AWD) dynamically focuses on local imperfections to improve detail recovery.
Result: On ImageNet 256×256 the single-step model reaches an FID of 2.26, close to its 250-step teacher. It also performs well on the MJHQ text-to-image benchmark, supporting the method's generality and high-fidelity generation.
Conclusion: HD effectively fuses the advantages of the two families of distillation and, combined with AWD, achieves stable, high-quality single-step generation, offering a new paradigm for high-fidelity real-time diffusion models.
Abstract: The inference latency of diffusion models remains a critical barrier to their real-time application. While trajectory-based and distribution-based step distillation methods offer solutions, they present a fundamental trade-off. Trajectory-based methods preserve global structure but act as a "lossy compressor", sacrificing high-frequency details. Conversely, distribution-based methods can achieve higher fidelity but often suffer from mode collapse and unstable training. This paper recasts them from independent paradigms into synergistic components within our novel Hierarchical Distillation (HD) framework. We leverage trajectory distillation not as a final generator, but to establish a structural "sketch", providing a near-optimal initialization for the subsequent distribution-based refinement stage. This strategy yields an ideal initial distribution that enhances the ceiling of overall performance. To further improve quality, we introduce and refine the adversarial training process. We find standard discriminator structures are ineffective at refining an already high-quality generator. To overcome this, we introduce the Adaptive Weighted Discriminator (AWD), tailored for the HD pipeline. By dynamically allocating token weights, AWD focuses on local imperfections, enabling efficient detail refinement. Our approach demonstrates state-of-the-art performance across diverse tasks. On ImageNet $256\times256$, our single-step model achieves an FID of 2.26, rivaling its 250-step teacher. It also achieves promising results on the high-resolution text-to-image MJHQ benchmark, proving its generalizability.
Our method establishes a robust new paradigm for high-fidelity, single-step diffusion models.
[85] Boosting Adversarial Transferability via Ensemble Non-Attention
Yipeng Zou,Qin Liu,Jie Wu,Yu Peng,Guo Chen,Hui Zhou,Guanghui Ye
Main category: cs.CV
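TL;DR: This paper proposes NAMEA, a new ensemble attack that is the first to use gradients from the non-attention areas of ensemble models to improve cross-architecture adversarial transferability. It fuses attention-area and non-attention-area gradients through meta-learning and clearly outperforms existing methods on ImageNet.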
Details
Motivation: Existing ensemble attacks transfer poorly across heterogeneous models, mainly because the gradient update directions of models with different architectures differ widely, making it hard to reduce gradient variance while exploiting the strengths of each model.
Method: NAMEA decouples and then fuses the gradients of attention and non-attention areas, merging them via meta-learning. It is the first to bring non-attention-area gradients into the iterative optimization process.
Result: Experiments on ImageNet show that NAMEA improves the attack success rate over the state-of-the-art AdaEA and SMER by an average of 15.0% and 9.6%, respectively.
Conclusion: NAMEA effectively improves the transferability of adversarial examples across architectures. It is the first work to explore the role of non-attention areas in ensemble attacks and offers new insight into building efficient ensemble attacks.
Abstract: Ensemble attacks integrate the outputs of surrogate models with diverse architectures, which can be combined with various gradient-based attacks to improve adversarial transferability. However, previous work shows unsatisfactory attack performance when transferring across heterogeneous model architectures. The main reason is that the gradient update directions of heterogeneous surrogate models differ widely, making it hard to reduce the gradient variance of ensemble models while making the best of individual model. To tackle this challenge, we design a novel ensemble attack, NAMEA, which for the first time integrates the gradients from the non-attention areas of ensemble models into the iterative gradient optimization process. Our design is inspired by the observation that the attention areas of heterogeneous models vary sharply, thus the non-attention areas of ViTs are likely to be the focus of CNNs and vice versa. Therefore, we merge the gradients respectively from the attention and non-attention areas of ensemble models so as to fuse the transfer information of CNNs and ViTs. Specifically, we pioneer a new way of decoupling the gradients of non-attention areas from those of attention areas, while merging gradients by meta-learning. Empirical evaluations on ImageNet dataset indicate that NAMEA outperforms AdaEA and SMER, the state-of-the-art ensemble attacks by an average of 15.0% and 9.6%, respectively. This work is the first attempt to explore the power of ensemble non-attention in boosting cross-architecture transferability, providing new insights into launching ensemble attacks.
[86] Neural B-frame Video Compression with Bi-directional Reference Harmonization
Yuxi Liu,Dengchao Jin,Shuai Huo,Jiawen Gu,Chao Zhou,Huihui Bai,Ming Lu,Zhan Ma
Main category: cs.CV
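TL;DR: Proposes BRHVC, a new neural B-frame video compression method that optimizes the use of bi-directional reference frames through Bi-directional Motion Converge (BMC) and Bi-directional Contextual Fusion (BCF). It markedly improves compression performance, surpassing existing state-of-the-art methods and even the traditional VTM-RA codec.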
Details
Motivation: In neural B-frame video compression (NBVC), hierarchical coding leads to unbalanced contributions from the two reference frames, which hampers continuous temporal prediction. The use of bi-directional reference information therefore needs to be optimized.
Method: BRHVC comprises a BMC module, which aggregates multiple optical flows for accurate motion compensation over a larger range, and a BCF module, which explicitly models the weights of reference contexts according to motion compensation accuracy.
Result: Experiments show that BRHVC outperforms existing state-of-the-art neural video compression methods on the HEVC datasets and exceeds the traditional VTM-RA codec under the random access configuration.
Conclusion: BRHVC effectively addresses the unbalanced use of bi-directional reference information in NBVC, achieving more efficient motion and context modeling and clearly better compression performance.
Abstract: Neural video compression (NVC) has made significant progress in recent years, while neural B-frame video compression (NBVC) remains underexplored compared to P-frame compression. NBVC can adopt bi-directional reference frames for better compression performance. However, NBVC's hierarchical coding may complicate continuous temporal prediction, especially at some hierarchical levels with a large frame span, which could cause the contribution of the two reference frames to be unbalanced. To optimize reference information utilization, we propose a novel NBVC method, termed Bi-directional Reference Harmonization Video Compression (BRHVC), with the proposed Bi-directional Motion Converge (BMC) and Bi-directional Contextual Fusion (BCF). BMC converges multiple optical flows in motion compression, leading to more accurate motion compensation on a larger scale. Then BCF explicitly models the weights of reference contexts under the guidance of motion compensation accuracy. With more efficient motions and contexts, BRHVC can effectively harmonize bi-directional references. Experimental results indicate that our BRHVC outperforms previous state-of-the-art NVC methods, even surpassing the traditional coding, VTM-RA (under random access configuration), on the HEVC datasets. The source code is released at https://github.com/kwai/NVC.
[87] FGM-HD: Boosting Generation Diversity of Fractal Generative Models through Hausdorff Dimension Induction
Haowei Zhang,Yuanpei Zhao,Jizhe Zhou,Mao Li
Main category: cs.CV
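TL;DR: This paper proposes a new approach based on the Hausdorff Dimension (HD) to increase the generation diversity of Fractal Generative Models (FGMs) while keeping visual quality high. With a learnable HD estimator, a momentum-driven training schedule, and HD-guided rejection sampling, it raises output diversity by 39% without sacrificing image quality.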
Details
Motivation: FGMs generate high-quality images efficiently, but their inherent self-similarity limits output diversity. Improving diversity while preserving visual quality calls for a theoretical tool that can quantify structural complexity.
Method: The HD-based enhancement framework for FGMs (FGM-HD) has three parts: 1) a learnable HD estimation module that predicts HD directly from image embeddings; 2) an HD loss with a monotonic momentum-driven scheduling strategy that progressively optimizes the hyperparameters during training; 3) HD-guided rejection sampling at inference to select geometrically richer outputs.
Result: Experiments on ImageNet show that, compared with the vanilla FGM, the method increases output diversity by 39% at comparable image quality. They also support the effectiveness of HD as a diversity measure and the theoretical contribution.
Conclusion: This is the first work to introduce the Hausdorff Dimension into FGMs. FGM-HD effectively addresses the trade-off between generation diversity and visual quality, improving practical performance and offering a new theoretical perspective for FGM development.
Abstract: Improving the diversity of generated results while maintaining high visual quality remains a significant challenge in image generation tasks. Fractal Generative Models (FGMs) are efficient in generating high-quality images, but their inherent self-similarity limits the diversity of output images. To address this issue, we propose a novel approach based on the Hausdorff Dimension (HD), a widely recognized concept in fractal geometry used to quantify structural complexity, which aids in enhancing the diversity of generated outputs. To incorporate HD into FGM, we propose a learnable HD estimation method that predicts HD directly from image embeddings, addressing computational cost concerns. However, simply introducing HD into a hybrid loss is insufficient to enhance diversity in FGMs due to: 1) degradation of image quality, and 2) limited improvement in generation diversity. To this end, during training, we adopt an HD-based loss with a monotonic momentum-driven scheduling strategy to progressively optimize the hyperparameters, obtaining optimal diversity without sacrificing visual quality. Moreover, during inference, we employ HD-guided rejection sampling to select geometrically richer outputs. Extensive experiments on the ImageNet dataset demonstrate that our FGM-HD framework yields a 39% improvement in output diversity compared to vanilla FGMs, while preserving comparable image quality. To our knowledge, this is the very first work introducing HD into FGM.
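The HD-guided rejection sampling step mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: the paper predicts HD from image embeddings with a learnable estimator, whereas the code below substitutes a classical box-counting estimate on binary masks as a stand-in. The function names, the box sizes, and the threshold are all hypothetical.

```python
import numpy as np


def box_counting_dimension(mask, sizes=(1, 2, 4, 8, 16)):
    """Estimate the box-counting dimension of a 2D binary mask.

    Assumption: box counting is used here only as a simple proxy for the
    learned HD estimator described in the abstract.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 0.0  # an empty mask has no structure to measure
    h, w = mask.shape
    counts = []
    for s in sizes:
        # Crop so the mask tiles exactly into s-by-s boxes.
        hh, ww = (h // s) * s, (w // s) * s
        blocks = mask[:hh, :ww].reshape(hh // s, s, ww // s, s)
        # Count the boxes that contain at least one foreground pixel.
        counts.append(blocks.any(axis=(1, 3)).sum())
    # The dimension is the slope of log N(s) against log(1 / s).
    log_inv_size = np.log(1.0 / np.asarray(sizes, dtype=float))
    log_counts = np.log(np.asarray(counts, dtype=float))
    slope, _ = np.polyfit(log_inv_size, log_counts, 1)
    return float(slope)


def hd_guided_rejection(candidates, threshold):
    """Keep only candidates whose estimated dimension reaches the threshold."""
    return [c for c in candidates if box_counting_dimension(c) >= threshold]


if __name__ == "__main__":
    filled = np.ones((64, 64), dtype=bool)    # a filled square
    line = np.zeros((64, 64), dtype=bool)
    line[32, :] = True                        # a single horizontal line
    kept = hd_guided_rejection([filled, line], threshold=1.5)
    print(len(kept), box_counting_dimension(filled), box_counting_dimension(line))
```

In this toy setting the filled square scores close to 2 and the line close to 1, so a threshold between them keeps only the geometrically richer candidate. A real pipeline would score generated images with the learned estimator instead.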
Our method effectively enhances the diversity of generated outputs while offering a principled theoretical contribution to FGM development.
[88] AuthSig: Safeguarding Scanned Signatures Against Unauthorized Reuse in Paperless Workflows
RuiQiang Zhang,Zehua Ma,Guanjie Wang,Chang Liu,Hengyi Wang,Weiming Zhang
Main category: cs.CV
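TL;DR: This paper proposes AuthSig, a new static electronic signature framework based on generative models and watermarking. It implicitly embeds authentication information in the signature image to enforce a "One Signature, One Use" policy.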
Details
Motivation: As paperless workflows spread, static scanned signatures remain widely used because they are convenient, but they lack effective authentication and are easy to copy and misuse. A reliable authentication mechanism is needed.
Method: A generative model finely modulates style embeddings during signature generation and, exploiting the human eye's insensitivity to subtle style variations, implicitly encodes watermark information in the signature image. A keypoint-driven data augmentation strategy increases the style diversity of handwritten signature data to support robust watermark embedding.
Result: Experiments show that AuthSig achieves over 98% watermark extraction accuracy under digital-domain distortions and signature-specific degradations, and remains effective in print-scan scenarios.
Conclusion: AuthSig offers an efficient and secure authentication scheme for static electronic signatures, with highly robust watermark embedding and extraction that effectively prevents illegal copying and reuse of signatures.
Abstract: With the deepening trend of paperless workflows, signatures as a means of identity authentication are gradually shifting from traditional ink-on-paper to electronic formats. Despite the availability of dynamic pressure-sensitive and PKI-based digital signatures, static scanned signatures remain prevalent in practice due to their convenience. However, these static images, having almost lost their authentication attributes, cannot be reliably verified and are vulnerable to malicious copying and reuse. To address these issues, we propose AuthSig, a novel static electronic signature framework based on generative models and watermark, which binds authentication information to the signature image. Leveraging the human visual system's insensitivity to subtle style variations, AuthSig finely modulates style embeddings during generation to implicitly encode watermark bits, enforcing a One Signature, One Use policy. To overcome the scarcity of handwritten signature data and the limitations of traditional augmentation methods, we introduce a keypoint-driven data augmentation strategy that effectively enhances style diversity to support robust watermark embedding. Experimental results show that AuthSig achieves over 98% extraction accuracy under both digital-domain distortions and signature-specific degradations, and remains effective even in print-scan scenarios.
[89] Efficient and Effective In-context Demonstration Selection with Coreset
Zihua Wang,Jiarui Wang,Haiyang Xu,Ming Yan,Fei Huang,Xu Yang,Xiu-Shen Wei,Siya Mi,Yu Zhang
Main category: cs.CV
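TL;DR: This paper proposes Coreset-based Dual Retrieval (CoDR), a new framework for improving in-context learning (ICL) in large visual language models. It builds a diverse coreset and uses a dual retrieval mechanism to select demonstrations efficiently and effectively.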
Details
Motivation: Traditional demonstration selection methods struggle to balance efficiency and effectiveness, and the underlying selection problem is NP-hard, so a better solution is needed.
Method: A coreset-based cluster-pruning method builds a diverse subset, and a dual retrieval mechanism selects demonstrations in a way that combines global optimization with efficiency.
Result: Experiments show that the method clearly outperforms existing strategies in ICL performance while selecting demonstrations more efficiently.
Conclusion: CoDR provides an effective and efficient demonstration selection scheme for in-context learning in large visual language models, with good application potential.
Abstract: In-context learning (ICL) has emerged as a powerful paradigm for Large Visual Language Models (LVLMs), enabling them to leverage a few examples directly from input contexts. However, the effectiveness of this approach is heavily reliant on the selection of demonstrations, a process that is NP-hard. Traditional strategies, including random, similarity-based sampling and infoscore-based sampling, often lead to inefficiencies or suboptimal performance, struggling to balance both efficiency and effectiveness in demonstration selection. In this paper, we propose a novel demonstration selection framework named Coreset-based Dual Retrieval (CoDR). We show that samples within a diverse subset achieve a higher expected mutual information. To implement this, we introduce a cluster-pruning method to construct a diverse coreset that aligns more effectively with the query while maintaining diversity. Additionally, we develop a dual retrieval mechanism that enhances the selection process by achieving global demonstration selection while preserving efficiency. Experimental results demonstrate that our method significantly improves the ICL performance compared to the existing strategies, providing a robust solution for effective and efficient demonstration selection.
[90] WDT-MD: Wavelet Diffusion Transformers for Microaneurysm Detection in Fundus Images
Yifei Sun,Yuzhi He,Junhao Jia,Jinhong Wang,Ruiquan Ge,Changmiao Wang,Hongxia Xu
Main category: cs.CV
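TL;DR: Proposes WDT-MD, a microaneurysm detection framework based on a wavelet diffusion Transformer. Through a noise-encoded conditioning mechanism, pseudo-normal pattern synthesis, and multi-scale wavelet analysis, it addresses identity mapping, high false positives, and weak reconstruction of normal features when diffusion models are used to detect this early marker of diabetic retinopathy, and it outperforms existing methods on the IDRiD and e-ophtha datasets.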
Details
Motivation: Microaneurysms (MAs) are an early sign of diabetic retinopathy. Because they are small and vary in appearance, manual screening is time-consuming and error-prone. Existing diffusion models for automated detection suffer from identity mapping, difficulty separating MAs from other anomalies, and poor reconstruction of normal structures, which limits clinical use.
Method: The WDT-MD framework has three parts: 1) a noise-encoded image conditioning mechanism that prevents identity mapping during training; 2) pseudo-normal images generated by inpainting to provide pixel-level supervision and better separate MAs from other anomalies; 3) a multi-scale architecture that combines a diffusion Transformer with wavelet analysis to improve reconstruction of normal retinal features.
Result: Experiments on the IDRiD and e-ophtha MA datasets show that WDT-MD outperforms current state-of-the-art methods in both pixel-level and image-level detection, markedly lowering the false positive rate and improving detection accuracy.
Conclusion: WDT-MD effectively improves automated microaneurysm detection, addresses key weaknesses of diffusion models, and has the potential to advance clinical use of early diabetic retinopathy screening.
Abstract: Microaneurysms (MAs), the earliest pathognomonic signs of Diabetic Retinopathy (DR), present as sub-60 $\mu$m lesions in fundus images with highly variable photometric and morphological characteristics, rendering manual screening not only labor-intensive but inherently error-prone. While diffusion-based anomaly detection has emerged as a promising approach for automated MA screening, its clinical application is hindered by three fundamental limitations. First, these models often fall prey to "identity mapping", where they inadvertently replicate the input image. Second, they struggle to distinguish MAs from other anomalies, leading to high false positives. Third, their suboptimal reconstruction of normal features hampers overall performance. To address these challenges, we propose a Wavelet Diffusion Transformer framework for MA Detection (WDT-MD), which features three key innovations: a noise-encoded image conditioning mechanism to avoid "identity mapping" by perturbing image conditions during training; pseudo-normal pattern synthesis via inpainting to introduce pixel-level supervision, enabling discrimination between MAs and other anomalies; and a wavelet diffusion Transformer architecture that combines the global modeling capability of diffusion Transformers with multi-scale wavelet analysis to enhance reconstruction of normal retinal features. Comprehensive experiments on the IDRiD and e-ophtha MA datasets demonstrate that WDT-MD outperforms state-of-the-art methods in both pixel-level and image-level MA detection.
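To make "multi-scale wavelet analysis" concrete, the sketch below decomposes an image with a 2D Haar transform and repeats the step on the low-frequency band. It is only an illustration under stated assumptions: the abstract does not say which wavelet WDT-MD uses or how the sub-bands enter the Transformer, so the Haar choice and the function names `haar_dwt2`, `haar_idwt2`, and `haar_pyramid` are hypothetical.

```python
import numpy as np


def haar_dwt2(x):
    """One level of an orthonormal 2D Haar transform.

    Expects even height and width. Returns the low-frequency band followed
    by three detail bands (row difference, column difference, diagonal).
    """
    x = np.asarray(x, dtype=float)
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2              # local average (coarse structure)
    lh = (a + b - c - d) / 2              # top rows minus bottom rows
    hl = (a - b + c - d) / 2              # left columns minus right columns
    hh = (a - b - c + d) / 2              # diagonal detail
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Invert haar_dwt2 exactly."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x


def haar_pyramid(x, levels):
    """Repeat the transform on the low band to get a multi-scale pyramid."""
    details = []
    low = np.asarray(x, dtype=float)
    for _ in range(levels):
        low, lh, hl, hh = haar_dwt2(low)
        details.append((lh, hl, hh))
    return low, details


if __name__ == "__main__":
    image = np.arange(64, dtype=float).reshape(8, 8)
    low, details = haar_pyramid(image, levels=2)
    print(low.shape, [band[0].shape for band in details])
```

Small, localized structures such as microaneurysms mostly show up in the detail bands at fine scales, while the low band carries the smooth background. That separation is the usual motivation for combining wavelets with a global model.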
This advancement holds significant promise for improving early DR screening.
[91] An ICTM-RMSAV Framework for Bias-Field Aware Image Segmentation under Poisson and Multiplicative Noise
Xinyu Wang,Wenjun Yao,Fanghui Song,Zhichang Guo
Main category: cs.CV
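TL;DR: Proposes a variational segmentation model with integrated denoising terms for images corrupted by Gamma-distributed multiplicative noise and Poisson noise, showing superior accuracy and robustness under intensity inhomogeneity.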
Details
Motivation: Existing image segmentation methods degrade under heavy noise and intensity inhomogeneity, so a more robust model is needed.
Method: Within the iterative-convolution thresholding method (ICTM) framework, the model adds an I-divergence term and adaptive total-variation regularization for denoising, uses a gray-level indicator to guide spatially adaptive weights, and estimates a smooth bias field to correct intensity inhomogeneity.
Result: Experiments on synthetic and real images show that the method outperforms existing approaches across several noise types and under intensity inhomogeneity.
Conclusion: The proposed model effectively improves segmentation accuracy and robustness in the presence of noise and intensity inhomogeneity.
Abstract: Image segmentation is a core task in image processing, yet many methods degrade when images are heavily corrupted by noise and exhibit intensity inhomogeneity. Within the iterative-convolution thresholding method (ICTM) framework, we propose a variational segmentation model that integrates denoising terms. Specifically, the denoising component consists of an I-divergence term and an adaptive total-variation (TV) regularizer, making the model well suited to images contaminated by Gamma-distributed multiplicative noise and Poisson noise. A spatially adaptive weight derived from a gray-level indicator guides diffusion differently across regions of varying intensity. To further address intensity inhomogeneity, we estimate a smoothly varying bias field, which improves segmentation accuracy. Regions are represented by characteristic functions, with contour length encoded accordingly. For efficient optimization, we couple ICTM with a relaxed modified scalar auxiliary variable (RMSAV) scheme. Extensive experiments on synthetic and real-world images with intensity inhomogeneity and diverse noise types show that the proposed model achieves superior accuracy and robustness compared with competing approaches.
[92] T-Rex-Omni: Integrating Negative Visual Prompt in Generic Object Detection
Jiazhou Zhou,Qing Jiang,Kanghao Chen,Lutao Jiang,Yuanhuiyi Lyu,Ying-Cong Chen,Lei Zhang
Main category: cs.CV
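TL;DR: This paper proposes the T-Rex-Omni framework, which introduces negative visual prompts to suppress hard negative distractors and improve open-set object detection.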
Details
Motivation: Existing open-set detectors rely only on positive prompts, are easily misled by visually similar but semantically different distractors, and make no use of negative information.
Method: A unified visual prompt encoder is combined with a training-free NNC module that dynamically suppresses negative responses at the probability computing stage, and an NNH loss is used during fine-tuning to sharpen the separation between positive and negative samples.
Result: The method performs strongly in zero-shot detection, significantly narrows the gap between visual-prompted and text-prompted methods, and reaches 51.2 AP_r on LVIS-minival in long-tailed scenarios.
Conclusion: Negative prompts are a key new dimension for advancing open-set visual recognition systems.
Abstract: Object detection methods have evolved from closed-set to open-set paradigms over the years. Current open-set object detectors, however, remain constrained by their exclusive reliance on positive indicators based on given prompts like text descriptions or visual exemplars. This positive-only paradigm experiences consistent vulnerability to visually similar but semantically different distractors. We propose T-Rex-Omni, a novel framework that addresses this limitation by incorporating negative visual prompts to negate hard negative distractors. Specifically, we first introduce a unified visual prompt encoder that jointly processes positive and negative visual prompts. Next, a training-free Negating Negative Computing (NNC) module is proposed to dynamically suppress negative responses during the probability computing stage. To further boost performance through fine-tuning, our Negating Negative Hinge (NNH) loss enforces discriminative margins between positive and negative embeddings. T-Rex-Omni supports flexible deployment in both positive-only and joint positive-negative inference modes, accommodating either user-specified or automatically generated negative examples. Extensive experiments demonstrate remarkable zero-shot detection performance, significantly narrowing the performance gap between visual-prompted and text-prompted methods while showing particular strength in long-tailed scenarios (51.2 AP_r on LVIS-minival). This work establishes negative prompts as a crucial new dimension for advancing open-set visual recognition systems.
[93] Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs
Liu Yu,Zhonghao Chen,Ping Kuang,Zhikun Feng,Fan Zhou,Lan Wang,Gillian Dobbie
Main category: cs.CV
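TL;DR: This paper proposes the Owl framework, which mitigates object hallucination in large vision-language models through bi-modal attention reweighting and a contrastive decoding strategy.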
Details
Motivation: Existing methods usually regulate visual or textual attention independently and overlook how their interaction affects hallucination. Their joint effect needs to be modeled from a causal perspective.
Method: The authors build a structural causal graph, propose the VTACR metric to quantify the ratio of modality contributions, and design a fine-grained attention intervention mechanism driven by this signal together with a dual-path contrastive decoding strategy.
Result: On the POPE and CHAIR benchmarks the method markedly lowers the hallucination rate and reaches state-of-the-art faithfulness while preserving the model's understanding capability.
Conclusion: Through causally driven bi-modal attention control, Owl effectively suppresses hallucinations caused by dominant textual priors and improves the visual grounding of LVLMs.
Abstract: Object hallucination remains a critical challenge in Large Vision-Language Models (LVLMs), where models generate content inconsistent with visual inputs. Existing language-decoder based mitigation approaches often regulate visual or textual attention independently, overlooking their interaction as two key causal factors. To address this, we propose Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation), a causally-grounded framework that models hallucination process via a structural causal graph, treating decomposed visual and textual attentions as mediators. We introduce VTACR (Visual-to-Textual Attention Contribution Ratio), a novel metric that quantifies the modality contribution imbalance during decoding. Our analysis reveals that hallucinations frequently occur in low-VTACR scenarios, where textual priors dominate and visual grounding is weakened. To mitigate this, we design a fine-grained attention intervention mechanism that dynamically adjusts token- and layer-wise attention guided by VTACR signals. Finally, we propose a dual-path contrastive decoding strategy: one path emphasizes visually grounded predictions, while the other amplifies hallucinated ones, letting visual truth shine and hallucination collapse. Experimental results on the POPE and CHAIR benchmarks show that Owl achieves significant hallucination reduction, setting a new SOTA in faithfulness while preserving vision-language understanding capability. Our code is available at https://github.com/CikZ2023/OWL
[94] Dense Cross-Scale Image Alignment With Fully Spatial Correlation and Just Noticeable Difference Guidance
Jinkun You,Jiaxue Li,Jie Zhang,Yicong Zhou
Main category: cs.CV
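TL;DR: Proposes a dense cross-scale image alignment model that lowers alignment difficulty by exploiting correlations between cross-scale features, and adds a fully spatial correlation module and a just-noticeable-difference mechanism to raise accuracy while reducing computational cost.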
Details
Motivation: Existing unsupervised image alignment methods suffer from low accuracy and high computational complexity.
Method: A dense cross-scale image alignment model combines cross-scale feature correlations with a fully spatial correlation module and uses the just noticeable difference (JND) to refine alignment.
Result: Experiments show that the method outperforms existing state-of-the-art approaches in both quantitative and qualitative evaluation and supports a flexible trade-off between accuracy and efficiency.
Conclusion: The proposed method effectively improves the accuracy and efficiency of unsupervised image alignment and has good application prospects.
Abstract: Existing unsupervised image alignment methods exhibit limited accuracy and high computational complexity. To address these challenges, we propose a dense cross-scale image alignment model. It takes into account the correlations between cross-scale features to decrease the alignment difficulty. Our model supports flexible trade-offs between accuracy and efficiency by adjusting the number of scales utilized. Additionally, we introduce a fully spatial correlation module to further improve accuracy while maintaining low computational costs. We incorporate the just noticeable difference to encourage our model to focus on image regions more sensitive to distortions, eliminating noticeable alignment errors. Extensive quantitative and qualitative experiments demonstrate that our method surpasses state-of-the-art approaches.
[95] USF-Net: A Unified Spatiotemporal Fusion Network for Ground-Based Remote Sensing Cloud Image Sequence Extrapolation
Penghui Niu,Taotao Cai,Jiashuai She,Yajuan Zhang,Junhua Gu,Ping Zhang,Jungong Han,Jianxin Li
Main category: cs.CV
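TL;DR: This paper proposes USF-Net, a unified spatiotemporal fusion network for extrapolating ground-based remote sensing cloud image sequences. It combines adaptive large-kernel convolutions with a low-complexity attention mechanism, introduces the new ASI-CIS dataset, and clearly improves the balance between prediction accuracy and computational efficiency.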
Details
Motivation: Existing cloud image extrapolation methods fall short in dynamic feature extraction, modeling of long-range spatiotemporal dependencies, and computational efficiency. A more efficient and accurate model is needed to support photovoltaic power systems.
Method: USF-Net uses SiB (with SSM) and TiB (with TAM) modules in the encoder to capture multi-scale contextual information and long-range temporal dependencies, respectively. A DSM with a TGM provides joint spatiotemporal modeling. The decoder uses a DUM to reduce the ghosting effect and takes the initial temporal state as an attention operator to preserve motion features.
Result: Experiments on the newly released ASI-CIS dataset show that USF-Net outperforms existing state-of-the-art methods in both prediction performance and computational efficiency.
Conclusion: Through adaptive feature extraction and an efficient attention mechanism, USF-Net addresses the key challenges of cloud image extrapolation and provides a high-performance solution for practical photovoltaic applications.
Abstract: Ground-based remote sensing cloud image sequence extrapolation is a key research area in the development of photovoltaic power systems. However, existing approaches exhibit several limitations: (1) they primarily rely on static kernels to augment feature information, lacking adaptive mechanisms to extract features at varying resolutions dynamically; (2) temporal guidance is insufficient, leading to suboptimal modeling of long-range spatiotemporal dependencies; and (3) the quadratic computational cost of attention mechanisms is often overlooked, limiting efficiency in practical deployment. To address these challenges, we propose USF-Net, a Unified Spatiotemporal Fusion Network that integrates adaptive large-kernel convolutions and a low-complexity attention mechanism, combining temporal flow information within an encoder-decoder framework. Specifically, the encoder employs three basic layers to extract features. Followed by the USTM, which comprises: (1) a SiB equipped with a SSM that dynamically captures multi-scale contextual information, and (2) a TiB featuring a TAM that effectively models long-range temporal dependencies while maintaining computational efficiency. In addition, a DSM with a TGM is introduced to enable unified modeling of temporally guided spatiotemporal dependencies. On the decoder side, a DUM is employed to address the common "ghosting effect." It utilizes the initial temporal state as an attention operator to preserve critical motion signatures. As a key contribution, we also introduce and release the ASI-CIS dataset.
Extensive experiments on ASI-CIS demonstrate that USF-Net significantly outperforms state-of-the-art methods, establishing a superior balance between prediction accuracy and computational efficiency for ground-based cloud extrapolation. The dataset and source code will be available at https://github.com/she1110/ASI-CIS.
[96] 4KDehazeFlow: Ultra-High-Definition Image Dehazing via Flow Matching
Xingchi Chen,Pu Wang,Xuerui Li,Chaopeng Li,Juxiang Zhou,Jianhou Gan,Dianjie Lu,Guijuan Zhang,Wenqi Ren,Zhuoran Zheng
Main category: cs.CV
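TL;DR: Proposes 4KDehazeFlow, an ultra-high-definition image dehazing method based on Flow Matching and a Haze-Aware vector field, which achieves efficient, high-quality dehazing through progressive optimization of a continuous vector field flow.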
Details
Motivation: Existing prior-based methods adapt poorly across scenes, while deep learning methods suffer from high computational complexity and color distortion.
Method: Flow Matching is used to model the dehazing process. A learnable 3D lookup table (LUT) encodes the haze transformation parameters, and a fourth-order Runge-Kutta (RK4) ODE solver solves the dehazing flow field stably.
Result: Across multiple experiments the method surpasses seven state-of-the-art methods, improving PSNR by 2 dB and performing better in dense haze and in color fidelity.
Conclusion: 4KDehazeFlow is a general, efficient, and high-quality UHD image dehazing method that is compatible with a range of network architectures and clearly improves dehazing performance and inference efficiency.
Abstract: Ultra-High-Definition (UHD) image dehazing faces challenges such as limited scene adaptability in prior-based methods and high computational complexity with color distortion in deep learning approaches. To address these issues, we propose 4KDehazeFlow, a novel method based on Flow Matching and the Haze-Aware vector field. This method models the dehazing process as a progressive optimization of continuous vector field flow, providing efficient data-driven adaptive nonlinear color transformation for high-quality dehazing. Specifically, our method has the following advantages: 1) 4KDehazeFlow is a general method compatible with various deep learning networks, without relying on any specific network architecture. 2) We propose a learnable 3D lookup table (LUT) that encodes haze transformation parameters into a compact 3D mapping matrix, enabling efficient inference through precomputed mappings. 3) We utilize a fourth-order Runge-Kutta (RK4) ordinary differential equation (ODE) solver to stably solve the dehazing flow field through an accurate step-by-step iterative method, effectively suppressing artifacts. Extensive experiments show that 4KDehazeFlow exceeds seven state-of-the-art methods. It delivers a 2 dB PSNR increase and better performance in dense haze and color fidelity.
[97] PAN: A World Model for General, Interactable, and Long-Horizon World Simulation
PAN Team,Jiannan Xiang,Yi Gu,Zihan Liu,Zeyu Feng,Qiyue Gao,Yiyan Hu,Benhao Huang,Guangyi Liu,Yichi Yang,Kun Zhou,Davit Abrahamyan,Arif Ahmad,Ganesh Bannur,Junrong Chen,Kimi Chen,Mingkai Deng,Ruobing Han,Xinqi Huang,Haoqiang Kang,Zheqi Li,Enze Ma,Hector Ren,Yashowardhan Shinde,Rohan Shingre,Ramsundar Tanikella,Kaiming Tao,Dequan Yang,Xinle Yu,Cong Zeng,Binglin Zhou,Hector Liu,Zhiting Hu,Eric P. Xing
Main category: cs.CV
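TL;DR: PAN is a general, interactable world model that supports long-horizon prediction. It combines a latent dynamics architecture based on a large language model with a video diffusion decoder to produce high-quality video simulation conditioned on history and natural language actions, unifying latent space reasoning with real-world dynamics.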
Details
Motivation: Existing video generation models lack causal control and long-term consistency, while traditional world models are confined to specific domains and generalize poorly, making it hard to support interactive reasoning in open environments.
Method: PAN adopts the Generative Latent Prediction (GLP) architecture: an autoregressive latent dynamics backbone based on a large language model handles language-conditioned action modeling, and a video diffusion decoder generates high-fidelity, temporally coherent visual sequences.
Result: PAN performs well across several tasks, including action-conditioned world simulation, long-horizon forecasting, and simulative reasoning, clearly outperforming existing video generation models and world models.
Conclusion: PAN advances general world models by enabling open-domain, interactable, long-horizon-consistent prediction of future states, providing a reliable basis for agent planning and decision making.
Abstract: A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in the prompt-to-full-video manner without causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture that combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, to achieve a unification between latent space reasoning (imagination) and realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics.
Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.
[98] VietMEAgent: Culturally-Aware Few-Shot Multimodal Explanation for Vietnamese Visual Question Answering
Hai-Dang Nguyen,Minh-Anh Dang,Minh-Tan Le,Minh-Tuan Le
Main category: cs.CV
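TL;DR: This paper proposes VietMEAgent, a multimodal explainable framework for Vietnamese cultural understanding that combines cultural object detection with program generation, and builds a Vietnamese Cultural VQA dataset to improve the interpretability and cultural sensitivity of culture-related visual question answering.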
Details
Motivation: Existing visual question answering systems are limited on culturally specific content because cultural knowledge is under-represented in training data and the reasoning process is not interpretable.
Method: The VietMEAgent framework combines cultural object detection, structured program generation, and a dual-modality explanation module, and draws on a Vietnamese cultural knowledge base for background information.
Result: The method is validated on a self-built Vietnamese Cultural VQA dataset, and the system produces transparent explanations that include both the computational logic and the cultural background.
Conclusion: The method improves the interpretability and cultural sensitivity of culture-related VQA systems and supports education and cultural preservation.
Abstract: Contemporary Visual Question Answering (VQA) systems remain constrained when confronted with culturally specific content, largely because cultural knowledge is under-represented in training corpora and the reasoning process is not rendered interpretable to end users. This paper introduces VietMEAgent, a multimodal explainable framework engineered for Vietnamese cultural understanding. The method integrates a cultural object detection backbone with a structured program generation layer, yielding a pipeline in which answer prediction and explanation are tightly coupled. A curated knowledge base of Vietnamese cultural entities serves as an explicit source of background information, while a dual-modality explanation module combines attention-based visual evidence with structured, human-readable textual rationales. We further construct a Vietnamese Cultural VQA dataset sourced from public repositories and use it to demonstrate the practicality of programming-based methodologies for cultural AI. The resulting system provides transparent explanations that disclose both the computational rationale and the underlying cultural context, supporting education and cultural preservation with an emphasis on interpretability and cultural sensitivity.
[99] Diversifying Counterattacks: Orthogonal Exploration for Robust CLIP Inference
Chengze Jiang,Minjing Dong,Xinli Shi,Jie Gui
Main category: cs.CV
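TL;DR: Proposes Directional Orthogonal Counterattack (DOC), a new method that adds orthogonal gradient directions and momentum updates to increase the diversity and coverage of test-time counterattacks, improving the robustness of vision-language pre-training models against adversarial attacks.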
Details
Motivation: Because of limits in their optimization objective, existing test-time counterattack methods such as TTC generate counterattacks with too little diversity. They tend to overfit specific adversarial patterns and cope poorly with a broad range of adversarial perturbations.
Method: DOC combines orthogonal gradient directions with momentum updates to widen the counterattack search space, and uses a directional sensitivity score based on averaged cosine similarity to adapt the counterattack strength.
Result: Experiments on 16 datasets show that DOC improves adversarial robustness under a variety of attacks while keeping clean accuracy high.
Conclusion: Increasing the diversity and coverage of counterattacks is essential to effective test-time defense, and DOC provides a more general and robust defense mechanism for vision-language models.
Abstract: Vision-language pre-training models (VLPs) demonstrate strong multimodal understanding and zero-shot generalization, yet remain vulnerable to adversarial examples, raising concerns about their reliability. Recent work, Test-Time Counterattack (TTC), improves robustness by generating perturbations that maximize the embedding deviation of adversarial inputs using PGD, pushing them away from their adversarial representations. However, due to the fundamental difference in optimization objectives between adversarial attacks and counterattacks, generating counterattacks solely based on gradients with respect to the adversarial input confines the search to a narrow space. As a result, the counterattacks could overfit limited adversarial patterns and lack the diversity to fully neutralize a broad range of perturbations. In this work, we argue that enhancing the diversity and coverage of counterattacks is crucial to improving adversarial robustness in test-time defense. Accordingly, we propose Directional Orthogonal Counterattack (DOC), which augments counterattack optimization by incorporating orthogonal gradient directions and momentum-based updates. This design expands the exploration of the counterattack space and increases the diversity of perturbations, which facilitates the discovery of more generalizable counterattacks and ultimately improves the ability to neutralize adversarial perturbations. Meanwhile, we present a directional sensitivity score based on averaged cosine similarity to boost DOC by improving example discrimination and adaptively modulating the counterattack strength.
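The ingredients named above (an orthogonal gradient direction, a momentum update, and an averaged cosine similarity) can be sketched in a few lines of NumPy. This is a hedged illustration, not the DOC algorithm itself: the abstract does not give the update rule, so the mixing weight `mix`, the decay factor, the step size, the bound `eps`, and all function names are assumptions, and a random vector stands in for whatever source of exploration directions the authors use.

```python
import numpy as np


def orthogonal_component(direction, grad):
    """Remove from `direction` its projection onto `grad` (Gram-Schmidt step)."""
    g = grad.ravel()
    d = direction.ravel()
    denom = g @ g
    if denom == 0.0:
        return direction.copy()  # nothing to project out of a zero gradient
    return (d - (d @ g) / denom * g).reshape(direction.shape)


def counterattack_step(delta, grad, momentum, rng,
                       step=1 / 255, eps=4 / 255, mix=0.5, decay=0.9):
    """One sign-based update that mixes the gradient with an orthogonal direction.

    All hyperparameters here are hypothetical placeholders.
    """
    noise = rng.standard_normal(grad.shape)
    ortho = orthogonal_component(noise, grad)
    # Normalise both directions so that `mix` controls their relative weight.
    g_unit = grad / (np.linalg.norm(grad) + 1e-12)
    o_unit = ortho / (np.linalg.norm(ortho) + 1e-12)
    momentum = decay * momentum + (g_unit + mix * o_unit)
    # Take a signed step and project back into the eps-ball.
    delta = np.clip(delta + step * np.sign(momentum), -eps, eps)
    return delta, momentum


def mean_cosine_similarity(reference, others):
    """Average cosine similarity between a reference vector and a set of vectors.

    Which vectors the paper's sensitivity score compares is not stated in the
    abstract, so this helper only shows the averaging itself.
    """
    ref = reference.ravel()
    sims = []
    for other in others:
        o = other.ravel()
        sims.append((ref @ o) / (np.linalg.norm(ref) * np.linalg.norm(o) + 1e-12))
    return float(np.mean(sims))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal((3, 4, 4))
    delta, momentum = np.zeros_like(grad), np.zeros_like(grad)
    for _ in range(3):
        delta, momentum = counterattack_step(delta, grad, momentum, rng)
    print(float(np.abs(delta).max()))
```

Projecting out the gradient component leaves the first-order effect of the gradient step unchanged while still moving the perturbation into directions a plain gradient update would never visit, which is one simple way to read "orthogonal exploration".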
Extensive experiments on 16 datasets demonstrate that DOC improves adversarial robustness under various attacks while maintaining competitive clean accuracy. Code is available at https://github.com/bookman233/DOC.[100] Composition-Incremental Learning for Compositional Generalization
Zhen Li,Yuwei Wu,Chenchen Jing,Che Sun,Chuanhao Li,Yunde Jia
Main category: cs.CV
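TL;DR: Proposes Composition-Incremental Learning for Compositional Generalization (CompIL), a framework that progressively improves generalization on compositional zero-shot learning (CZSL) by continually learning new compositions, and builds two new benchmarks, MIT-States-CompIL and C-GQA-CompIL, for evaluation.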
Details
Motivation: Real-world data emerges continually, and the possible compositions are nearly infinite and long-tailed; methods trained on static data cope poorly, so an incremental mechanism that progressively improves compositional generalization is needed.
Method: A pseudo-replay framework uses a visual synthesizer to generate visual representations of learned compositions, together with a linguistic primitive distillation mechanism that keeps primitive representations aligned throughout learning, enabling continual learning of compositional knowledge.
Result: Extensive experiments on the newly built MIT-States-CompIL and C-GQA-CompIL benchmarks validate the effectiveness and superiority of the method in the composition-incremental setting.
Conclusion: The work offers a feasible path for models to keep improving compositional generalization in open, dynamic environments, moving the field from static closed settings toward incremental open ones.
Abstract: Compositional generalization has achieved substantial progress in computer vision on pre-collected training data. Nonetheless, real-world data continually emerges, with possible compositions being nearly infinite, long-tailed, and not entirely visible. Thus, an ideal model is supposed to gradually improve the capability of compositional generalization in an incremental manner. In this paper, we explore Composition-Incremental Learning for Compositional Generalization (CompIL) in the context of the compositional zero-shot learning (CZSL) task, where models need to continually learn new compositions, intending to improve their compositional generalization capability progressively. To quantitatively evaluate CompIL, we develop a benchmark construction pipeline leveraging existing datasets, yielding MIT-States-CompIL and C-GQA-CompIL. Furthermore, we propose a pseudo-replay framework utilizing a visual synthesizer to synthesize visual representations of learned compositions and a linguistic primitive distillation mechanism to maintain aligned primitive representations across the learning process. Extensive experiments demonstrate the effectiveness of the proposed framework.
[101] Ultra-Light Test-Time Adaptation for Vision-Language Models
Byunghyun Kim
Main category: cs.CV
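TL;DR: Proposes UL-TTA, an ultra-light test-time adaptation framework that adapts vision-language models to domain shift by adjusting only logit-level parameters (class prototypes, class priors, and temperature) without updating the backbone, giving low latency, high accuracy, and good calibration.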
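To make the idea of logit-level adaptation concrete, here is a minimal sketch of one plausible reading: confident samples update per-class feature sums and counts, prototypes stay anchored to the text embeddings, and class priors follow a Dirichlet posterior mean. The class name `LogitLevelAdapter` and the values of `kappa`, `alpha`, `temp`, and `tau` are assumptions made for illustration; the decoupled calibration temperature and the drift guards mentioned in this entry are omitted.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

class LogitLevelAdapter:
    """Adapts only class prototypes and class priors; no backbone, no gradients."""

    def __init__(self, text_protos, kappa=10.0, alpha=1.0, temp=0.05, tau=0.8):
        self.text = [normalize(p) for p in text_protos]
        self.sums = [[0.0] * len(p) for p in text_protos]  # per-class feature sums
        self.counts = [0.0] * len(text_protos)             # per-class accepted samples
        self.kappa = kappa  # strength of the text anchor (hypothetical)
        self.alpha = alpha  # symmetric Dirichlet pseudo-count (hypothetical)
        self.temp = temp    # prediction temperature (hypothetical)
        self.tau = tau      # confidence threshold for filtering (hypothetical)

    def prototypes(self):
        # The text prototype acts as kappa pseudo-observations of the class.
        return [normalize([self.kappa * t + s for t, s in zip(tp, sm)])
                for tp, sm in zip(self.text, self.sums)]

    def priors(self):
        # Dirichlet-multinomial posterior mean over classes.
        total = sum(self.counts) + self.alpha * len(self.counts)
        return [(c + self.alpha) / total for c in self.counts]

    def predict(self, feat):
        f = normalize(feat)
        logits = [sum(a * b for a, b in zip(f, p)) / self.temp + math.log(pi)
                  for p, pi in zip(self.prototypes(), self.priors())]
        return softmax(logits)

    def update(self, feat):
        probs = self.predict(feat)
        k = max(range(len(probs)), key=probs.__getitem__)
        if probs[k] >= self.tau:  # learn only from confident predictions
            f = normalize(feat)
            self.sums[k] = [s + x for s, x in zip(self.sums[k], f)]
            self.counts[k] += 1.0
        return k, probs[k]

adapter = LogitLevelAdapter(text_protos=[[1.0, 0.0], [0.0, 1.0]])
for feat in ([0.9, 0.1], [0.8, 0.3], [0.95, 0.05]):
    label, conf = adapter.update(feat)
```

Every update is a running sum or a count, so the per-sample cost is a handful of vector operations, which is consistent with the low latency overhead the entry reports.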
Details
Motivation: Existing test-time adaptation methods usually need backpropagation, covariance estimation, or substantial memory, which is impractical in streaming and edge scenarios; meanwhile, vision-language models under domain shift suffer from feature drift, class-prior mismatch, and severe miscalibration.
Method: Freeze the backbone and adapt only logit-level parameters through an online EM-style procedure with selective sample filtering, closed-form Bayesian updates (anchored by text and Dirichlet priors), decoupled temperatures (separating prediction from calibration), and lightweight guards (norm clipping, prior KL constraints, smoothed temperature).
Result: On several large-scale cross-domain and OOD benchmarks (PACS, Office-Home, and others; about 726K test samples), UL-TTA improves top-1 accuracy by 4.7 points on average over zero-shot CLIP, reduces ECE by 20-30%, and adds less than 8% latency; long-stream experiments (up to 200K samples) show no collapse.
Conclusion: Logit-level Bayesian adaptation is sufficient to give vision-language models a state-of-the-art accuracy-calibration trade-off without updating backbone parameters, and it suits resource-constrained streaming scenarios.
Abstract: Vision-Language Models (VLMs) such as CLIP achieve strong zero-shot recognition by comparing image embeddings to text-derived class prototypes. However, under domain shift, they suffer from feature drift, class-prior mismatch, and severe miscalibration. Existing test-time adaptation (TTA) methods often require backpropagation through large backbones, covariance estimation, or heavy memory/state, which is problematic for streaming and edge scenarios. We propose Ultra-Light Test-Time Adaptation (UL-TTA), a fully training-free and backprop-free framework that freezes the backbone and adapts only logit-level parameters: class prototypes, class priors, and temperature. UL-TTA performs an online EM-style procedure with (i) selective sample filtering to use only confident predictions, (ii) closed-form Bayesian updates for prototypes and priors anchored by text and Dirichlet priors, (iii) decoupled temperatures for prediction vs. calibration, and (iv) lightweight guards (norm clipping, prior KL constraints, smoothed temperature) to prevent drift in long streams. Across large-scale cross-domain and OOD benchmarks (PACS, Office-Home, DomainNet, Terra Incognita, ImageNet-R/A/V2/Sketch; ~726K test samples) and strong TTA baselines including Tent, T3A, CoTTA, SAR, Tip-Adapter, and FreeTTA, UL-TTA consistently improves top-1 accuracy (e.g., +4.7 points over zero-shot CLIP on average) while reducing ECE by 20-30%, with less than 8% latency overhead. Long-stream experiments up to 200K samples show no collapse. Our results demonstrate that logit-level Bayesian adaptation is sufficient to obtain state-of-the-art accuracy-calibration trade-offs for VLMs under domain shift, without updating any backbone parameters.
[102] DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization
Rui-Yang Ju,Kohei Yamashita,Hirotaka Kameko,Shinsuke Mori
Main category: cs.CV
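TL;DR: Introduces the Degraded Kuzushiji Documents with Seals (DKDS) dataset, which targets the low recognition accuracy of existing OCR methods on pre-modern Japanese documents affected by noise such as document degradation and seals.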
Details
Motivation: Existing OCR methods perform well on clean pre-modern Japanese documents but poorly when faced with noise such as document degradation and seals, and no dataset specifically addresses this problem.
Method: Builds the DKDS dataset with two tasks, text and seal detection and document binarization, and runs baseline experiments with YOLO models, traditional binarization algorithms (alone and combined with K-means clustering), and GAN-based methods.
Result: Provides baseline results for both tasks, showing that the dataset can be used to evaluate how different methods perform under complex degradation.
Conclusion: DKDS fills a gap in the recognition of degraded pre-modern Japanese documents and provides an important benchmark for future research.
Abstract: Kuzushiji, a pre-modern Japanese cursive script, can currently be read and understood by only a few thousand trained experts in Japan. With the rapid development of deep learning, researchers have begun applying Optical Character Recognition (OCR) techniques to transcribe Kuzushiji into modern Japanese. Although existing OCR methods perform well on clean pre-modern Japanese documents written in Kuzushiji, they often fail to consider various types of noise, such as document degradation and seals, which significantly affect recognition accuracy. To the best of our knowledge, no existing dataset specifically addresses these challenges. To address this gap, we introduce the Degraded Kuzushiji Documents with Seals (DKDS) dataset as a new benchmark for related tasks. We describe the dataset construction process, which required the assistance of a trained Kuzushiji expert, and define two benchmark tracks: (1) text and seal detection and (2) document binarization. For the text and seal detection track, we provide baseline results using multiple versions of the You Only Look Once (YOLO) models for detecting Kuzushiji characters and seals. For the document binarization track, we present baseline results from traditional binarization algorithms, traditional algorithms combined with K-means clustering, and Generative Adversarial Network (GAN)-based methods. The DKDS dataset and the implementation code for baseline methods are available at https://ruiyangju.github.io/DKDS.
[103] PIFF: A Physics-Informed Generative Flow Model for Real-Time Flood Depth Mapping
ChunLiang Wu,Tsunhua Yang,Hungying Chen
Main category: cs.CV
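TL;DR: Proposes PIFF, a physics-informed generative neural network for near real-time flood depth estimation that combines digital elevation models with a simplified hydrodynamic model, validated under a range of rainfall scenarios in Tainan, Taiwan.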
Details
Motivation: Traditional flood mapping methods such as numerical simulation and aerial photography are limited in efficiency and reliability and cannot meet the needs of real-time flood prediction.
Method: Builds PIFF on an image-to-image generative framework, uses a Digital Elevation Model (DEM) and a simplified inundation model (SPM) as physical priors, and adds a transformer-based rainfall encoder to capture temporal dependencies in precipitation, combining physics-informed constraints with data-driven learning to predict flood depth.
Result: Experiments on a 26 km² area in Tainan, Taiwan, with 182 rainfall scenarios show that PIFF efficiently produces accurate flood depth maps, outperforms traditional simulation methods, and supports near real-time prediction.
Conclusion: PIFF offers an efficient, reliable data-driven approach to flood mapping that can replace costly numerical simulation while preserving physical consistency, making it suitable for real flood emergency response.
Abstract: Flood mapping is crucial for assessing and mitigating flood impacts, yet traditional methods like numerical modeling and aerial photography face limitations in efficiency and reliability. To address these challenges, we propose PIFF, a physics-informed, flow-based generative neural network for near real-time flood depth estimation. Built on an image-to-image generative framework, it efficiently maps Digital Elevation Models (DEM) to flood depth predictions. The model is conditioned on a simplified inundation model (SPM) that embeds hydrodynamic priors into the training process. Additionally, a transformer-based rainfall encoder captures temporal dependencies in precipitation. Integrating physics-informed constraints with data-driven learning, PIFF captures the causal relationships between rainfall, topography, SPM, and flooding, replacing costly simulations with accurate, real-time flood maps. Using a 26 km² study area in Tainan, Taiwan, with 182 rainfall scenarios ranging from 24 mm to 720 mm over 24 hours, our results demonstrate that PIFF offers an effective, data-driven alternative for flood prediction and response.
[104] MACEval: A Multi-Agent Continual Evaluation Network for Large Models
Zijian Chen,Yuze Sun,Yuan Tian,Wenjun Zhang,Guangtao Zhai
Main category: cs.CV
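TL;DR: Proposes MACEval, a multi-agent continual evaluation network for dynamic evaluation of large models, and defines a new set of longitudinal, sustainable performance metrics.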
Details
Motivation: Most existing large-model benchmarks are closed-ended, prone to overfitting through training-data contamination, costly to maintain, and slow to update, so they cannot keep pace with rapid model progress.
Method: MACEval uses a cascaded multi-agent network with role assignment, in-process data generation, and evaluation routing to provide an interactive, autonomous evaluation mode.
Result: Experiments on 9 open-ended tasks and 23 large models show that MACEval is human-free, efficient and economical, and flexible and scalable, effectively supporting continual evaluation of large models.
Conclusion: MACEval offers a new automated, sustainable paradigm for large-model evaluation and may help shape the direction of future evaluation systems.
Abstract: Hundreds of benchmarks dedicated to evaluating large models from multiple perspectives have been presented over the past few years. Albeit substantial efforts, most of them remain closed-ended and are prone to overfitting due to the potential data contamination in the ever-growing training corpus of large models, thereby undermining the credibility of the evaluation. Moreover, the increasing scale and scope of current benchmarks with transient metrics, as well as the heavily human-dependent curation procedure, pose significant challenges for timely maintenance and adaptation to gauge the advancing capabilities of large models. In this paper, we introduce MACEval, a Multi-Agent Continual Evaluation network for dynamic evaluation of large models, and define a new set of metrics to quantify performance longitudinally and sustainably. MACEval adopts an interactive and autonomous evaluation mode that employs role assignment, in-process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 9 open-ended tasks with 23 participating large models demonstrate that MACEval is (1) human-free and automatic, mitigating laborious result processing with inter-agent judgment guided; (2) efficient and economical, reducing a considerable amount of data and overhead to obtain similar results compared to related benchmarks; and (3) flexible and scalable, migrating or integrating existing benchmarks via customized evaluation topologies. We hope that MACEval can broaden future directions of large model evaluation.
[105] PressTrack-HMR: Pressure-Based Top-Down Multi-Person Global Human Mesh Recovery
Jiayue Yuan,Fangting Xie,Guangwen Ouyang,Changhai Ma,Ziyu Wu,Heyu Ding,Quan Wan,Yi Ke,Yuchen Wu,Xiaohui Cai
Main category: cs.CV
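TL;DR: Proposes PressTrack-HMR, a pressure-based method for multi-person global human mesh recovery that uses a tracking-by-detection strategy to separate individual pressure signals from multi-person interaction data and estimate human pose, and builds MIP, a multi-person interaction pressure dataset. Experiments indicate good potential as a privacy-preserving, occlusion-free approach.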
Details
Motivation: Traditional vision-based multi-person HMR is limited in real scenes by mutual occlusion, insufficient lighting, and privacy concerns, whereas pressure signals from tactile mats offer an occlusion-free, privacy-friendly alternative. However, distinguishing the intermingled pressure signals and extracting each person's temporal data when several people walk at once remains a challenge.
Method: PressTrack-HMR is a top-down pipeline that uses tracking-by-detection to identify and segment each person's signal from the raw pressure data, then performs human mesh recovery independently for each individual; the multi-person interaction pressure dataset MIP is built to support related research.
Result: The method performs well on multi-person HMR, with an MPJPE of 89.2 mm and a WA-MPJPE$_{100}$ of 112.6 mm, supporting the feasibility and potential of pressure-based multi-person action recognition.
Conclusion: PressTrack-HMR recovers multiple human meshes from pressure signals alone; together with the MIP dataset it advances tactile-based multi-person motion analysis and shows the promise of tactile mats for ubiquitous, privacy-preserving action recognition.
Abstract: Multi-person global human mesh recovery (HMR) is crucial for understanding crowd dynamics and interactions. Traditional vision-based HMR methods sometimes face limitations in real-world scenarios due to mutual occlusions, insufficient lighting, and privacy concerns. Human-floor tactile interactions offer an occlusion-free and privacy-friendly alternative for capturing human motion. Existing research indicates that pressure signals acquired from tactile mats can effectively estimate human pose in single-person scenarios. However, when multiple individuals walk randomly on the mat simultaneously, how to distinguish intermingled pressure signals generated by different persons and subsequently acquire individual temporal pressure data remains a pending challenge for extending pressure-based HMR to the multi-person situation. In this paper, we present PressTrack-HMR, a top-down pipeline that recovers multi-person global human meshes solely from pressure signals. This pipeline leverages a tracking-by-detection strategy to first identify and segment each individual's pressure signal from the raw pressure data, and subsequently performs HMR for each extracted individual signal. Furthermore, we build a multi-person interaction pressure dataset MIP, which facilitates further research into pressure-based human motion analysis in multi-person scenarios. Experimental results demonstrate that our method excels in multi-person HMR using pressure data, with 89.2 mm MPJPE and 112.6 mm WA-MPJPE$_{100}$, and these showcase the potential of tactile mats for ubiquitous, privacy-preserving multi-person action recognition. Our dataset & code are available at https://github.com/Jiayue-Yuan/PressTrack-HMR.
[106] HOTFLoc++: End-to-End Hierarchical LiDAR Place Recognition, Re-Ranking, and 6-DoF Metric Localisation in Forests
Ethan Griffiths,Maryam Haghighat,Simon Denman,Clinton Fookes,Milad Ramezani
Main category: cs.CV
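TL;DR: HOTFLoc++ is an end-to-end framework for LiDAR place recognition, re-ranking, and 6-DoF metric localisation in forests; it uses an octree-based transformer and a multi-scale geometric verification module to substantially improve robustness and localisation accuracy.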
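For readers unfamiliar with how 6-DoF registration success is usually scored, the following sketch checks an estimated pose against ground truth using a translation threshold of 2 m and a rotation threshold of 5 degrees, the figures quoted for this entry. The error definitions (Euclidean translation error and geodesic rotation angle) are the common ones and are assumed here; the entry does not state which definitions the authors use, and all function names are hypothetical.

```python
import math

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(R_est^T R_gt) equals the element-wise product summed over all entries.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.degrees(math.acos(cos_angle))

def translation_error(t_est, t_gt):
    """Euclidean distance between two translation vectors, in metres."""
    return math.dist(t_est, t_gt)

def registration_success(R_est, t_est, R_gt, t_gt, max_t=2.0, max_r_deg=5.0):
    return (translation_error(t_est, t_gt) < max_t
            and rotation_error_deg(R_est, R_gt) < max_r_deg)

def rot_z(deg):
    """Helper: rotation about the z axis by deg degrees."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]

good = registration_success(rot_z(3.0), [1.0, 0.0, 0.0], rot_z(0.0), [0.0, 0.0, 0.0])
bad = registration_success(rot_z(10.0), [1.0, 0.0, 0.0], rot_z(0.0), [0.0, 0.0, 0.0])
```

A success rate such as the 97.2% reported here would then be the fraction of registration attempts for which this check passes.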
Details
Motivation: In complex environments such as forests, occlusion, self-similarity, and viewpoint changes make LiDAR place recognition and accurate localisation difficult, and existing methods are prone to re-ranking and registration failures.
Method: HOTFLoc++ uses an octree-based transformer to extract hierarchical local descriptors at multiple granularities, adds a learnable multi-scale geometric verification module for re-ranking, and applies a coarse-to-fine registration strategy for efficient 6-DoF localisation.
Result: Outperforms existing methods on several public datasets, with Recall@1 of 90.7% on CS-Wild-Places and 96.0% on MulRan; 97.2% of 6-DoF registrations have error under 2 m and 5 degrees; re-ranking reduces localisation error by about 2× on average; runtime is two orders of magnitude faster than RANSAC.
Conclusion: HOTFLoc++ achieves accurate, efficient LiDAR localisation in complex natural environments, with good robustness and practical potential.
Abstract: This article presents HOTFLoc++, an end-to-end framework for LiDAR place recognition, re-ranking, and 6-DoF metric localisation in forests. Leveraging an octree-based transformer, our approach extracts hierarchical local descriptors at multiple granularities to increase robustness to clutter, self-similarity, and viewpoint changes in challenging scenarios, including ground-to-ground and ground-to-aerial in forest and urban environments. We propose a learnable multi-scale geometric verification module to reduce re-ranking failures in the presence of degraded single-scale correspondences. Our coarse-to-fine registration approach achieves comparable or lower localisation errors to baselines, with runtime improvements of two orders of magnitude over RANSAC for dense point clouds. Experimental results on public datasets show the superiority of our approach compared to state-of-the-art methods, achieving an average Recall@1 of 90.7% on CS-Wild-Places: an improvement of 29.6 percentage points over baselines, while maintaining high performance on single-source benchmarks with an average Recall@1 of 91.7% and 96.0% on Wild-Places and MulRan, respectively. Our method achieves under 2 m and 5 degrees error for 97.2% of 6-DoF registration attempts, with our multi-scale re-ranking module reducing localisation errors by ~2× on average. The code will be available upon acceptance.
[107] DBINDS -- Can Initial Noise from Diffusion Model Inversion Help Reveal AI-Generated Videos?
Yanlin Wu,Xiaogang Yuan,Dezhi An
Main category: cs.CV
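TL;DR: Proposes DBINDS, a video detector based on diffusion-model inversion that analyzes latent-space dynamics rather than pixel-level features to detect generated videos.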
Details
Motivation: Existing detectors rely on pixel-level visual cues and generalize poorly to unseen generators, so detection models need better generalization and robustness.
Method: Recover initial noise sequences through diffusion-model inversion, build an Initial Noise Difference Sequence (INDS), extract multi-domain, multi-scale features, and classify with optimized features and a LightGBM classifier tuned by Bayesian search.
Result: Trained on a single generator, DBINDS shows strong cross-generator performance on GenVidBench, with good generalization and robustness in limited-data settings.
Conclusion: By analyzing latent-space dynamics, DBINDS offers a new approach to AI-generated video detection and significantly outperforms existing methods that rely on pixel-level features.
Abstract: AI-generated video has advanced rapidly and poses serious challenges to content security and forensic analysis. Existing detectors rely mainly on pixel-level visual cues and generalize poorly to unseen generators. We propose DBINDS, a diffusion-model-inversion based detector that analyzes latent-space dynamics rather than pixels. We find that initial noise sequences recovered by diffusion inversion differ systematically between real and generated videos. Building on this, DBINDS forms an Initial Noise Difference Sequence (INDS) and extracts multi-domain, multi-scale features. With feature optimization and a LightGBM classifier tuned by Bayesian search, DBINDS (trained on a single generator) achieves strong cross-generator performance on GenVidBench, demonstrating good generalization and robustness in limited-data settings.
[108] Towards Trustworthy Dermatology MLLMs: A Benchmark and Multimodal Evaluator for Diagnostic Narratives
Yuhao Shen,Jiahe Qian,Shuping Zhang,Zhangtianyi Chen,Tao Lu,Juexiao Zhou
Main category: cs.CV
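TL;DR: Proposes a new evaluation framework combining DermBench and DermEval for reliable, reproducible, and scalable assessment of multimodal large language models on dermatology diagnosis.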
Details
Motivation: Multimodal large language models for dermatological diagnosis lack reliable, clinically meaningful evaluation methods, which limits their responsible deployment in clinical practice.
Method: Build DermBench, a benchmark of 4,000 real-world dermatology images paired with expert-certified diagnostic narratives and scored by an LLM-based judge; also train DermEval, a reference-free multimodal evaluator that produces structured critiques and scores.
Result: Experiments on 4,500 cases show that DermBench and DermEval align closely with expert ratings, with mean deviations of 0.251 and 0.117 (out of 5), respectively, and can assess the diagnostic ability and trustworthiness of different multimodal LLMs.
Conclusion: The framework enables fine-grained, interpretable, clinically meaningful automated evaluation of models that generate dermatology diagnoses, helping expose model limitations and biases and supporting safe clinical use.
Abstract: Multimodal large language models (LLMs) are increasingly used to generate dermatology diagnostic narratives directly from images. However, reliable evaluation remains the primary bottleneck for responsible clinical deployment. We introduce a novel evaluation framework that combines DermBench, a meticulously curated benchmark, with DermEval, a robust automatic evaluator, to enable clinically meaningful, reproducible, and scalable assessment. We build DermBench, which pairs 4,000 real-world dermatology images with expert-certified diagnostic narratives and uses an LLM-based judge to score candidate narratives across clinically grounded dimensions, enabling consistent and comprehensive evaluation of multimodal models. For individual case assessment, we train DermEval, a reference-free multimodal evaluator. Given an image and a generated narrative, DermEval produces a structured critique along with an overall score and per-dimension ratings. This capability enables fine-grained, per-case analysis, which is critical for identifying model limitations and biases. Experiments on a diverse dataset of 4,500 cases demonstrate that DermBench and DermEval achieve close alignment with expert ratings, with mean deviations of 0.251 and 0.117 (out of 5), respectively, providing reliable measurement of diagnostic ability and trustworthiness across different multimodal LLMs.
[109] Taming Object Hallucinations with Verified Atomic Confidence Estimation
Jiarui Liu,Weihao Xuan,Zhijing Jin,Mona Diab
Main category: cs.CV
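TL;DR: Proposes TACO, a framework that mitigates hallucinations in multimodal large language models through self-verification and confidence calibration.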
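The black-box, self-consistency style of confidence estimation mentioned for this entry can be pictured as simple vote counting: ask the model the same atomic question in several paraphrased forms and take the share of agreeing answers as the confidence. The sketch below assumes yes/no answers have already been collected as strings; the function name and the answer format are hypothetical, and the paper's actual aggregation may differ.

```python
from collections import Counter

def self_consistency_confidence(answers):
    """Return the majority answer and the fraction of answers that agree with it.

    answers: model replies to paraphrases of one atomic query, e.g. "yes" / "no".
    """
    votes = Counter(a.strip().lower() for a in answers)
    label, count = votes.most_common(1)[0]
    return label, count / len(answers)

# Four paraphrases of "Is there a dog in the image?" answered by the same model.
label, confidence = self_consistency_confidence(["Yes", "yes", "no", "yes "])
```

A claim whose confidence falls below some chosen threshold could then be dropped or rewritten in the refinement step.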
Details
Motivation: Multimodal large language models (MLLMs) often hallucinate about object existence, attributes, or relations, which undermines their reliability.
Method: Decompose responses into atomic queries, paraphrase them to reduce sensitivity to wording, estimate confidence by self-consistency (black-box) or self-confidence (gray-box) aggregation, and finally refine the answer with a language model.
Result: On five benchmarks with two MLLMs, TACO outperforms direct prompting and Visual Contrastive Decoding, reduces systematic biases, and improves confidence calibration.
Conclusion: TACO effectively improves the faithfulness and reliability of MLLM outputs.
Abstract: Multimodal Large Language Models (MLLMs) often suffer from hallucinations, particularly errors in object existence, attributes, or relations, which undermine their reliability. We introduce TACO (Verified Atomic Confidence Estimation), a simple framework that mitigates hallucinations through self-verification and confidence calibration without relying on external vision experts. TACO decomposes responses into atomic queries, paraphrases them to reduce sensitivity to wording, and estimates confidence using self-consistency (black-box) or self-confidence (gray-box) aggregation, before refining answers with a language model. Experiments on five benchmarks (POPE, MME, HallusionBench, AMBER, and MM-Hal Bench) with two MLLMs (LLaVA-1.5-7B and CogVLM2) show that TACO consistently outperforms direct prompting and Visual Contrastive Decoding, reduces systematic biases, and improves confidence calibration, demonstrating its effectiveness in enhancing the faithfulness of MLLMs.
[110] Spatial Information Bottleneck for Interpretable Visual Recognition
Kaixiang Shu,Kai Meng,Junqin Luo
Main category: cs.CV
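TL;DR: Proposes S-IB, an information-bottleneck-based spatial disentanglement method that improves interpretability and robustness by optimizing the Vector-Jacobian Products computed during backpropagation.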
Details
Motivation: Deep neural networks typically learn spatially entangled representations that conflate foreground features with background noise, harming interpretability and robustness.
Method: Reinterpret gradient-based attribution from an information-theoretic view and propose an encoding-decoding perspective: forward propagation encodes inputs into class space, and the VJP in backpropagation decodes them back to feature space. On this basis, design the Spatial Information Bottleneck (S-IB), which achieves spatial disentanglement by maximizing mutual information in foreground regions and minimizing it in background regions.
Result: Validated on five benchmark datasets; S-IB markedly improves visualization quality for six explanation methods, strengthens foreground concentration and background suppression, and also raises classification accuracy.
Conclusion: By directly optimizing the spatial structure of VJPs during training, S-IB delivers general improvements across explanation methods and strengthens both interpretability and discriminative ability.
Abstract: Deep neural networks typically learn spatially entangled representations that conflate discriminative foreground features with spurious background correlations, thereby undermining model interpretability and robustness. We propose a novel understanding framework for gradient-based attribution from an information-theoretic perspective. We prove that, under mild conditions, the Vector-Jacobian Products (VJP) computed during backpropagation form minimal sufficient statistics of input features with respect to class labels. Motivated by this finding, we propose an encoding-decoding perspective: forward propagation encodes inputs into class space, while VJP in backpropagation decodes this encoding back to feature space. Therefore, we propose Spatial Information Bottleneck (S-IB) to spatially disentangle information flow. By maximizing mutual information between foreground VJP and inputs while minimizing mutual information in background regions, S-IB encourages networks to encode information only in class-relevant spatial regions. Since post-hoc explanation methods fundamentally derive from VJP computations, directly optimizing VJP's spatial structure during training improves visualization quality across diverse explanation paradigms. Experiments on five benchmarks demonstrate universal improvements across six explanation methods, achieving better foreground concentration and background suppression without method-specific tuning, alongside consistent classification accuracy gains.
[111] GRACE: Designing Generative Face Video Codec via Agile Hardware-Centric Workflow
Rui Wan,Qi Zheng,Ruoyu Zhang,Bu Chen,Jiaming Liu,Min Li,Minge Jing,Jinjia Zhou,Yibo Fan
Main category: cs.CV
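TL;DR: Proposes an FPGA-oriented deployment scheme for the Animation-based Generative Codec (AGC) for edge-computing video services, using software-hardware co-design to achieve energy-efficient, low-power face video compression and decoding.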
Details
Motivation: The AGC decoder has many parameters, heavy computation, and high power consumption, making it hard to deploy on resource-constrained edge devices, so an efficient hardware acceleration scheme is needed.
Method: Apply network compression (post-training static quantization and layer fusion) and software-hardware co-design to build an FPGA-based overlapped accelerator, using parallelization strategies such as double-buffered pipelines and loop unrolling.
Result: An AGC FPGA prototype on the PYNQ-Z1 platform achieves 24.9× and 4.1× higher energy efficiency than a CPU and a GPU, respectively, and needs only 11.7 microjoules per reconstructed pixel.
Conclusion: The proposed FPGA-oriented AGC deployment scheme substantially improves energy efficiency and suits video services on resource-constrained edge devices.
Abstract: The Animation-based Generative Codec (AGC) is an emerging paradigm for talking-face video compression. However, deploying its intricate decoder on resource and power-constrained edge devices presents challenges due to numerous parameters, the inflexibility to adapt to dynamically evolving algorithms, and the high power consumption induced by extensive computations and data transmission. This paper for the first time proposes a novel field programmable gate arrays (FPGAs)-oriented AGC deployment scheme for edge-computing video services. Initially, we analyze the AGC algorithm and employ network compression methods including post-training static quantization and layer fusion techniques. Subsequently, we design an overlapped accelerator utilizing the co-processor paradigm to perform computations through software-hardware co-design. The hardware processing unit comprises engines such as convolution, grid sampling, upsample, etc. Parallelization optimization strategies like double-buffered pipelines and loop unrolling are employed to fully exploit the resources of FPGA. Ultimately, we establish an AGC FPGA prototype on the PYNQ-Z1 platform using the proposed scheme, achieving 24.9× and 4.1× higher energy efficiency against commercial Central Processing Unit (CPU) and Graphic Processing Unit (GPU), respectively. Specifically, only 11.7 microjoules (µJ) are required for one pixel reconstructed by this FPGA system.
[112] Deep Learning for Metabolic Rate Estimation from Biosignals: A Comparative Study of Architectures and Signal Selection
Sarvenaz Babakhani,David Remy,Alina Roitberg
Main category: cs.CV
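TL;DR: Systematically evaluates neural architectures and physiological signal combinations for energy expenditure estimation. Minute ventilation is the most predictive single signal, and a transformer model performs best (RMSE of 0.87 W/kg), while multi-signal combinations suit lighter models such as CNN and ResNet. Low-intensity activities are predicted better, and inter-individual variability is substantial, calling for adaptive modeling strategies.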
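The two error measures quoted in this entry can be computed as in the short sketch below. RMSE is standard; for NRMSE the normalizer is an assumption, since the entry does not say whether the authors divide by the target range, the mean, or something else. Range normalization is used here, and the sample values are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error, in the same unit as the targets (e.g. W/kg)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def nrmse(y_true, y_pred):
    """RMSE divided by the target range (assumed normalizer; unitless)."""
    return rmse(y_true, y_pred) / (max(y_true) - min(y_true))

# Illustrative energy expenditure values in W/kg, not data from the paper.
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
err = rmse(measured, predicted)
norm_err = nrmse(measured, predicted)
```

Normalizing makes errors comparable across activity intensities, which is why higher-intensity tasks can show larger RMSE yet similar normalized error.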
Details
Motivation: Existing deep learning approaches rarely separate the effect of neural architecture from that of signal choice; this work systematically evaluates how each affects energy expenditure estimation.
Method: Compare classical regression methods with several deep learning models (e.g., transformer, CNN, ResNet with attention) on single signals, signal pairs, and grouped signals, using physiological signals such as heart rate, respiration, and accelerometer data across different physical activities.
Result: Minute ventilation is the best single signal (transformer RMSE of 0.87 W/kg); multi-signal combinations suit faster models (e.g., CNN, ResNet); low-intensity activities have lower error (RMSE down to 0.29 W/kg, NRMSE = 0.04); higher-intensity activities have larger absolute error but comparable normalized error; inter-individual variability is substantial.
Conclusion: Signal choice affects energy expenditure estimation more than model architecture, with minute ventilation the most predictive; multi-signal input can improve the efficiency of lightweight models; personalized and adaptive modeling is needed to handle individual differences.
Abstract: Energy expenditure estimation aims to infer human metabolic rate from physiological signals such as heart rate, respiration, or accelerometer data, and has been studied primarily with classical regression methods. The few existing deep learning approaches rarely disentangle the role of neural architecture from that of signal choice. In this work, we systematically evaluate both aspects. We compare classical baselines with newer neural architectures across single signals, signal pairs, and grouped sensor inputs for diverse physical activities. Our results show that minute ventilation is the most predictive individual signal, with a transformer model achieving the lowest root mean square error (RMSE) of 0.87 W/kg across all activities. Paired and grouped signals, such as those from the Hexoskin smart shirt (five signals), offer good alternatives for faster models like CNN and ResNet with attention. Per-activity evaluation revealed mixed outcomes: notably better results in low-intensity activities (RMSE down to 0.29 W/kg; NRMSE = 0.04), while higher-intensity tasks showed larger RMSE but more comparable normalized errors. Finally, subject-level analysis highlights strong inter-individual variability, motivating the need for adaptive modeling strategies. Our code and models will be publicly available at https://github.com/Sarvibabakhani/deeplearning-biosignals-ee.
[113] Enriching Knowledge Distillation with Cross-Modal Teacher Fusion
Amir M. Mansourian,Amir Mohammad Babaei,Shohreh Kasaei
Main category: cs.CV
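TL;DR: Proposes RichKD, a multi-teacher knowledge distillation method that uses the CLIP vision-language model for cross-modal supervision, fusing the logits and features of a conventional teacher with those of CLIP to improve the student's accuracy, prediction reliability, and out-of-distribution robustness.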
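A minimal sketch of logit-level teacher fusion, assuming the simplest possible rule: a convex combination of the conventional teacher's logits and CLIP's logits, followed by the standard temperature-scaled distillation loss. The mixing weight `alpha`, the temperature `T`, and the function names are hypothetical; the entry does not specify the fusion rule, and feature fusion and multi-prompt guidance are left out.

```python
import math

def softmax(z, T=1.0):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp((x - m) / T) for x in z]
    s = sum(e)
    return [x / s for x in e]

def fuse_logits(teacher_logits, clip_logits, alpha=0.5):
    """Convex combination of the two teachers' logits (alpha is hypothetical)."""
    return [alpha * t + (1.0 - alpha) * c for t, c in zip(teacher_logits, clip_logits)]

def kd_loss(student_logits, fused_logits, T=4.0):
    """Temperature-scaled KL(fused teacher || student), scaled by T^2."""
    p = softmax(fused_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fused = fuse_logits([4.0, 1.0, 0.0], [2.0, 3.0, 0.0], alpha=0.5)
loss = kd_loss([1.0, 1.0, 1.0], fused)
```

In practice the two teachers' logits may sit on different scales, so some rescaling before fusion would likely be needed; that step is not shown.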
Details
Motivation: Most knowledge distillation methods rely on a single visual modality and lack knowledge diversity; this work brings in CLIP's cross-modal semantic information to strengthen distillation.
Method: The RichKD framework fuses the logits and features of a conventional teacher with those of CLIP and uses CLIP's multi-prompt textual guidance to produce richer supervision.
Result: RichKD outperforms most existing baselines on multiple benchmarks, improving student accuracy, confidence calibration, and semantic consistency for non-target classes, and is more robust under distribution shift and input corruption.
Conclusion: By introducing CLIP's cross-modal knowledge, RichKD enriches the information available to distillation and demonstrates the potential of multimodal supervision in knowledge distillation.
Abstract: Multi-teacher knowledge distillation (KD), a more effective technique than traditional single-teacher methods, transfers knowledge from expert teachers to a compact student model using logit or feature matching. However, most existing approaches lack knowledge diversity, as they rely solely on unimodal visual information, overlooking the potential of cross-modal representations. In this work, we explore the use of CLIP's vision-language knowledge as a complementary source of supervision for KD, an area that remains largely underexplored. We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP. By incorporating CLIP's multi-prompt textual guidance, the fused supervision captures both dataset-specific and semantically enriched visual cues. Beyond accuracy, analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones. Moreover, fusion with CLIP refines the entire logit distribution, producing semantically meaningful probabilities for non-target classes, thereby improving inter-class consistency and distillation quality. Despite its simplicity, the proposed method, Enriching Knowledge Distillation (RichKD), consistently outperforms most existing baselines across multiple benchmarks and exhibits stronger robustness under distribution shifts and input corruptions.
[114] DensiCrafter: Physically-Constrained Generation and Fabrication of Self-Supporting Hollow Structures
Shengqi Dang,Fu Chai,Jiaxin Li,Chao Yuan,Wei Ye,Nan Cao
Main category: cs.CV
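TL;DR: Proposes DensiCrafter, a framework that generates lightweight, self-supporting 3D hollow structures by optimizing a density field; it is compatible with pretrained models and achieves up to 43% material reduction on text-to-3D while improving stability and maintaining geometric fidelity.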
Details
Motivation: Existing 3D generative models ignore physical constraints and manufacturability and struggle to produce lightweight, self-supporting designs.
Method: Starting from coarse voxel grids produced by Trellis, interpret them as continuous density fields and introduce three differentiable, physically constrained, simulation-free loss terms, combined with mass regularization and a restricted optimization domain that preserves the outer surface.
Result: Achieves up to 43% reduction in material mass on text-to-3D, improves structural stability while keeping high geometric fidelity compared with existing methods, and real 3D-printing experiments confirm manufacturability and self-support.
Conclusion: DensiCrafter generates lightweight, self-supporting, manufacturable hollow 3D structures without changing existing model architectures, and has practical potential.
Abstract: The rise of 3D generative models has enabled automatic 3D geometry and texture synthesis from multimodal inputs (e.g., text or images). However, these methods often ignore physical constraints and manufacturability considerations. In this work, we address the challenge of producing 3D designs that are both lightweight and self-supporting. We present DensiCrafter, a framework for generating lightweight, self-supporting 3D hollow structures by optimizing the density field. Starting from coarse voxel grids produced by Trellis, we interpret these as continuous density fields to optimize and introduce three differentiable, physically constrained, and simulation-free loss terms. Additionally, a mass regularization penalizes unnecessary material, while a restricted optimization domain preserves the outer surface. Our method seamlessly integrates with pretrained Trellis-based models (e.g., Trellis, DSO) without any architectural changes. In extensive evaluations, we achieve up to 43% reduction in material mass on the text-to-3D task. Compared to state-of-the-art baselines, our method could improve the stability and maintain high geometric fidelity. Real-world 3D-printing experiments confirm that our hollow designs can be reliably fabricated and could be self-supporting.
[115] DualFete: Revisiting Teacher-Student Interactions from a Feedback Perspective for Semi-supervised Medical Image Segmentation
Le Yi,Wei Huang,Lei Zhang,Kefu Zhao,Yan Wang,Zizhou Wang
Main category: cs.CV
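TL;DR: Proposes a feedback-based dual-teacher model to address error propagation in the teacher-student framework for semi-supervised medical image segmentation.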
Details
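Motivation: In medical image segmentation the teacher-student framework is prone to erroneous supervision caused by image ambiguity, and the student iteratively reconfirms these errors, producing self-reinforcing bias. Existing methods mostly rely on external modifications and overlook the framework's own capacity for error correction.
Method: Introduce a feedback mechanism into the teacher-student framework: the student gives feedback on the changes induced by the teacher's pseudo-labels, and the teacher refines the pseudo-labels accordingly. The key components are the feedback attributor, which identifies the pseudo-labels that trigger the student's update, and the feedback receiver, which decides where to apply the feedback. A dual-teacher model is further proposed that resolves disagreements through cross-teacher supervision and avoids consistent errors.
Result: Experiments on three medical image benchmarks show that the method effectively suppresses error propagation and improves segmentation performance.
Conclusion: The feedback mechanism exploits the error-correcting capacity intrinsic to the teacher-student framework, and the dual-teacher structure adds dynamics to the feedback loop, clearly outperforming conventional methods in semi-supervised medical image segmentation.
Abstract: The teacher-student paradigm has emerged as a canonical framework in semi-supervised learning. When applied to medical image segmentation, the paradigm faces challenges due to inherent image ambiguities, making it particularly vulnerable to erroneous supervision. Crucially, the student's iterative reconfirmation of these errors leads to self-reinforcing bias. While some studies attempt to mitigate this bias, they often rely on external modifications to the conventional teacher-student framework, overlooking its intrinsic potential for error correction. In response, this work introduces a feedback mechanism into the teacher-student framework to counteract error reconfirmations. Here, the student provides feedback on the changes induced by the teacher's pseudo-labels, enabling the teacher to refine these labels accordingly. We specify that this interaction hinges on two key components: the feedback attributor, which designates pseudo-labels triggering the student's update, and the feedback receiver, which determines where to apply this feedback. Building on this, a dual-teacher feedback model is further proposed, which allows more dynamics in the feedback loop and fosters more gains by resolving disagreements through cross-teacher supervision while avoiding consistent errors. Comprehensive evaluations on three medical image benchmarks demonstrate the method's effectiveness in addressing error propagation in semi-supervised medical image segmentation.
[116] FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection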
Jiangyong Yu,Changyong Shu,Sifan Zhou,Zichen Yu,Xing Hu,Yan Chen,Dawei Yang
Main category: cs.CV
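TL;DR: This paper proposes FQ-PETR, a fully quantized framework for the PETR family of models that uses three techniques (a quantization-friendly position embedding, dual-lookup-table approximation of non-linear functions, and quantization after numerical stabilization) to keep near-floating-point accuracy while substantially reducing inference latency.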
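The idea of quantizing after softmax numerical stabilization can be illustrated with a toy comparison. The sketch below is not the FQ-PETR implementation: it assumes a plain uniform 8-bit quantizer and a hypothetical clipping threshold `clip=-16.0`, and it only contrasts quantizing raw attention logits with quantizing them after the usual max-subtraction step.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def quantize_uniform(xs, lo, hi, bits=8):
    """Clip to [lo, hi], snap to a uniform grid with 2**bits levels, then dequantize."""
    if hi <= lo:
        return list(xs)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((min(max(x, lo), hi) - lo) / step) * step for x in xs]

def softmax_quantize_before(logits, bits=8):
    # Quantize raw logits over their full range: one extreme value stretches the grid.
    return softmax(quantize_uniform(logits, min(logits), max(logits), bits))

def softmax_quantize_after_stabilization(logits, clip=-16.0, bits=8):
    # Subtract the max first, so every value is <= 0 and the useful range is short.
    # `clip` is an assumed threshold; exp(-16) is already negligible.
    m = max(logits)
    shifted = [x - m for x in logits]
    return softmax(quantize_uniform(shifted, clip, 0.0, bits))

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

logits = [0.3, 0.1, -0.2, 0.25, -200.0]  # made-up logits with one extreme value
reference = softmax(logits)
err_before = max_abs_error(softmax_quantize_before(logits), reference)
err_after = max_abs_error(softmax_quantize_after_stabilization(logits), reference)
print(err_before, err_after)
```

In this example the extreme logit forces a coarse grid when the raw values are quantized, whereas after stabilization the same 8 bits cover only the range that actually influences the softmax output.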
Details
Motivation: PETR-style methods perform well in multi-view 3D detection but are computationally expensive, and applying existing quantization methods directly causes severe accuracy degradation, mainly because of the magnitude disparity between image features and positional embeddings and the quantization error of non-linear operators.
Method: The FQ-PETR framework (1) designs a LiDAR-guided single-point-sampling position embedding (QFPE) that aligns feature scales and eliminates problematic non-linearities; (2) uses a dual-lookup table (DULUT) to approximate non-linear functions efficiently; and (3) performs quantization after softmax numerical stabilization (QANS) to reduce attention distortion.
Result: Under W8A8 quantization, FQ-PETR incurs only about 1% accuracy loss while reducing inference latency by up to 75%, significantly outperforming existing PTQ and QAT methods.
Conclusion: FQ-PETR resolves the feature-scale mismatch and non-linear-operator difficulties in quantizing PETR models, delivering high-performance, low-latency multi-view 3D detection that is suitable for practical deployment.
Abstract: Camera-based multi-view 3D detection is crucial for autonomous driving. PETR and its variants (PETRs) excel in benchmarks but face deployment challenges due to high computational cost and memory footprint. Quantization is an effective technique for compressing deep neural networks by reducing the bit width of weights and activations. However, directly applying existing quantization methods to PETRs leads to severe accuracy degradation. This issue primarily arises from two key challenges: (1) significant magnitude disparity between multi-modal features, specifically image features and camera-ray positional embeddings (PE), and (2) the inefficiency and approximation error of quantizing non-linear operators, which commonly rely on hardware-unfriendly computations. In this paper, we propose FQ-PETR, a fully quantized framework for PETRs, featuring three key innovations: (1) Quantization-Friendly LiDAR-ray Position Embedding (QFPE): Replacing multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding eliminates problematic non-linearities (e.g., inverse-sigmoid) and aligns PE scale with image features, preserving accuracy. (2) Dual-Lookup Table (DULUT): This algorithm approximates complex non-linear functions using two cascaded linear LUTs, achieving high fidelity with minimal entries and no specialized hardware. (3) Quantization After Numerical Stabilization (QANS): Performing quantization after softmax numerical stabilization mitigates attention distortion from large inputs. On PETRs (e.g., PETR, StreamPETR, PETRv2, MV2d), FQ-PETR under W8A8 achieves near-floating-point accuracy (1% degradation) while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines.
[117] Spatio-Temporal Context Learning with Temporal Difference Convolution for Moving Infrared Small Target Detection
Houzhang Fang,Shukai Guo,Qiuhuan Chen,Yi Chang,Luxin Yan
Main category: cs.CV
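TL;DR: Proposes TDCNet, a new network for moving infrared small target detection that fuses temporal differences with 3D convolution for efficient spatio-temporal feature modeling and adds an attention mechanism to improve detection performance.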
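Why a temporal-difference branch and a convolution branch can be folded into a single kernel follows from linearity, and a scalar example makes this concrete. The sketch below is a simplified illustration, not TDCNet's module: it assumes one pixel tracked over time, a causal temporal kernel with zero padding, and a single hypothetical difference weight `diff_weight`; whether the paper merges its branches in exactly this way is an assumption.

```python
def temporal_conv(frames, kernel):
    """Causal 1D convolution along time: y[t] = sum_k kernel[k] * frames[t - k]."""
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:  # frames before t = 0 are treated as zero
                acc += w * frames[t - k]
        out.append(acc)
    return out

def two_branch(frames, conv_kernel, diff_weight):
    """Training-time view: a convolution branch plus a weighted frame-difference branch."""
    conv_out = temporal_conv(frames, conv_kernel)
    diff_out = [
        diff_weight * (frames[t] - (frames[t - 1] if t > 0 else 0.0))
        for t in range(len(frames))
    ]
    return [c + d for c, d in zip(conv_out, diff_out)]

def merge_kernels(conv_kernel, diff_weight):
    """Inference-time view: fold the difference branch into the first two kernel taps."""
    merged = list(conv_kernel)  # assumes the kernel has at least two taps
    merged[0] += diff_weight
    merged[1] -= diff_weight
    return merged

frames = [1.0, 2.0, 4.0, 3.0, 5.0]   # made-up intensities of one pixel over time
conv_kernel = [0.5, 0.3, 0.2]        # hypothetical temporal kernel
diff_weight = 0.7                    # hypothetical difference weight

branch_out = two_branch(frames, conv_kernel, diff_weight)
merged_out = temporal_conv(frames, merge_kernels(conv_kernel, diff_weight))
print(branch_out)
print(merged_out)
```

Both computations give the same output up to floating-point rounding, which is the property that lets a multi-branch block be collapsed into one convolution after training.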
Details
Motivation: Existing methods are limited in extracting spatio-temporal features: temporal-difference methods are weak at extracting spatial features, while 3D convolution lacks explicit awareness of motion dynamics, so a method is needed that captures both motion cues and spatial context.
Method: A temporal difference convolution (TDC) re-parameterization module is designed with three parallel TDC blocks that fuse temporal difference and 3D convolution; a TDC-guided spatio-temporal attention mechanism combines features from the TDC backbone and a 3D backbone to model global semantic dependencies.
Result: Experiments on IRSTD-UAV and public infrared datasets show that TDCNet achieves state-of-the-art detection performance.
Conclusion: Through effective spatio-temporal feature enhancement, TDCNet significantly improves detection accuracy for moving infrared small targets in complex backgrounds.
Abstract: Moving infrared small target detection (IRSTD) plays a critical role in practical applications, such as surveillance of unmanned aerial vehicles (UAVs) and UAV-based search systems. Moving IRSTD remains highly challenging due to weak target features and complex background interference. Accurate spatio-temporal feature modeling is crucial for moving target detection, typically achieved through either temporal differences or spatio-temporal (3D) convolutions. Temporal difference can explicitly leverage motion cues but exhibits limited capability in extracting spatial features, whereas 3D convolution effectively represents spatio-temporal features yet lacks explicit awareness of motion dynamics along the temporal dimension. In this paper, we propose a novel moving IRSTD network (TDCNet), which effectively extracts and enhances spatio-temporal features for accurate target detection. Specifically, we introduce a novel temporal difference convolution (TDC) re-parameterization module that comprises three parallel TDC blocks designed to capture contextual dependencies across different temporal ranges. Each TDC block fuses temporal difference and 3D convolution into a unified spatio-temporal convolution representation. This re-parameterized module can effectively capture multi-scale motion contextual features while suppressing pseudo-motion clutter in complex backgrounds, significantly improving detection performance. Moreover, we propose a TDC-guided spatio-temporal attention mechanism that performs cross-attention between the spatio-temporal features from the TDC-based backbone and a parallel 3D backbone. This mechanism models their global semantic dependencies to refine the current frame's features. Extensive experiments on IRSTD-UAV and public infrared datasets demonstrate that our TDCNet achieves state-of-the-art detection performance in moving target detection.
[118] Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition
Yang Chen,Miaoge Li,Zhijie Rao,Deze Zeng,Song Guo,Jingcai Guo
Main category: cs.CV
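TL;DR: Proposes Flora, a new method for zero-shot skeleton action recognition that addresses the alignment and classification limitations of existing methods through flexible neighbor-aware semantic attunement and a distribution-aware flow classifier.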
Details
Motivation: Recognizing unseen action categories is highly challenging because the corresponding skeletal priors are absent. Existing methods suffer from fragile point-to-point alignment and restrictive static classification boundaries.
Method: Flora uses flexible neighbor-aware semantic attunement to form direction-aware regional semantics, combined with a cross-modal geometric consistency objective for robust point-to-region alignment; it uses noise-free flow matching to narrow the modality distribution gap and condition-free contrastive regularization to enhance discriminability.
Result: Extensive experiments on three benchmark datasets validate the method, which performs well even when trained with only 10% of the seen data.
Conclusion: By improving the semantic alignment and classification mechanisms, Flora achieves more robust and fine-grained zero-shot skeleton action recognition.
Abstract: Recognizing unseen skeleton action categories remains highly challenging due to the absence of corresponding skeletal priors. Existing approaches generally follow an "align-then-classify" paradigm but face two fundamental issues, i.e., (i) fragile point-to-point alignment arising from imperfect semantics, and (ii) rigid classifiers restricted by static decision boundaries and coarse-grained anchors. To address these issues, we propose a novel method for zero-shot skeleton action recognition, termed $\texttt{$\textbf{Flora}$}$, which builds upon $\textbf{F}$lexib$\textbf{L}$e neighb$\textbf{O}$r-aware semantic attunement and open-form dist$\textbf{R}$ibution-aware flow cl$\textbf{A}$ssifier. Specifically, we flexibly attune textual semantics by incorporating neighboring inter-class contextual cues to form direction-aware regional semantics, coupled with a cross-modal geometric consistency objective that ensures stable and robust point-to-region alignment. Furthermore, we employ noise-free flow matching to bridge the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, leading to a distribution-aware classifier with fine-grained decision boundaries achieved through token-level velocity predictions. Extensive experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10\% of the seen data. Code is available at https://github.com/cseeyangchen/Flora.
[119] OUGS: Active View Selection via Object-aware Uncertainty Estimation in 3DGS
Haiyi Li,Qi Chen,Denis Kalkofen,Hsiang-Ting Chen
Main category: cs.CV
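TL;DR: This paper proposes OUGS, a new framework that models uncertainty from the physical parameters of 3D Gaussian primitives and combines it with semantic segmentation to enable efficient object-aware active reconstruction, significantly improving reconstruction quality and efficiency for target objects in complex scenes.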
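Propagating parameter covariance through a Jacobian and then masking by object can be sketched with standard first-order uncertainty propagation, where a pixel with Jacobian row J and parameter covariance Σ has variance J Σ Jᵀ. The code below is an illustrative sketch rather than the OUGS implementation: the Jacobians, covariance, masks, and the rule of picking the view with the largest masked uncertainty are all assumptions made for the example.

```python
def pixel_variance(jacobian_row, param_cov):
    """First-order variance of one pixel: J * Sigma * J^T for a single-row Jacobian."""
    n = len(jacobian_row)
    return sum(
        jacobian_row[i] * param_cov[i][j] * jacobian_row[j]
        for i in range(n)
        for j in range(n)
    )

def view_uncertainty(jacobian, param_cov, object_mask):
    """Sum per-pixel variances over the pixels the mask marks as object."""
    return sum(
        pixel_variance(row, param_cov)
        for row, keep in zip(jacobian, object_mask)
        if keep
    )

def select_next_view(candidates, param_cov):
    """Return the candidate with the largest object-aware uncertainty, plus all scores."""
    scores = {
        name: view_uncertainty(jac, param_cov, mask)
        for name, (jac, mask) in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Made-up covariance of two Gaussian parameters (for example, one position and one scale).
param_cov = [[0.04, 0.0], [0.0, 0.01]]

# Each hypothetical view: (per-pixel Jacobian rows, object mask).
candidates = {
    "view_a": ([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]], [True, True, False]),
    "view_b": ([[2.0, 1.0], [0.5, 0.0], [0.0, 0.0]], [True, False, True]),
}

best, scores = select_next_view(candidates, param_cov)
scene_level = {
    name: view_uncertainty(jac, param_cov, [True] * len(jac))
    for name, (jac, _) in candidates.items()
}
print(best, scores, scene_level)
```

With these numbers the unmasked, scene-level score favors `view_a` because of a high-variance background pixel, while the masked score favors `view_b`, which mirrors the background-bias problem that object-aware scoring is meant to avoid.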
Details
Motivation: Existing active reconstruction methods rely on scene-level uncertainty metrics, which are easily biased by the background and make it hard to obtain high-fidelity reconstructions of specific objects efficiently in complex scenes.
Method: The OUGS framework starts from the physical parameters of 3D Gaussian primitives (position, scale, rotation), derives an interpretable uncertainty model by propagating their covariance through the rendering Jacobian, and integrates semantic segmentation masks to produce an object-aware uncertainty estimate that guides more effective view selection.
Result: Experiments on public datasets show that OUGS improves the reconstruction quality of target objects more efficiently than existing state-of-the-art methods, while also serving as a good uncertainty estimator for the global scene.
Conclusion: By combining physically interpretable uncertainty modeling with semantic awareness, OUGS provides an efficient and robust solution for object-level 3DGS reconstruction in complex scenes.
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have achieved state-of-the-art results for novel view synthesis. However, efficiently capturing high-fidelity reconstructions of specific objects within complex scenes remains a significant challenge. A key limitation of existing active reconstruction methods is their reliance on scene-level uncertainty metrics, which are often biased by irrelevant background clutter and lead to inefficient view selection for object-centric tasks. We present OUGS, a novel framework that addresses this challenge with a more principled, physically-grounded uncertainty formulation for 3DGS. Our core innovation is to derive uncertainty directly from the explicit physical parameters of the 3D Gaussian primitives (e.g., position, scale, rotation). By propagating the covariance of these parameters through the rendering Jacobian, we establish a highly interpretable uncertainty model. This foundation allows us to then seamlessly integrate semantic segmentation masks to produce a targeted, object-aware uncertainty score that effectively disentangles the object from its environment. This allows for a more effective active view selection strategy that prioritizes views critical to improving object fidelity. Experimental evaluations on public datasets demonstrate that our approach significantly improves the efficiency of the 3DGS reconstruction process and achieves higher quality for targeted objects compared to existing state-of-the-art methods, while also serving as a robust uncertainty estimator for the global scene.
[120] BronchOpt : Vision-Based Pose Optimization with Fine-Tuned Foundation Models for Accurate Bronchoscopy Navigation
Hongchao Shu,Roger D. Soberanis-Mukul,Jiru Xu,Hao Ding,Morgan Ringel,Mali Shen,Saif Iftekar Sayed,Hedyeh Rafii-Tari,Mathias Unberath
Main category: cs.CV
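TL;DR: Proposes a vision-based bronchoscopy navigation framework that uses synthetic data to achieve cross-domain generalization, and releases the first public synthetic benchmark dataset.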
Details
Motivation: Intra-operative bronchoscope localization is challenging because of respiratory motion, anatomical variability, and divergence between the CT and the actual body, and existing methods generalize poorly across domains and patients.
Method: A vision-based pose optimization framework is proposed. A fine-tuned modality- and domain-invariant encoder enables direct similarity computation between endoscopic RGB images and CT-rendered depth maps, and a differentiable rendering module iteratively refines the camera pose through depth consistency.
Result: The model, trained on synthetic data, achieves an average translational error of 2.65 mm and a rotational error of 0.19 rad on the benchmark, and also shows good cross-domain alignment on real patient data.
Conclusion: The framework achieves robust, domain-invariant bronchoscope localization, and the proposed synthetic benchmark dataset provides a standardized evaluation basis for future research.
Abstract: Accurate intra-operative localization of the bronchoscope tip relative to patient anatomy remains challenging due to respiratory motion, anatomical variability, and CT-to-body divergence that cause deformation and misalignment between intra-operative views and pre-operative CT. Existing vision-based methods often fail to generalize across domains and patients, leading to residual alignment errors. This work establishes a generalizable foundation for bronchoscopy navigation through a robust vision-based framework and a new synthetic benchmark dataset that enables standardized and reproducible evaluation. We propose a vision-based pose optimization framework for frame-wise 2D-3D registration between intra-operative endoscopic views and pre-operative CT anatomy. A fine-tuned modality- and domain-invariant encoder enables direct similarity computation between real endoscopic RGB frames and CT-rendered depth maps, while a differentiable rendering module iteratively refines camera poses through depth consistency. To enhance reproducibility, we introduce the first public synthetic benchmark dataset for bronchoscopy navigation, addressing the lack of paired CT-endoscopy data. Trained exclusively on synthetic data distinct from the benchmark, our model achieves an average translational error of 2.65 mm and a rotational error of 0.19 rad, demonstrating accurate and stable localization. Qualitative results on real patient data further confirm strong cross-domain generalization, achieving consistent frame-wise 2D-3D alignment without domain-specific adaptation. Overall, the proposed framework achieves robust, domain-invariant localization through iterative vision-based optimization, while the new benchmark provides a foundation for standardized progress in vision-based bronchoscopy navigation.
[121] Hand Held Multi-Object Tracking Dataset in American Football
Rintaro Otsubo,Kanta Sawafuji,Hideo Saito
Main category: cs.CV
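TL;DR: This paper presents the first detection and tracking dataset for American football players, filling the gap left by existing datasets in high-density scenes with frequent occlusion, and shows that fine-tuning detection and re-identification models significantly improves tracking accuracy.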
Details
Motivation: Existing multi-object tracking methods are usually evaluated on everyday scenes or on specific sports such as soccer and basketball. American football involves frequent player contact and occlusion, yet it lacks a public standardized dataset, which makes fair comparison between methods difficult.
Method: A dedicated detection and tracking dataset for American football players is constructed, and a range of detection and tracking methods are compared on it; fine-tuned detection and re-identification models are integrated into the tracking systems.
Result: Experiments show that accurate detection and tracking can be achieved even in crowded scenes; fine-tuning significantly improves detection performance; and combining fine-tuned detection and re-identification models further improves tracking accuracy.
Conclusion: This work provides a reliable detection and tracking solution for American football players in high-density scenes with complex interactions, and advances the application of these techniques in this domain.
Abstract: Multi-Object Tracking (MOT) plays a critical role in analyzing player behavior from videos, enabling performance evaluation. Current MOT methods are often evaluated using publicly available datasets. However, most of these focus on everyday scenarios such as pedestrian tracking or are tailored to specific sports, including soccer and basketball. Despite the inherent challenges of tracking players in American football, such as frequent occlusion and physical contact, no standardized dataset has been publicly available, making fair comparisons between methods difficult. To address this gap, we constructed the first dedicated detection and tracking dataset for American football players and conducted a comparative evaluation of various detection and tracking methods. Our results demonstrate that accurate detection and tracking can be achieved even in crowded scenarios. Fine-tuning detection models improved performance over pre-trained models. Furthermore, when these fine-tuned detectors and re-identification models were integrated into tracking systems, we observed notable improvements in tracking accuracy compared to existing approaches. This work thus enables robust detection and tracking of American football players in challenging, high-density scenarios previously underserved by conventional methods.
[122] Revisiting Cross-Architecture Distillation: Adaptive Dual-Teacher Transfer for Lightweight Video Models
Ying Peng,Hongsen Ye,Changxin Huang,Xiping Hu,Jian Chen,Runhao Zeng
Main category: cs.CV
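TL;DR: Proposes a dual-teacher knowledge distillation framework that combines ViT and CNN teacher models and uses dynamic weighting and a structure-discrepancy-aware strategy to improve lightweight CNNs for video action recognition.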
Details
Motivation: Existing cross-architecture knowledge distillation methods suffer from architectural mismatch and overlook the value of strong homogeneous CNN teachers, which limits the performance of lightweight CNNs.
Method: A dual-teacher framework is introduced with a heterogeneous ViT teacher and a homogeneous CNN teacher. It uses a dynamic weighted-fusion mechanism based on confidence and prediction discrepancy, together with a structure-discrepancy-aware distillation strategy in which a lightweight auxiliary branch learns the residual features between the ViT and CNN teachers.
Result: The method significantly outperforms existing distillation methods on benchmarks including HMDB51, EPIC-KITCHENS-100, and Kinetics-400, with an accuracy gain of up to 5.95% on HMDB51.
Conclusion: The method mitigates the negative effects of architectural differences, exploits the strengths of both kinds of teacher, and significantly improves the recognition accuracy of the lightweight CNN student.
Abstract: Vision Transformers (ViTs) have achieved strong performance in video action recognition, but their high computational cost limits their practicality. Lightweight CNNs are more efficient but suffer from accuracy gaps. Cross-Architecture Knowledge Distillation (CAKD) addresses this by transferring knowledge from ViTs to CNNs, yet existing methods often struggle with architectural mismatch and overlook the value of stronger homogeneous CNN teachers. To tackle these challenges, we propose a Dual-Teacher Knowledge Distillation framework that leverages both a heterogeneous ViT teacher and a homogeneous CNN teacher to collaboratively guide a lightweight CNN student. We introduce two key components: (1) Discrepancy-Aware Teacher Weighting, which dynamically fuses the predictions from ViT and CNN teachers by assigning adaptive weights based on teacher confidence and prediction discrepancy with the student, enabling more informative and effective supervision; and (2) a Structure Discrepancy-Aware Distillation strategy, where the student learns the residual features between ViT and CNN teachers via a lightweight auxiliary branch, focusing on transferable architectural differences without mimicking all of ViT's high-dimensional patterns. Extensive experiments on benchmarks including HMDB51, EPIC-KITCHENS-100, and Kinetics-400 demonstrate that our method consistently outperforms state-of-the-art distillation approaches, achieving notable performance improvements with a maximum accuracy gain of 5.95% on HMDB51.
[123] DreamPose3D: Hallucinative Diffusion with Prompt Learning for 3D Human Pose Estimation
Jerrin Bright,Yuhao Chen,John S. Zelek
Main category: cs.CV
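TL;DR: This paper proposes DreamPose3D, a diffusion-based 3D human pose estimation framework that combines action-aware reasoning with temporal imagination, using action prompts, joint-affinity attention, and a hallucinative decoder to predict accurate, temporally coherent 3D poses.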
Details
Motivation: Most existing methods rely on geometric cues and predict each frame independently, so they struggle with ambiguous motion and generalization to real-world scenes, and they do not model high-level action intent or the structural relationships between joints.
Method: Within a diffusion framework, an action-aware conditional denoising mechanism uses action prompts extracted from 2D pose sequences; an encoder incorporates kinematic joint affinity into the attention mechanism; and a temporally consistent hallucinative pose decoder simulates how humans mentally reconstruct motion trajectories.
Result: The method achieves state-of-the-art performance on the Human3.6M and MPI-3DHP datasets and shows strong robustness to noisy and ambiguous inputs on a broadcast baseball dataset, maintaining temporal consistency and consistency with action intent.
Conclusion: By combining action understanding with temporal imagination, DreamPose3D significantly improves the accuracy and robustness of 3D human pose estimation and supports the value of intent-driven modeling for understanding complex motion.
Abstract: Accurate 3D human pose estimation remains a critical yet unresolved challenge, requiring both temporal coherence across frames and fine-grained modeling of joint relationships. However, most existing methods rely solely on geometric cues and predict each 3D pose independently, which limits their ability to resolve ambiguous motions and generalize to real-world scenarios. Inspired by how humans understand and anticipate motion, we introduce DreamPose3D, a diffusion-based framework that combines action-aware reasoning with temporal imagination for 3D pose estimation. DreamPose3D dynamically conditions the denoising process using task-relevant action prompts extracted from 2D pose sequences, capturing high-level intent. To model the structural relationships between joints effectively, we introduce a representation encoder that incorporates kinematic joint affinity into the attention mechanism. Finally, a hallucinative pose decoder predicts temporally coherent 3D pose sequences during training, simulating how humans mentally reconstruct motion trajectories to resolve ambiguity in perception. Extensive experiments on the Human3.6M and MPI-3DHP benchmark datasets demonstrate state-of-the-art performance across all metrics. To further validate DreamPose3D's robustness, we tested it on a broadcast baseball dataset, where it demonstrated strong performance despite ambiguous and noisy 2D inputs, effectively handling temporal consistency and intent-driven motion variations.
[124] vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs
Minye Shao,Sihan Guo,Xinrun Li,Xingyu Miao,Haoran Duan,Yang Long
Main category: cs.CV
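TL;DR: Proposes vMFCoOp, a framework that inversely estimates von Mises-Fisher distributions on a hyperspherical manifold and uses unified semantic anchors to align semantic biases between arbitrary LLMs and CLIP backbones, enabling robust biomedical prompting and few-shot classification.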
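For readers unfamiliar with the von Mises-Fisher (vMF) distribution, the sketch below fits one to a handful of embeddings on the unit sphere. It is not vMFCoOp's inverse estimation procedure: it uses the standard moment-based estimate of the mean direction together with the widely used closed-form approximation of the concentration, kappa ≈ r(d − r²)/(1 − r²), where r is the length of the averaged unit vectors and d the dimension. The embeddings are made up for illustration.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fit_vmf(embeddings):
    """Return (mean_direction, kappa) for embeddings projected onto the unit sphere.

    Assumes the embeddings are not all identical (otherwise r = 1 and kappa diverges).
    """
    unit = [normalize(e) for e in embeddings]
    d = len(unit[0])
    mean = [sum(u[i] for u in unit) / len(unit) for i in range(d)]
    r_bar = math.sqrt(sum(m * m for m in mean))   # mean resultant length in (0, 1)
    mu = [m / r_bar for m in mean]                 # estimated mean direction
    kappa = r_bar * (d - r_bar ** 2) / (1.0 - r_bar ** 2)  # approximate concentration
    return mu, kappa

# Hypothetical 3-D embeddings: one tightly clustered class, one spread-out class.
tight = [[1.0, 0.05, 0.0], [1.0, -0.05, 0.0], [1.0, 0.0, 0.05]]
loose = [[1.0, 0.8, 0.0], [1.0, -0.8, 0.0], [0.6, 0.0, 1.0]]

mu_tight, kappa_tight = fit_vmf(tight)
mu_loose, kappa_loose = fit_vmf(loose)
print(kappa_tight, kappa_loose)
```

A larger kappa means the embeddings are more concentrated around the mean direction, so the tight cluster yields a much higher value than the loose one.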
Details
Motivation: Existing context-optimization methods suffer from semantic misalignment between LLMs and CLIP variants, and conventional Euclidean-space optimization struggles to model unified representations and local geometric constraints, which limits few-shot adaptation on complex biomedical images.
Method: The vMFCoOp framework inversely estimates von Mises-Fisher distributions on a shared hyperspherical manifold, introduces three complementary constraints, and aligns the semantic biases between different LLMs and CLIP backbones through unified semantic anchors.
Result: The method shows consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming current state-of-the-art methods in accuracy, generalization, and clinical applicability.
Conclusion: vMFCoOp addresses semantic misalignment and the modality gap, scales to continuously evolving families of foundation models, and shows broad potential for downstream applications.
Abstract: Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work will be continuously expanded to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.
[125] RF-DETR: Neural Architecture Search for Real-Time Detection Transformers
Isaac Robinson,Peter Robicheaux,Matvei Popov,Deva Ramanan,Neehar Peri
Main category: cs.CV
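TL;DR: This paper presents RF-DETR, a lightweight specialist detection transformer that uses weight-sharing neural architecture search (NAS) to discover the accuracy-latency Pareto frontier on a target dataset, significantly improving real-time detection performance on datasets such as COCO and Roboflow100-VL.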
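An accuracy-latency Pareto frontier is the set of configurations that no other configuration beats on both axes at once. The sketch below shows one way to extract it from a list of evaluated candidates; the configuration names and numbers are invented for illustration and are not RF-DETR results, and the search procedure that produces the candidates is out of scope here.

```python
def pareto_front(configs):
    """Return configs not dominated by any other, sorted by latency.

    Each config is (name, latency_ms, accuracy). A config is dominated when another
    one is at least as fast and at least as accurate, and strictly better in one.
    """
    front = []
    for name, latency, accuracy in configs:
        dominated = any(
            other_lat <= latency
            and other_acc >= accuracy
            and (other_lat < latency or other_acc > accuracy)
            for _, other_lat, other_acc in configs
        )
        if not dominated:
            front.append((name, latency, accuracy))
    return sorted(front, key=lambda c: c[1])

# Hypothetical sub-network evaluations: (name, latency in ms, accuracy score).
candidates = [
    ("a", 2.0, 45.0),
    ("b", 3.0, 44.0),  # slower and less accurate than "a"
    ("c", 4.0, 50.0),
    ("d", 5.0, 49.0),  # slower and less accurate than "c"
    ("e", 6.0, 53.0),
]

front = pareto_front(candidates)
print([name for name, _, _ in front])
```

Every configuration on the frontier is a defensible choice for some latency budget, which is why reporting the whole frontier is more informative than reporting a single model.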