cs.CL

[1] How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion

Agrima Seth,Monojit Choudhary,Sunayana Sitaram,Kentaro Toyama,Aditya Vashistha,Kalika Bali

Main category: cs.CL

TL;DR: The study finds that GPT-4 Turbo exhibits significant representational bias in generating stories about Indian life events, overrepresenting dominant cultural groups and showing that bias mitigation needs more than training data diversification.

Details

Motivation: To expand research on representational bias in LLMs beyond Global North-centric identities and single-response interactions. Method: Systematic audit of GPT-4 Turbo by generating over 7,200 stories about significant life events in India, comparing religious and caste representation in outputs against census data. Result: GPT-4 Turbo overrepresented culturally dominant groups beyond their statistical presence, showing a winner-take-all quality in representational bias that is more biased than likely distribution bias in training data. Conclusion: Training data diversification may not be enough to correct representational bias in LLMs, suggesting the need for more fundamental changes in model development. Abstract: Representational bias in large language models (LLMs) has predominantly been measured through single-response interactions and has focused on Global North-centric identities like race and gender. We expand on that research by conducting a systematic audit of GPT-4 Turbo to reveal how deeply encoded representational biases are and how they extend to less-explored dimensions of identity. We prompt GPT-4 Turbo to generate over 7,200 stories about significant life events (such as weddings) in India, using prompts designed to encourage diversity to varying extents. Comparing the diversity of religious and caste representation in the outputs against the actual population distribution in India as recorded in census data, we quantify the presence and "stickiness" of representational bias in the LLM for religion and caste. We find that GPT-4 responses consistently overrepresent culturally dominant groups far beyond their statistical representation, despite prompts intended to encourage representational diversity. 
Our findings also suggest that representational bias in LLMs has a winner-take-all quality that is more biased than the likely distribution bias in their training data, and repeated prompt-based nudges have limited and inconsistent efficacy in dislodging these biases. These results suggest that diversifying training data alone may not be sufficient to correct LLM bias, highlighting the need for more fundamental changes in model development. Dataset and Codebook: https://github.com/agrimaseth/How-Deep-Is-Representational-Bias-in-LLMs
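
The comparison described above, output representation versus population share, can be illustrated with a small sketch. The function name, group labels, and all numbers below are hypothetical placeholders; they are not figures from the paper or from census data, and the authors' actual codebook may measure representation differently.

```python
from collections import Counter


def representation_ratios(labels, population_share):
    """Return observed share divided by reference share for each group.

    A ratio above 1 means the group appears more often in the generated
    stories than its reference population share would suggest.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    ratios = {}
    for group, share in population_share.items():
        observed = counts.get(group, 0) / total
        ratios[group] = observed / share
    return ratios


# Hypothetical toy data: 10 annotated stories and made-up reference shares.
story_labels = ["A"] * 9 + ["B"] * 1
reference_share = {"A": 0.75, "B": 0.25}

ratios = representation_ratios(story_labels, reference_share)
for group, ratio in ratios.items():
    print(f"group {group}: ratio {ratio:.2f}")
```

In this toy case group A is overrepresented (ratio above 1) and group B underrepresented (ratio below 1), which is the kind of gap the audit quantifies.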

[2] FeynTune: Large Language Models for High-Energy Theory

Paul Richmond,Prarit Agarwal,Borun Chowdhury,Vasilis Niarchos,Constantinos Papageorgakis

Main category: cs.CL

TL;DR: The study fine-tuned specialized Large Language Models for High-Energy Physics using Llama-3.1 and showed their superiority over general and commercial models in domain-specific tasks.

Details

Motivation: To develop specialized Large Language Models tailored for High-Energy Theoretical Physics that outperform general-purpose models in domain-specific tasks. Method: The authors created 20 fine-tuned variants of the Llama-3.1 model using Low-Rank Adaptation approaches, trained on arXiv abstracts from hep-th, hep-ph, gr-qc, and other fields like q-bio and cs. Result: All fine-tuned models outperformed the base Llama-3.1 model on hep-th abstract completion tasks and also surpassed leading commercial LLMs in performance. Conclusion: The paper concludes that fine-tuned specialized Large Language Models outperform the base model and commercial LLMs in High-Energy Theoretical Physics tasks, offering insights for further development. Abstract: We present specialized Large Language Models for theoretical High-Energy Physics, obtained as 20 fine-tuned variants of the 8-billion parameter Llama-3.1 model. Each variant was trained on arXiv abstracts (through August 2024) from different combinations of hep-th, hep-ph and gr-qc. For a comparative study, we also trained models on datasets that contained abstracts from disparate fields such as the q-bio and cs categories. All models were fine-tuned using two distinct Low-Rank Adaptation fine-tuning approaches and varying dataset sizes, and outperformed the base model on hep-th abstract completion tasks. We compare performance against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek) and derive insights for further developing specialized language models for High-Energy Theoretical Physics.

[3] Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering

Abhay Vijayvargia,Ajay Nagpal,Kundeshwar Pundalik,Atharva Savarkar,Smita Gautam,Pankaj Singh,Rohit Saluja,Ganesh Ramakrishnan

Main category: cs.CL

TL;DR: Krishi Sathi is an AI-powered agricultural chatbot designed to provide personalized, accessible advice to Indian farmers, demonstrating improved quality and accessibility of digital agricultural support through its unique methodology.

Details

Motivation: Indian farmers often lack timely, accessible, and language-friendly agricultural advice, especially in rural areas with low literacy. This paper aims to address this gap in accessibility by presenting Krishi Sathi. Method: The chatbot uses an IFT model refined through fine-tuning on Indian agricultural knowledge across three curated datasets. Krishi Sathi follows a structured, multi-turn conversation flow and performs Retrieval-Augmented Generation (RAG) by fetching information from a curated agricultural database before generating a tailored response. The chatbot supports both English and Hindi languages, with speech input and output features for accessibility. Result: Krishi Sathi achieved a query response accuracy of 97.53%, 91.35% contextual relevance and personalization, and a query completion rate of 97.53%. The average response time remained under 6 seconds, ensuring timely support for users across both English and Hindi interactions. Conclusion: Krishi Sathi, an AI-powered agricultural chatbot, demonstrates how combining intent-driven dialogue flows, instruction-tuned models, and retrieval-based generation can improve the quality and accessibility of digital agricultural support in India. Abstract: Indian farmers often lack timely, accessible, and language-friendly agricultural advice, especially in rural areas with low literacy. To address this gap in accessibility, this paper presents a novel AI-powered agricultural chatbot, Krishi Sathi, designed to support Indian farmers by providing personalized, easy-to-understand answers to their queries through both text and speech. The system's intelligence stems from an IFT model, subsequently refined through fine-tuning on Indian agricultural knowledge across three curated datasets. 
Unlike traditional chatbots that respond to one-off questions, Krishi Sathi follows a structured, multi-turn conversation flow to gradually collect the necessary details from the farmer, ensuring the query is fully understood before generating a response. Once the intent and context are extracted, the system performs Retrieval-Augmented Generation (RAG) by first fetching information from a curated agricultural database and then generating a tailored response using the IFT model. The chatbot supports both English and Hindi languages, with speech input and output features (via ASR and TTS) to make it accessible for users with low literacy or limited digital skills. This work demonstrates how combining intent-driven dialogue flows, instruction-tuned models, and retrieval-based generation can improve the quality and accessibility of digital agricultural support in India. This approach yielded strong results, with the system achieving a query response accuracy of 97.53%, 91.35% contextual relevance and personalization, and a query completion rate of 97.53%. The average response time remained under 6 seconds, ensuring timely support for users across both English and Hindi interactions.

[4] Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

Jaydip Sen,Harshitha Puvvala,Subhasis Dasgupta

Main category: cs.CL

TL;DR: This paper proposes the Hierarchical Verification Tree (HVT), a novel framework for accelerating large language model inference by prioritizing high-likelihood drafts and pruning suboptimal candidates, resulting in improved efficiency and reduced energy consumption.

Details

Motivation: Large language models face challenges in inference efficiency due to their autoregressive nature. Traditional speculative decoding methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. Method: The study introduces the Hierarchical Verification Tree (HVT), a framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed. Result: Experimental evaluations show that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. Conclusion: HVT provides a promising approach to accelerate large language model inference, offering efficiency improvements in terms of inference time and energy consumption without compromising output quality. Abstract: Large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks but face persistent challenges in inference efficiency due to their autoregressive nature. While speculative decoding and beam sampling offer notable improvements, traditional methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. This work proposes the Hierarchical Verification Tree (HVT), a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed to ensure correctness and efficiency. Integration with standard LLM inference pipelines is achieved without requiring retraining or architecture modification. 
Experimental evaluations across multiple datasets and models demonstrate that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. The findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.

[5] WINELL: Wikipedia Never-Ending Updating with LLM Agents

Revanth Gangi Reddy,Tanay Dixit,Jiaxin Qin,Cheng Qian,Daniel Lee,Jiawei Han,Kevin Small,Xing Fan,Ruhi Sarikaya,Heng Ji

Main category: cs.CL

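TL;DR: This paper proposes WiNELL, an LLM-agent-based framework for continuously updating Wikipedia articles. It uses a multi-agent framework to aggregate online information, select new knowledge, and generate precise edit suggestions, improving the update efficiency and information coverage of the knowledge base.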

Details

Motivation: Wikipedia relies on manual human editing, which makes it hard to keep content up to date; the vision of continuous knowledge acquisition in NELL and advances in LLM agents offer a new approach to automatically updating knowledge bases. Method: A multi-agent framework aggregates online information, selects new knowledge about a target entity, and generates precise edit suggestions; the editing models are trained on Wikipedia's history of human edits so that their behavior is consistent with that of human editors. Result: WiNELL's editor models outperform open-source instruction-following models and closed-source LLMs (e.g., GPT-4o) in information coverage and editing efficiency, and end-to-end evaluation shows that the system can identify and suggest timely factual updates. Conclusion: WiNELL demonstrates the potential of LLM agents for continuously updating knowledge bases and opens a new research direction in automated knowledge management. Abstract: Wikipedia, a vast and continuously consulted knowledge base, faces significant challenges in maintaining up-to-date content due to its reliance on manual human editors. Inspired by the vision of continuous knowledge acquisition in NELL and fueled by advances in LLM-based agents, this paper introduces WiNELL, an agentic framework for continuously updating Wikipedia articles. Our approach employs a multi-agent framework to aggregate online information, select new and important knowledge for a target entity in Wikipedia, and then generate precise edit suggestions for human review. Our fine-grained editing models, trained on Wikipedia's extensive history of human edits, enable incorporating updates in a manner consistent with human editing behavior. Our editor models outperform both open-source instruction-following baselines and closed-source LLMs (e.g., GPT-4o) in key information coverage and editing efficiency. End-to-end evaluation on high-activity Wikipedia pages demonstrates WiNELL's ability to identify and suggest timely factual updates. This opens up a promising research direction in LLM agents for automatically updating knowledge bases in a never-ending fashion.

[6] GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models

Ashutosh Bandooni,Brindha Subburaj

Main category: cs.CL

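TL;DR: GanitBench is a benchmark of visual mathematics questions in English and Hindi that aims to address the lack of non-English benchmarks and evaluates models in zero-shot and two-shot chain-of-thought settings.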

Details

Motivation: In recent years, benchmarks for Vision Language Models (VLMs) across many fields and domains have become more common, but they are usually monolingual and mostly in English. In addition, Hindi datasets are scarce for tasks other than comprehension and translation. Method: 1527 vision-only questions were collected from India's JEE Advanced and CBSE Boards examinations and used to evaluate two closed-source models in zero-shot Chain-of-Thought (CoT) and two-shot CoT settings. Result: GPT-4o mini is the stronger model on GanitBench, with a highest average accuracy of 38.15%. Under the "Double Lock" constraint, model performance drops considerably, and two-shot CoT is the more effective setting in that environment. Conclusion: GanitBench is a challenging benchmark in English and Hindi that aims to promote the inclusion of languages like Hindi in research. Abstract: Benchmarks for evaluating reasoning among Vision Language Models (VLMs) on several fields and domains are being curated more frequently over the last few years. However, these are often monolingual, mostly available in English. Additionally, there is also a lack of datasets available in Hindi on tasks apart from comprehension and translation. We introduce GanitBench, a tough benchmark consisting of 1527 vision-only questions covering several topics in Mathematics, available in English and Hindi. Collected from two major examinations from India, the JEE Advanced and the CBSE Boards examinations, this benchmark includes questions in the form of images comprising figures essential to a question as well as text. We evaluate two closed-source models for the same, in zero-shot Chain-of-Thought (CoT) and two-shot CoT settings. GPT-4o mini is found to be the more dominant model on the benchmark, with its highest average accuracy being 38.15%. We also evaluate models through a "Double Lock" constraint, which brings down the performance of the models by considerable margins. We observe that two-shot CoT appears to be a more effective setting under this environment. Performance of the two VLMs also decreases when answering the same questions in the Hindi language. We hope to facilitate the inclusion of languages like Hindi in research through our work.

[7] AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Yanting Wang,Runpeng Geng,Ying Chen,Jinyuan Jia

Main category: cs.CL

TL;DR: This paper proposes AttnTrace, an efficient context traceback method that also improves prompt injection detection for long-context LLMs.

Details

Motivation: Existing context traceback solutions such as TracLLM are computationally expensive; a more efficient and accurate method is needed to improve the interpretability and trustworthiness of LLM outputs. Method: The authors introduce two techniques designed to enhance the effectiveness of AttnTrace, which is based on attention weights, and evaluate it systematically. Result: AttnTrace is more accurate and efficient than existing methods at context traceback and can effectively detect prompt injection attacks in long contexts. Conclusion: AttnTrace is a new context traceback method based on attention weights that is more accurate and efficient than existing state-of-the-art context traceback methods and can improve the detection of prompt injection under long contexts through the attribution-before-detection paradigm. Abstract: Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context--often consisting of texts retrieved from a knowledge database or memory--and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. 
As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.
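
To make the general idea of attention-based traceback concrete, here is a naive sketch that scores each context text by the average attention mass it receives from the response tokens. This is only an assumed baseline for illustration: the abstract does not describe AttnTrace's two techniques, so nothing below should be read as the authors' method. The function name, the attention matrix, and the segment spans are all hypothetical.

```python
def rank_context_segments(attention, segment_spans, top_k=1):
    """Rank context segments by mean attention received per response token.

    attention: list of rows, one per response token; each row holds the
        attention weights over the context tokens (hypothetical input).
    segment_spans: list of (start, end) half-open token index ranges,
        one per retrieved text in the context.
    Returns the indices of the top_k highest-scoring segments.
    """
    scores = []
    for seg_id, (start, end) in enumerate(segment_spans):
        # Total attention mass landing in this segment, summed over rows.
        total = sum(sum(row[start:end]) for row in attention)
        scores.append((total / len(attention), seg_id))
    # Highest score first; ties broken by lower segment index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [seg_id for _, seg_id in scores[:top_k]]


# Toy example: 2 response tokens attending over 4 context tokens,
# split into two segments of 2 tokens each.
toy_attention = [
    [0.1, 0.1, 0.6, 0.2],
    [0.2, 0.0, 0.5, 0.3],
]
toy_spans = [(0, 2), (2, 4)]
top = rank_context_segments(toy_attention, toy_spans, top_k=1)
print(top)
```

In the toy input the second segment receives most of the attention mass, so it is returned as the most likely source of the response.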

[8] Majority Bit-Aware Watermarking For Large Language Models

Jiahao Xu,Rui Hu,Zikai Zhang

Main category: cs.CL

TL;DR: This paper introduces MajorMark and MajorMark$^+$, novel watermarking methods that improve the balance between text quality and decoding accuracy for Large Language Models.

Details

Motivation: Concerns about the misuse of Large Language Models (LLMs) in generating harmful or deceptive content have led to the need for improved watermarking techniques that balance text quality and decoding accuracy. Method: MajorMark uses majority bit-aware encoding to select preferred token sets and employs a clustering-based decoding strategy. MajorMark$^+$ partitions messages into blocks for independent encoding and deterministic decoding. Result: Experiments on state-of-the-art LLMs show that MajorMark and MajorMark$^+$ outperform prior methods in both decoding accuracy and text generation quality. Conclusion: MajorMark and MajorMark$^+$ watermarking methods significantly enhance decoding accuracy and text generation quality compared to prior multi-bit watermarking baselines. Abstract: The growing deployment of Large Language Models (LLMs) in real-world applications has raised concerns about their potential misuse in generating harmful or deceptive content. To address this issue, watermarking techniques have emerged as a promising solution by embedding identifiable binary messages into generated text for origin verification and misuse tracing. While recent efforts have explored multi-bit watermarking schemes capable of embedding rich information such as user identifiers, they typically suffer from the fundamental trade-off between text quality and decoding accuracy: to ensure reliable message decoding, they have to restrict the size of preferred token sets during encoding, yet such restrictions reduce the quality of the generated content. In this work, we propose MajorMark, a novel watermarking method that improves this trade-off through majority bit-aware encoding. MajorMark selects preferred token sets based on the majority bit of the message, enabling a larger and more flexible sampling of tokens. 
In contrast to prior methods that rely on token frequency analysis for decoding, MajorMark employs a clustering-based decoding strategy, which maintains high decoding accuracy even when the preferred token set is large, thus preserving both content quality and decoding accuracy. We further introduce MajorMark$^+$, which partitions the message into multiple blocks to independently encode and deterministically decode each block, thereby further enhancing the quality of watermarked text and improving decoding accuracy. Extensive experiments on state-of-the-art LLMs demonstrate that our methods significantly enhance both decoding accuracy and text generation quality, outperforming prior multi-bit watermarking baselines.

[9] Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models

Subhey Sadi Rahman,Md. Adnanul Islam,Md. Mahbub Alam,Musarrat Zeba,Md. Abdur Rahman,Sadia Sultana Chowa,Mohaimenul Azam Khan Raiaan,Sami Azam

Main category: cs.CL

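TL;DR: This review examines how the factual accuracy of content generated by large language models (LLMs) is evaluated, emphasizes the need for strong fact-checking frameworks, and proposes five guiding research questions to advance more trustworthy and context-aware language models.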

Details

Motivation: Because large language models (LLMs) are trained on internet corpora that contain inaccurate or misleading content, they can generate misinformation, so robust fact-checking is needed. Method: The review systematically analyzes the relevant literature from 2020 to 2025, examining methods for evaluating the factual accuracy of LLM-generated content and mitigation techniques. Result: Key findings include the limitations of current evaluation metrics, the value of grounding outputs with validated external evidence, and the need for domain-specific customization to improve factual consistency. Conclusion: The review stresses the importance of building LLMs that are not only accurate and explainable but also suited to domain-specific fact-checking, contributing to research toward more trustworthy and context-aware language models. Abstract: Large Language Models (LLMs) are trained on vast and diverse internet corpora that often include inaccurate or misleading content. Consequently, LLMs can generate misinformation, making robust fact-checking essential. This review systematically analyzes how LLM-generated content is evaluated for factual accuracy by exploring key challenges such as hallucinations, dataset limitations, and the reliability of evaluation metrics. The review emphasizes the need for strong fact-checking frameworks that integrate advanced prompting strategies, domain-specific fine-tuning, and retrieval-augmented generation (RAG) methods. It proposes five research questions that guide the analysis of the recent literature from 2020 to 2025, focusing on evaluation methods and mitigation techniques. The review also discusses the role of instruction tuning, multi-agent reasoning, and external knowledge access via RAG frameworks. Key findings highlight the limitations of current metrics, the value of grounding outputs with validated external evidence, and the importance of domain-specific customization to improve factual consistency. Overall, the review underlines the importance of building LLMs that are not only accurate and explainable but also tailored for domain-specific fact-checking. These insights contribute to the advancement of research toward more trustworthy and context-aware language models.

[10] An Entity Linking Agent for Question Answering

Yajie Luo,Yihong Wu,Muzhi Li,Fengran Mo,Jia Ao Sun,Xinyu Wang,Liheng Ma,Yingxue Zhang,Jian-Yun Nie

Main category: cs.CL

TL;DR: This paper proposes an entity linking agent based on a Large Language Model to improve entity linking performance in short, ambiguous QA contexts.

Details

Motivation: Most existing entity linking methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. Method: An entity linking agent was developed that simulates human cognitive workflows to actively identify entity mentions, retrieve candidate entities, and make decisions. Result: Experiments on tool-based entity linking and QA task evaluation confirmed the robustness and effectiveness of the proposed agent. Conclusion: The proposed entity linking agent based on a Large Language Model effectively improves the performance of entity linking in short, ambiguous user questions in QA tasks. Abstract: Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide accurate answers. Entity Linking (EL) plays a critical role in linking natural language mentions to KB entries. However, most existing EL methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. We propose an entity linking agent for QA, based on a Large Language Model that simulates human cognitive workflows. The agent actively identifies entity mentions, retrieves candidate entities, and makes decisions. To verify the effectiveness of our agent, we conduct two experiments: tool-based entity linking and QA task evaluation. The results confirm the robustness and effectiveness of our agent.

[11] Sotopia-RL: Reward Design for Social Intelligence

Haofei Yu,Zhengyang Qi,Yining Zhao,Kolby Nottingham,Keyang Xuan,Bodhisattwa Prasad Majumder,Hao Zhu,Paul Pu Liang,Jiaxuan You

Main category: cs.CL

TL;DR: The paper proposes Sotopia-RL, a novel reinforcement learning framework that improves social intelligence in LLMs by addressing challenges like partial observability and multi-dimensionality, achieving superior performance in social goal completion tasks.

Details

Motivation: Social intelligence is crucial for large language models (LLMs), and while reinforcement learning (RL) is suitable for training such models, challenges like partial observability and multi-dimensionality hinder its effectiveness. Method: The study introduces Sotopia-RL, a framework that transforms coarse episode-level feedback into utterance-level, multi-dimensional rewards to enhance reinforcement learning in social interactions. Result: Sotopia-RL achieves 7.17 on Sotopia-hard and 8.31 on Sotopia-full, significantly outperforming existing approaches. Conclusion: Sotopia-RL addresses the challenges of partial observability and multi-dimensionality in social interactions, achieving state-of-the-art results in social goal completion scores. Abstract: Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that set barriers for RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. 
Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm the necessity of both utterance-level credit assignment and multi-dimensional reward design for RL training. Our implementation is publicly available at: https://github.com/sotopia-lab/sotopia-rl.

[12] CoAct-1: Computer-using Agents with Coding as Actions

Linxin Song,Yutong Dai,Viraj Prabhu,Jieyu Zhang,Taiwei Shi,Li Li,Junnan Li,Silvio Savarese,Zeyuan Chen,Jieyu Zhao,Ran Xu,Caiming Xiong

Main category: cs.CL

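TL;DR: This study introduces CoAct-1, a novel multi-agent system that combines graphical user interface (GUI) interaction with programmatic actions; by dynamically delegating subtasks to a GUI Operator or a Programmer agent, it significantly improves task success rate and efficiency.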

Details

Motivation: Conventional autonomous agents that operate computers through graphical user interfaces (GUIs) are often inefficient and unreliable on complex, long-horizon tasks. Method: The authors propose CoAct-1, a multi-agent system that combines GUI-based control with direct programmatic execution. Result: On the OSWorld benchmark, CoAct-1 achieves a 60.76% success rate and reduces the average number of steps needed to complete a task to 10.15. Conclusion: Integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation. Abstract: Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.

[13] CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation

Raymond Wilson,Cole Graham,Chase Carter,Zefeng Yang,Ruiqi Gu

Main category: cs.CL

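TL;DR: This paper proposes CAP-LLM, a framework that integrates user preferences and factual consistency constraints to generate more accurate and personalized news headlines.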

Details

Motivation: Existing methods struggle to capture complex user interests and ensure factual consistency, which leads to generic or misleading headlines. Method: The authors propose the CAP-LLM framework, which comprises a User Preference Encoder, a Context Injection Adapter, and a Fact-Consistency Reinforcement Module. Result: Evaluated on the PENS dataset, CAP-LLM achieves state-of-the-art performance on all metrics and significantly improves factual consistency, personalization, and content coverage. Conclusion: CAP-LLM effectively balances personalization and factual accuracy and performs strongly in news headline generation. Abstract: In the era of information overload, personalized news headline generation is crucial for engaging users by tailoring content to their preferences while accurately conveying news facts. Existing methods struggle with effectively capturing complex user interests and ensuring factual consistency, often leading to generic or misleading headlines. Leveraging the unprecedented capabilities of Large Language Models (LLMs) in text generation, we propose Context-Augmented Personalized LLM (CAP-LLM), a novel framework that integrates user preferences and factual consistency constraints into a powerful pre-trained LLM backbone. CAP-LLM features a User Preference Encoder to capture long-term user interests, a Context Injection Adapter to seamlessly integrate these preferences and current article context into the LLM's generation process, and a Fact-Consistency Reinforcement Module employing a novel contrastive loss to mitigate hallucination. Evaluated on the real-world PENS dataset, CAP-LLM achieves state-of-the-art performance across all metrics. Notably, it significantly improves factual consistency (FactCC of 87.50) over strong baselines like BART (86.67), while simultaneously enhancing personalization (Pc(avg) 2.73, Pc(max) 17.25) and content coverage (ROUGE-1 26.55, ROUGE-2 9.95, ROUGE-L 23.01). Our ablation studies, human evaluations, and sensitivity analyses further validate the effectiveness of each component and the robustness of our approach, demonstrating CAP-LLM's ability to achieve a superior balance between personalization and factual accuracy in news headline generation.

[14] Data and AI governance: Promoting equity, ethics, and fairness in large language models

Alok Abhishek,Lisa Erickson,Tushar Bandopadhyay

Main category: cs.CL

TL;DR: This paper proposes a governance framework to assess and mitigate bias in Large Language Models throughout their lifecycle, aiming to enhance the safety, fairness, and ethical alignment of generative AI systems.

Details

Motivation: The motivation of the paper is to systematically govern, assess, and quantify bias in the lifecycle of machine learning models, especially in Large Language Models (LLMs), to ensure socially responsible and ethically aligned generative AI applications. Method: The paper builds upon the authors' foundational work on the Bias Evaluation and Assessment Test Suite (BEATS) for Large Language Models (LLMs) and discusses a data and AI governance framework to address Bias, Ethics, Fairness, and Factuality within LLMs. Result: The result is a comprehensive governance framework that enables rigorous benchmarking of LLMs prior to deployment, facilitates continuous real-time evaluation, and proactively governs LLM-generated responses in real-world applications. Conclusion: The paper concludes that implementing data and AI governance across the AI development lifecycle can significantly enhance the safety and responsibility of GenAI systems, helping to mitigate risks of discrimination and protect against reputational harm. Abstract: In this paper, we cover approaches to systematically govern, assess and quantify bias across the complete life cycle of machine learning models, from initial development and validation to ongoing production monitoring and guardrail implementation. Building upon our foundational work on the Bias Evaluation and Assessment Test Suite (BEATS) for Large Language Models, the authors share prevalent bias and fairness related gaps in Large Language Models (LLMs) and discuss a data and AI governance framework to address Bias, Ethics, Fairness, and Factuality within LLMs. The data and AI governance approach discussed in this paper is suitable for practical, real-world applications, enabling rigorous benchmarking of LLMs prior to production deployment, facilitating continuous real-time evaluation, and proactively governing LLM-generated responses. 
By implementing the data and AI governance across the life cycle of AI development, organizations can significantly enhance the safety and responsibility of their GenAI systems, effectively mitigating risks of discrimination and protecting against potential reputational or brand-related harm. Ultimately, through this article, we aim to contribute to advancement of the creation and deployment of socially responsible and ethically aligned generative artificial intelligence powered applications.

[15] Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency

Md Arafat Sultan,Ramón Fernandez Astudillo

Main category: cs.CL

TL;DR: Early hypothesis pruning improves the token efficiency of self-consistency.

Details

Motivation: Self-consistency is simple and effective, but its high token consumption limits its practical utility, particularly for long chain-of-thought reasoning tasks. Method: All solutions are generated in parallel, and intermediate hypotheses are periodically pruned based on the model's confidence in each hypothesis and the lexical coverage provided by candidate subsets. Result: Across five LLMs on three math benchmarks, token efficiency improves for all models, by 10-35% in many cases. Conclusion: Early hypothesis pruning can make self-consistency more token-efficient while preserving its parallelism. Abstract: Despite its simplicity and efficacy, the high token expenditure of self-consistency can limit its practical utility. Here we investigate if self-consistency can be made more token-efficient for long chain-of-thought reasoning tasks, while preserving its parallelism, through early hypothesis pruning. Concretely, we generate all solutions in parallel, but periodically prune intermediate hypotheses that are deemed unnecessary based on two lightweight indicators: (a) the model's own confidence in individual hypotheses, and (b) lexical coverage of all current hypotheses by candidate subsets that are under consideration for continued retention. We design a fast weighted set cover algorithm that utilizes the two indicators; our evaluation of five LLMs on three math benchmarks shows that this method can improve token efficiency for all models, by 10-35% in many cases.
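The pruning idea in [15] can be illustrated with a generic greedy weighted set cover. This is only a sketch under stated assumptions, not the authors' algorithm: the function name `prune_hypotheses`, whitespace tokenization, and the scoring rule (confidence multiplied by the number of newly covered tokens) are all hypothetical choices made for illustration.

```python
def prune_hypotheses(hypotheses, confidences, coverage=1.0):
    """Greedy weighted set cover over hypothesis tokens (illustrative only).

    hypotheses:  list of partial solutions as strings
    confidences: one model-confidence score per hypothesis (assumed in [0, 1])
    coverage:    fraction of the token universe that must stay covered
    Returns the sorted indices of the hypotheses that are kept.
    """
    # Assumption: lexical coverage is measured over whitespace-separated tokens.
    token_sets = [set(h.split()) for h in hypotheses]
    universe = set().union(*token_sets)
    target = coverage * len(universe)

    covered, kept = set(), []
    remaining = set(range(len(hypotheses)))
    while len(covered) < target and remaining:
        # Assumed score: confidence times newly covered tokens; ties go to the lower index.
        best = max(
            remaining,
            key=lambda i: (confidences[i] * len(token_sets[i] - covered), -i),
        )
        gain = token_sets[best] - covered
        if not gain:
            break  # nothing left adds coverage
        kept.append(best)
        covered |= gain
        remaining.discard(best)
    return sorted(kept)
```

Hypotheses outside the returned index list would be the ones dropped at a pruning step; how often pruning runs and how confidence is computed are left open here.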

[16] Are Today's LLMs Ready to Explain Well-Being Concepts?

Bohan Jiang,Dawei Li,Zhen Tan,Chengshuai Zhao,Huan Liu

Main category: cs.CL

TL;DR: This paper explores how fine-tuning LLMs using SFT and DPO can enhance the quality of well-being-related explanations, evaluated through a novel LLM-as-a-judge framework.

Details

Motivation: The motivation is driven by the increasing reliance on LLMs to understand well-being concepts and the need for accurate, audience-tailored explanations that meet the expectations of users with varying levels of expertise. Method: The researchers constructed a dataset of 43,880 explanations for 2,194 well-being concepts using ten diverse LLMs. They introduced a principle-guided LLM-as-a-judge evaluation framework with dual judges to assess explanation quality. Fine-tuning was performed using SFT and DPO to enhance model performance. Result: The results show that (1) LLM judges align well with human evaluations, (2) explanation quality varies across models, audiences, and categories, and (3) DPO- and SFT-finetuned models outperform larger counterparts, highlighting the effectiveness of preference-based learning for explanation tasks. Conclusion: The study concludes that fine-tuning open-source LLMs using SFT and DPO can significantly improve the quality of explanations regarding well-being concepts, and the proposed LLM-as-a-judge evaluation framework aligns well with human evaluations. Abstract: Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. 
Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) The proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-finetuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.

Xinyu Zhao,Zhen Tan,Maya Enisman,Minjae Seo,Marta R. Durantini,Dolores Albarracin,Tianlong Chen

Main category: cs.CL

TL;DR: This paper introduces a transparent social robot co-facilitator using an interpretable concept bottleneck model (CBM) to support human facilitators in group meetings by detecting social dynamics and enabling real-time corrections, outperforming black-box models and transferring expertise from experienced facilitators to novices.

Details

Motivation: The motivation stems from the challenges faced by human facilitators in managing group meetings, particularly the cognitive load and need for timely interventions. There is a gap in existing technologies that can transparently interpret social interactions and support facilitators without relying on opaque 'black box' models. Method: The researchers developed a transfer learning framework using a concept bottleneck model (CBM), which interprets social dynamics in group meetings based on human-interpretable concepts like engagement and sentiment. The CBM was trained by distilling knowledge from foundation models (FMs) and was evaluated on its ability to predict the need for intervention in real-time while allowing human correction. Result: The concept-driven CBM system significantly outperformed zero-shot foundation models in predicting the need for intervention during group meetings. It enabled real-time human correction and demonstrated robust knowledge transfer across different groups and from expert to novice facilitators. Conclusion: The study concludes that the concept bottleneck model (CBM) effectively transfers expertise from senior facilitators to novice practitioners, enhancing the ability to detect and address social dynamics in group meetings, and providing a reliable blueprint for human-robot collaboration in complex social domains. Abstract: Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. 
The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but "black box" foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot's reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert's cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains.

[18] HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

Yurun Chen,Xavier Hu,Yuhan Liu,Keting Yin,Juncheng Li,Zhuosheng Zhang,Shengyu Zhang

Main category: cs.CL

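TL;DR: HarmonyGuard proposes a multi-agent collaborative framework that uses policy enhancement and objective optimization to balance task performance and safety in open web environments, substantially improving policy compliance and task completion.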

Details

Motivation: Current research is largely limited to single-objective optimization or single-turn scenarios and lacks the ability to jointly optimize safety and utility in open web environments; HarmonyGuard addresses this gap. Method: The HarmonyGuard framework has two core components: (1) a Policy Agent that automatically extracts and maintains security policies, and (2) a Utility Agent that performs dual-objective optimization through Markovian real-time reasoning and metacognitive capabilities. Result: On multiple benchmarks, HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, with policy compliance above 90% on all tasks. Conclusion: HarmonyGuard is an effective multi-agent collaborative framework that improves task completion while strengthening safety, addressing the lack of joint optimization of safety and utility. Abstract: Large language models enable agents to autonomously perform tasks in open web environments. However, as hidden threats within the web evolve, web agents face the challenge of balancing task performance with emerging risks during long-sequence operations. Although this challenge is critical, current research remains limited to single-objective optimization or single-turn scenarios, lacking the capability for collaborative optimization of both safety and utility in web environments. To address this gap, we propose HarmonyGuard, a multi-agent collaborative framework that leverages policy enhancement and objective optimization to jointly improve both utility and safety. HarmonyGuard features a multi-agent architecture characterized by two fundamental capabilities: (1) Adaptive Policy Enhancement: We introduce the Policy Agent within HarmonyGuard, which automatically extracts and maintains structured security policies from unstructured external documents, while continuously updating policies in response to evolving threats. (2) Dual-Objective Optimization: Based on the dual objectives of safety and utility, the Utility Agent integrated within HarmonyGuard performs the Markovian real-time reasoning to evaluate the objectives and utilizes metacognitive capabilities for their optimization. Extensive evaluations on multiple benchmarks show that HarmonyGuard improves policy compliance by up to 38% and task completion by up to 20% over existing baselines, while achieving over 90% policy compliance across all tasks. Our project is available here: https://github.com/YurunChen/HarmonyGuard.

[19] Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing

Xiaopeng Li,Shasha Li,Xi Wang,Shezheng Song,Bin Ji,Shangwen Wang,Jun Ma,Xiaodong Liu,Mina Liu,Jie Yu

Main category: cs.CL

TL;DR: SMEdit improves model editing for LLMs by introducing MBPS and weight regularization, achieving better performance in low-data settings and faster training.

Details

Motivation: Existing MLBME methods perform poorly in low-data scenarios and suffer from training inefficiency due to KL divergence computation. Method: SMEdit employs Multiple Backpropagation Steps (MBPS) and norm regularization on weight updates to enhance editing performance and training efficiency. Result: Experimental results show that SMEdit outperforms prior MLBME methods on two datasets and two LLMs, with the MBPS strategy being adaptable to existing methods. Conclusion: SMEdit demonstrates improved performance in low-data scenarios and training efficiency, surpassing previous MLBME baselines. Abstract: Large Language Models (LLMs) underpin many AI applications, but their static nature makes updating knowledge costly. Model editing offers an efficient alternative by injecting new information through targeted parameter modifications. In particular, meta-learning-based model editing (MLBME) methods have demonstrated notable advantages in both editing effectiveness and efficiency. Despite this, we find that MLBME exhibits suboptimal performance in low-data scenarios, and its training efficiency is bottlenecked by the computation of KL divergence. To address these, we propose Step More Edit (SMEdit), a novel MLBME method that adopts Multiple BackPropagation Steps (MBPS) to improve editing performance under limited supervision and a norm regularization on weight updates to improve training efficiency. Experimental results on two datasets and two LLMs demonstrate that SMEdit outperforms prior MLBME baselines and the MBPS strategy can be seamlessly integrated into existing methods to further boost their performance. Our code will be released soon.

[20] ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

Zechen Li,Baiyu Chen,Hao Xue,Flora D. Salim

Main category: cs.CL

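TL;DR: ZARA is an interpretable zero-shot framework for human activity recognition from raw motion time-series that requires no fine-tuning and achieves state-of-the-art performance.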

Details

Motivation: Existing human activity recognition methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear, while existing attempts to use large language models (LLMs) fall short in accuracy and interpretability. Method: The paper proposes ZARA, an agent-based zero-shot framework that integrates an automatically derived pair-wise feature knowledge base, a multi-sensor retrieval module, and a hierarchical agent pipeline to recognize and explain activities directly from raw motion time-series. Result: On 8 HAR benchmarks, ZARA achieves state-of-the-art zero-shot performance, exceeding the strongest baselines by 2.53x in macro F1, and ablation studies confirm the necessity of each module. Conclusion: ZARA is a promising solution for interpretable human activity recognition without fine-tuning, advancing trustworthy, plug-and-play motion time-series analysis. Abstract: Motion sensor time-series are central to human activity recognition (HAR), with applications in health, sports, and smart devices. However, existing methods are trained for fixed activity sets and require costly retraining when new behaviours or sensor setups appear. Recent attempts to use large language models (LLMs) for HAR, typically by converting signals into text or images, suffer from limited accuracy and lack verifiable interpretability. We propose ZARA, the first agent-based framework for zero-shot, explainable HAR directly from raw motion time-series. ZARA integrates an automatically derived pair-wise feature knowledge base that captures discriminative statistics for every activity pair, a multi-sensor retrieval module that surfaces relevant evidence, and a hierarchical agent pipeline that guides the LLM to iteratively select features, draw on this evidence, and produce both activity predictions and natural-language explanations. ZARA enables flexible and interpretable HAR without any fine-tuning or task-specific classifiers. Extensive experiments on 8 HAR benchmarks show that ZARA achieves SOTA zero-shot performance, delivering clear reasoning while exceeding the strongest baselines by 2.53x in macro F1. Ablation studies further confirm the necessity of each module, marking ZARA as a promising step toward trustworthy, plug-and-play motion time-series analysis. Our codes are available at https://github.com/zechenli03/ZARA.

[21] Large Reasoning Models Are Autonomous Jailbreak Agents

Thilo Hagendorff,Erik Derner,Nuria Oliver

Main category: cs.CL

TL;DR: Large reasoning models can effectively act as autonomous jailbreaking agents, bypassing safety mechanisms in other AI models with a high success rate, making this process accessible to non-experts.

Details

Motivation: The motivation is to investigate whether large reasoning models (LRMs) can simplify and scale the process of jailbreaking AI models, making it accessible to non-experts, and to assess the implications for AI safety. Method: The researchers evaluated four LRMs by using them as autonomous adversaries to conduct multi-turn conversations with nine target models. The LRMs were given a system prompt and then planned and executed jailbreaks without further supervision. They tested these capabilities using a benchmark of 70 harmful prompts across seven sensitive domains. Result: The experiments showed an overall attack success rate of 97.14% across all model combinations, indicating that LRMs are highly effective at jailbreaking other models. Conclusion: The study concludes that large reasoning models can systematically undermine the safety mechanisms of other AI models, emphasizing the need to align frontier models to resist jailbreak attempts and prevent their misuse as jailbreak agents. Abstract: Jailbreaking -- bypassing built-in safety mechanisms in AI models -- has traditionally required complex technical procedures or specialized human expertise. In this study, we show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking, converting it into an inexpensive activity accessible to non-experts. We evaluated the capabilities of four LRMs (DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, Qwen3 235B) to act as autonomous adversaries conducting multi-turn conversations with nine widely used target models. LRMs received instructions via a system prompt, before proceeding to planning and executing jailbreaks with no further supervision. We performed extensive experiments with a benchmark of harmful prompts composed of 70 items covering seven sensitive domains. This setup yielded an overall attack success rate across all model combinations of 97.14%. 
Our study reveals an alignment regression, in which LRMs can systematically erode the safety guardrails of other models, highlighting the urgent need to further align frontier models not only to resist jailbreak attempts, but also to prevent them from being co-opted into acting as jailbreak agents.

[22] DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation

Jiabing Yang,Yixiang Chen,Zichen Wen,Chenhang Cui,Peiyan Li,Yuan Xu,Bowen Fang,Yan Huang,Liang Wang

Main category: cs.CL

TL;DR: This paper proposes DTPA, a framework for controllable text generation that improves the controllability of long texts by dynamically adjusting prefix attention and optionally augmenting prompts.

Details

Motivation: Controllable text generation for long sequences remains underexplored, and existing methods like Air-Decoding face declining controllability with increasing sequence length. Method: Dynamic Token-level Prefix Augmentation (DTPA) based on Air-Decoding is proposed to enhance controllability by dynamically amplifying attention to prefixes and optionally augmenting the original prompt. Result: Experiments show DTPA outperforms other methods in attribute control while maintaining fluency, diversity, and topic relevance, with further analysis highlighting its effectiveness in long text generation. Conclusion: DTPA is an effective framework for controllable text generation that outperforms other methods in attribute control while maintaining text quality. Abstract: Controllable Text Generation (CTG) is a vital subfield in Natural Language Processing (NLP), aiming to generate text that aligns with desired attributes. However, previous studies commonly focus on the quality of controllable text generation for short sequences, while the generation of long-form text remains largely underexplored. In this paper, we observe that the controllability of texts generated by the powerful prefix-based method Air-Decoding tends to decline with increasing sequence length, which we hypothesize primarily arises from the observed decay in attention to the prefixes. Meanwhile, different types of prefixes including soft and hard prefixes are also key factors influencing performance. Building on these insights, we propose a lightweight and effective framework called Dynamic Token-level Prefix Augmentation (DTPA) based on Air-Decoding for controllable text generation. Specifically, it first selects the optimal prefix type for a given task. Then we dynamically amplify the attention to the prefix for the attribute distribution to enhance controllability, with a scaling factor growing exponentially as the sequence length increases. 
Moreover, based on the task, we optionally apply a similar augmentation to the original prompt for the raw distribution to balance text quality. After attribute distribution reconstruction, the generated text satisfies the attribute constraints well. Experiments on multiple CTG tasks demonstrate that DTPA generally outperforms other methods in attribute control while maintaining competitive fluency, diversity, and topic relevance. Further analysis highlights DTPA's superior effectiveness in long text generation.
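The abstract of [22] describes amplifying attention to the prefix with a scaling factor that grows exponentially with sequence length. The sketch below shows one way such a rescaling could look on a single row of attention weights. The exact form of the factor is not given in the abstract, so `exp(gamma * step)`, the parameter `gamma`, and both function names are assumptions made purely for illustration.

```python
import math


def amplify_prefix_attention(weights, prefix_len, step, gamma=0.05):
    """Boost attention on prefix positions and renormalize (illustrative only).

    weights:    attention weights for one query position (non-negative, sum to 1)
    prefix_len: number of leading positions that belong to the prefix
    step:       current generation step, i.e. how long the sequence has grown
    gamma:      hypothetical growth rate of the scaling factor
    """
    scale = math.exp(gamma * step)  # assumed exponential schedule
    boosted = [w * scale if i < prefix_len else w for i, w in enumerate(weights)]
    total = sum(boosted)
    return [w / total for w in boosted]  # renormalize so the row sums to 1 again


def prefix_share(weights, prefix_len):
    """Total attention mass assigned to the prefix positions."""
    return sum(weights[:prefix_len])
```

With `step=0` the weights are unchanged; as `step` grows, the prefix share rises, which counteracts the attention decay the abstract hypothesizes.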

[23] PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG

Wang Chen,Guanqiang Qi,Weikang Li,Yang Li,Deguo Xia,Jizhou Huang

Main category: cs.CL

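TL;DR: PAIRS is a training-free framework that combines parametric and retrieved knowledge to adaptively decide whether to retrieve and how to select external information, improving the efficiency and accuracy of RAG systems.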

Details

Motivation: Current RAG systems have two key limitations: they inefficiently retrieve information for every query, and they may retrieve irrelevant documents when queries contain sparse information signals. Method: PAIRS uses a dual-path generation mechanism that produces a direct answer and a context-augmented answer based on self-generated pseudo-context; when the two diverge, it activates dual-path retrieval and an adaptive information selection module. Result: PAIRS reduces retrieval cost by about 25% on average (triggering retrieval for only 75% of queries) while improving accuracy, with average gains of +1.1% EM and +1.0% F1. Conclusion: By combining parametric and retrieved knowledge, PAIRS adaptively decides whether to retrieve and how to select external information, improving the efficiency and accuracy of RAG systems. Abstract: Retrieval-Augmented Generation (RAG) has become a cornerstone technique for enhancing large language models (LLMs) with external knowledge. However, current RAG systems face two critical limitations: (1) they inefficiently retrieve information for every query, including simple questions that could be resolved using the LLM's parametric knowledge alone, and (2) they risk retrieving irrelevant documents when queries contain sparse information signals. To address these gaps, we introduce Parametric-verified Adaptive Information Retrieval and Selection (PAIRS), a training-free framework that integrates parametric and retrieved knowledge to adaptively determine whether to retrieve and how to select external information. Specifically, PAIRS employs a dual-path generation mechanism: First, the LLM produces both a direct answer and a context-augmented answer using self-generated pseudo-context. When these outputs converge, PAIRS bypasses external retrieval entirely, dramatically improving the RAG system's efficiency. For divergent cases, PAIRS activates a dual-path retrieval (DPR) process guided by both the original query and self-generated contextual signals, followed by an Adaptive Information Selection (AIS) module that filters documents through weighted similarity to both sources. This simple yet effective approach can not only enhance efficiency by eliminating unnecessary retrievals but also improve accuracy through contextually guided retrieval and adaptive information selection.
Experimental results on six question-answering (QA) benchmarks show that PAIRS reduces retrieval costs by around 25% (triggering for only 75% of queries) while still improving accuracy, achieving +1.1% EM and +1.0% F1 over prior baselines on average.
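The control flow described in [23] (answer twice, skip retrieval on agreement, otherwise retrieve along two paths and filter by weighted similarity) can be sketched as follows. Everything here is a hypothetical stand-in rather than the PAIRS implementation: the callables `generate`, `make_pseudo_context`, `retrieve`, and `similarity` are supplied by the caller, and the agreement test (normalized string equality) and the weight `alpha` are assumptions.

```python
def normalize(answer):
    """Lowercase and collapse whitespace so trivially different answers compare equal."""
    return " ".join(answer.lower().split())


def pairs_answer(query, generate, make_pseudo_context, retrieve, similarity,
                 alpha=0.5, top_k=2):
    """Return (answer, retrieval_used) following the dual-path idea (illustrative only).

    generate(query, context) -> str   context may be None for the direct answer
    make_pseudo_context(query) -> str self-generated context
    retrieve(text) -> list[str]       documents for a query-like text
    similarity(text, doc) -> float    higher means more relevant
    """
    direct = generate(query, None)
    pseudo = make_pseudo_context(query)
    augmented = generate(query, pseudo)

    # Assumed convergence check: the two answers match after normalization.
    if normalize(direct) == normalize(augmented):
        return direct, False  # retrieval is bypassed

    # Dual-path retrieval: one path from the query, one from the pseudo-context.
    candidates = list(dict.fromkeys(list(retrieve(query)) + list(retrieve(pseudo))))

    # Selection by weighted similarity to both sources (alpha is hypothetical).
    ranked = sorted(
        candidates,
        key=lambda d: alpha * similarity(query, d) + (1 - alpha) * similarity(pseudo, d),
        reverse=True,
    )
    return generate(query, "\n".join(ranked[:top_k])), True
```

The reported saving comes from the early return: queries on which the two answers agree never reach the retriever.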

[24] Efficient Strategy for Improving Large Language Model (LLM) Capabilities

Julián Camilo Velandia Gutiérrez

Main category: cs.CL

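TL;DR: This work studies strategies for improving the efficiency of large language models in resource-constrained environments, including data processing and selection techniques, training strategies, and architectural adjustments, and validates their effectiveness.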

Details

Motivation: Large language models require significant computational resources, which constrains their large-scale deployment, so their efficiency needs to improve. Method: The work defines criteria for building reliable datasets, conducts controlled experiments with different configurations, and systematically evaluates the resulting variants. Result: Comparative tests measure the performance of the developed variants and validate the effectiveness of the proposed strategies. Conclusion: The work proposes an efficient strategy for improving the performance of large language models in resource-constrained environments and validates its effectiveness. Abstract: Large Language Models (LLMs) have become a milestone in the field of artificial intelligence and natural language processing. However, their large-scale deployment remains constrained by the need for significant computational resources. This work proposes starting from a base model to explore and combine data processing and careful data selection techniques, training strategies, and architectural adjustments to improve the efficiency of LLMs in resource-constrained environments and within a delimited knowledge base. The methodological approach included defining criteria for building reliable datasets, conducting controlled experiments with different configurations, and systematically evaluating the resulting variants in terms of capability, versatility, response time, and safety. Finally, comparative tests were conducted to measure the performance of the developed variants and to validate the effectiveness of the proposed strategies. This work is based on the master's thesis in Systems and Computer Engineering titled "Efficient Strategy for Improving the Capabilities of Large Language Models (LLMs)".

[25] ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

Zhongyi Zhou,Kohei Uehara,Haoyu Zhang,Jingtao Zhou,Lin Gu,Ruofei Du,Zheng Xu,Tatsuya Harada

Main category: cs.CL

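TL;DR: ToolGrad inverts the existing paradigm to produce ToolGrad-5k, a more complex tool-use dataset for LLMs that is generated more efficiently and addresses the limitations of prior methods.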

Details

Motivation: Prior work on generating tool-use datasets for LLMs suffers from inevitable annotation failures and inefficient data generation. Method: The paper introduces ToolGrad, an agentic framework that builds valid tool-use chains through an iterative process and then synthesizes the corresponding user queries. Result: The ToolGrad-5k dataset features more complex tool use, lower generation cost, and a 100% pass rate. Conclusion: Models trained on ToolGrad-5k outperform those trained on expensive baseline datasets, as well as proprietary LLMs, even on OOD benchmarks. Abstract: Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like DFS. This leads to inevitable annotation failures and low efficiency in data generation. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual "gradients", and then synthesizes corresponding user queries. This "answer-first" approach led to ToolGrad-5k, a dataset generated with more complex tool use, lower cost, and 100% pass rate. Experiments show that models trained on ToolGrad-5k outperform those on expensive baseline datasets and proprietary LLMs, even on OOD benchmarks.

[26] GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning

Jianghangfan Zhang,Yibo Yan,Kening Zheng,Xin Zou,Song Dai,Xuming Hu

Main category: cs.CL

TL;DR: This paper introduces GM-PRM, a novel framework that enhances multimodal mathematical reasoning by transforming Process Reward Models into active collaborators that analyze and correct reasoning errors, achieving superior performance with minimal data.

Details

Motivation: The motivation is to overcome the limitations of existing Multimodal Large Language Models (MLLMs) and Process Reward Models (PRMs), which struggle with complex mathematical reasoning and lack the ability to correct errors or provide explanations. Method: The study introduces the Generative Multimodal Process Reward Model (GM-PRM), which provides a fine-grained analysis of reasoning steps and generates corrections for errors. It employs a new inference strategy called Refined Best-of-N (Refined-BoN) to enhance solution quality using these corrections. Result: GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly improving the performance of policy models with a small training dataset of only 20K samples. Conclusion: The study concludes that the proposed GM-PRM significantly improves the performance of multimodal math reasoning with a novel approach that transforms PRMs into active reasoning collaborators, achieving state-of-the-art results on multiple benchmarks. Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. 
This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM's generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset. Our code will be released upon acceptance.

[27] Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks

Zhiwen Ruan,Yun Chen,Yutao Hou,Peng Li,Yang Liu,Guanhua Chen

Main category: cs.CL

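TL;DR: This paper studies over-memorization during the fine-tuning of large language models (LLMs), finding that over-memorized models keep high test accuracy while showing high test perplexity, and that over-memorization can reduce robustness and generalization.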

Details

Motivation: Pretrained large language models (LLMs) are fine-tuned on labeled data to improve instruction following and alignment with human values, but fine-tuning involves a previously unreported over-memorization phenomenon that may harm robustness and generalization. Method: The paper empirically studies the learning dynamics of LLM fine-tuning and analyzes the conditions that lead to over-memorization, such as the number of training epochs and large learning rates. Result: LLMs over-memorize training data during a specific stage of fine-tuning, showing high test perplexity while maintaining good test accuracy; over-memorized models also perform worse in robustness, out-of-distribution generalization, and generation diversity. Conclusion: Over-memorization occurs broadly across tasks, models, and fine-tuning methods, and overparameterized, extensively fine-tuned LLMs exhibit learning dynamics distinct from those of traditional machine learning models. Abstract: The pretrained large language models (LLMs) are finetuned with labeled data for better instruction following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal the uncovered over-memorization phenomenon during a specific stage of LLM finetuning. At this stage, the LLMs have excessively memorized training data and exhibit high test perplexity while maintaining good test accuracy. We investigate the conditions that lead to LLM over-memorization and find that training epochs and large learning rates contribute to this issue. Although models with over-memorization demonstrate comparable test accuracy to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation diversity. Our experiments unveil the over-memorization to be broadly applicable across different tasks, models, and finetuning methods. Our research highlights that overparameterized, extensively finetuned LLMs exhibit unique learning dynamics distinct from traditional machine learning models. Based on our observations of over-memorization, we provide recommendations on checkpoint and learning rate selection during finetuning.

[28] Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Xuan Qi,Rongwu Xu,Zhijing Jin

Main category: cs.CL

TL;DR: This paper introduces a novel, efficient data selection strategy for preference datasets, improving alignment of large language models with human preferences using only a fraction of the data.

Details

Motivation: The motivation is to address the lack of high-quality, cost-effective data selection methods in preference datasets for aligning large language models with human preferences. Method: The method involves a difficulty-based data selection strategy rooted in the DPO implicit reward mechanism, focusing on more challenging cases for improved alignment. Result: The proposed approach outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior results using only 10% of the original data. Conclusion: The paper concludes that selecting preference data examples based on smaller DPO implicit reward gaps improves data efficiency and model alignment, outperforming existing methods. Abstract: Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. The current work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.
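The selection rule in [28] can be made concrete with the standard DPO implicit reward, r(x, y) = β · (log π(y|x) − log π_ref(y|x)), where the gap for a preference pair is the reward of the chosen response minus that of the rejected one. The sketch below assumes sequence log-probabilities are already available. The function names, the tuple layout, the use of the signed (not absolute) gap, and the default `beta` are illustrative assumptions; the abstract does not specify which model supplies the policy log-probabilities.

```python
def implicit_reward_gap(policy_chosen, ref_chosen, policy_rejected, ref_rejected,
                        beta=0.1):
    """DPO implicit reward of the chosen response minus that of the rejected one.

    All four inputs are sequence log-probabilities (floats).
    """
    reward_chosen = beta * (policy_chosen - ref_chosen)
    reward_rejected = beta * (policy_rejected - ref_rejected)
    return reward_chosen - reward_rejected


def select_hardest(examples, fraction=0.1, beta=0.1):
    """Keep the `fraction` of preference pairs with the smallest reward gap.

    examples: list of (policy_chosen, ref_chosen, policy_rejected, ref_rejected)
    Returns the sorted indices of the selected pairs (at least one).
    """
    gaps = [implicit_reward_gap(*ex, beta=beta) for ex in examples]
    k = max(1, int(len(examples) * fraction))
    # Assumption: "smaller gap" means the smallest signed gap.
    order = sorted(range(len(examples)), key=lambda i: gaps[i])
    return sorted(order[:k])
```

Setting `fraction=0.1` mirrors the 10% budget reported in the abstract; a small gap means the model barely separates the two responses, which is the sense of "difficult" used there.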

[29] The State Of TTS: A Case Study with Human Fooling Rates

Praveen Srinivasa Varadhan,Sherry Thomas,Sai Teja M. S.,Suvrat Bhooshan,Mitesh M. Khapra

Main category: cs.CL

TL;DR: This study introduces the Human Fooling Rate (HFR) metric to evaluate TTS systems' ability to deceive humans, finding that while commercial models perform well, especially in zero-shot settings, overall TTS technology has not yet fully achieved human-level performance in natural conversational speech.

Details

Motivation: The motivation behind the study is to determine whether current TTS systems can truly pass a human deception test similar to the Turing test, and to develop a more accurate metric for evaluating TTS progress. Method: The study introduces a new metric called Human Fooling Rate (HFR) to measure how often machine-generated speech is mistaken for human speech. It involves a large-scale evaluation of both open-source and commercial TTS models using this metric. Result: Key results include the revelation that CMOS-based claims of human parity often fail under deception testing, that evaluating against less expressive human speech sets a low benchmark, that commercial models perform close to human deception in zero-shot settings, and that fine-tuning on high-quality data enhances realism but does not fully match human performance. Conclusion: The study concludes that while there has been rapid progress in TTS technology, current systems still face challenges in achieving human-level performance in natural conversational speech, particularly under rigorous deception testing. The importance of more realistic, human-centric evaluations is emphasized. Abstract: While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. 
Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing, (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar, (iii) Commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; (iv) Fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.
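
As a rough illustration of the metric, HFR can be computed as the share of samples from a given source that listeners judged to be human. The sketch below assumes each judgment is a `(source, judged_human)` pair; the paper's exact listening protocol and aggregation are not described in this entry, and `fooling_rates` is a hypothetical name.

```python
from collections import defaultdict


def fooling_rates(judgments):
    """Return, per source, the fraction of samples judged to be human.

    `judgments` is an iterable of (source, judged_human) pairs, where
    `source` names a TTS system or a human reference set.
    """
    counts = defaultdict(lambda: [0, 0])  # source -> [judged human, total]
    for source, judged_human in judgments:
        counts[source][1] += 1
        if judged_human:
            counts[source][0] += 1
    return {source: human / total for source, (human, total) in counts.items()}
```

Computing the same rate for the human reference recordings gives the ceiling against which a TTS system's HFR can be read, in line with point (ii) of the abstract.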

[30] Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity

Peizheng Guo,Jingyao Wang,Wenwen Qiang,Huijie Guo,Changwen Zheng,Jiahuan Zhou,Gang Hua

Main category: cs.CL

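TL;DR: This paper proposes a reinforcement learning framework based on causal completeness to mitigate hallucinations in multimodal large language models (MLLMs).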

Details

Motivation: Multimodal large language models perform well on vision-language tasks but may hallucinate, generating outputs that are semantically inconsistent with the input image or text. Method: Causal analysis indicates that hallucinations may arise when the model fails to adequately capture essential causal factors or is misled by non-causal cues. The paper therefore proposes a new reinforcement learning framework, based on causal completeness, that jointly considers the causal sufficiency and causal necessity of tokens. Result: Experiments show that the method effectively mitigates hallucinations in MLLMs across multiple benchmark datasets and tasks. Conclusion: The proposed method effectively addresses hallucinations in multimodal large language models and has practical value. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across vision-language tasks. However, they may suffer from hallucinations--generating outputs that are semantically inconsistent with the input image or text. Through causal analyses, we find that: (i) hallucinations with omission may arise from the failure to adequately capture essential causal factors, and (ii) hallucinations with fabrication are likely caused by the model being misled by non-causal cues. To address these challenges, we propose a novel reinforcement learning framework guided by causal completeness, which jointly considers both causal sufficiency and causal necessity of tokens. Specifically, we evaluate each token's standalone contribution and counterfactual indispensability to define a token-level causal completeness reward. This reward is used to construct a causally informed advantage function within the GRPO optimization framework, encouraging the model to focus on tokens that are both causally sufficient and necessary for accurate generation. Experimental results across various benchmark datasets and tasks demonstrate that our approach effectively mitigates hallucinations in MLLMs.

[31] Characterizing Deep Research: A Benchmark and Formal Definition

Abhinav Java,Ashmit Khandelwal,Sukruta Midigeshi,Aaron Halfaker,Amit Deshpande,Navin Goyal,Ankur Gupta,Nagarajan Natarajan,Amit Sharma

Main category: cs.CL

TL;DR: The paper defines deep research as requiring broad, reasoning-intensive exploration and introduces LiveDRBench, a benchmark for evaluating DR systems. Results show varying performance across systems, with room for improvement in search and reasoning capabilities.

Details

Motivation: The motivation is to define and understand the deep research task, which involves complex search and reasoning, and to evaluate its distinction from other reasoning-intensive problems. The goal is to create a formal framework and benchmark for assessing DR systems. Method: The paper proposes a formal characterization of the deep research task and introduces a benchmark called LiveDRBench with 100 challenging tasks. It evaluates DR systems using F1 scores and analyzes reasoning traces to understand system performance and behavior. Result: The paper introduces LiveDRBench, a benchmark with 100 challenging tasks for evaluating DR systems. Across sub-categories, the F1 scores of state-of-the-art systems range from 0.02 to 0.72; OpenAI's model performs best, with an overall F1 score of 0.55. Reasoning traces give insight into system behavior, such as the number of referenced sources and branching and backtracking events. Conclusion: The paper concludes that deep research tasks require broad and reasoning-intensive exploration, not just producing lengthy reports. The introduced benchmark, LiveDRBench, enables objective evaluation of DR systems, revealing current limitations and future directions for improvement. Abstract: Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of deep research -- a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. 
To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search, separating the reasoning challenge from surface-level report generation. Based on this formulation, we propose a diverse benchmark, LiveDRBench, with 100 challenging tasks over scientific topics (e.g., datasets, materials discovery, prior art search) and public interest events (e.g., flight incidents, movie awards). Across state-of-the-art DR systems, F1 score ranges between 0.02 and 0.72 for any sub-category. OpenAI's model performs the best with an overall F1 score of 0.55. Analysis of reasoning traces reveals the distribution over the number of referenced sources, branching, and backtracking events executed by current DR systems, motivating future directions for improving their search mechanisms and grounding capabilities. The benchmark is available at https://github.com/microsoft/LiveDRBench.

[32] Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Siddhant Panpatil,Hiskias Dingeto,Haon Park

Main category: cs.CL

TL;DR: The paper investigates the vulnerabilities of advanced language models through manual and automated testing, revealing the need for improved alignment strategies to prevent manipulation.

Details

Motivation: The motivation behind this study is to explore the vulnerabilities of state-of-the-art language models in handling narrative immersion, emotional pressure, and strategic framing despite significant advances in alignment techniques. Method: The paper utilizes systematic manual red-teaming with Claude-4-Opus to identify vulnerabilities. It also employs an automated evaluation framework named MISALIGNMENTBENCH for reproducible testing across multiple models. Result: The study found a high vulnerability rate with different models showing significant variation in susceptibility. GPT-4.1 had a 90% susceptibility rate, while Claude-4-Sonnet demonstrated a greater resistance at 40%. Conclusion: This paper concludes that current alignment strategies in language models have critical gaps and are vulnerable to conversational manipulation. This highlights the need for more robust strategies in future AI systems. Abstract: Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple models. 
Cross-model evaluation of our 10 scenarios against five frontier LLMs revealed an overall 76% vulnerability rate, with significant variations: GPT-4.1 showed the highest susceptibility (90%), while Claude-4-Sonnet demonstrated greater resistance (40%). Our findings demonstrate that sophisticated reasoning capabilities often become attack vectors rather than protective mechanisms, as models can be manipulated into complex justifications for misaligned behavior. This work provides (i) a detailed taxonomy of conversational manipulation patterns and (ii) a reusable evaluation framework. Together, these findings expose critical gaps in current alignment strategies and highlight the need for robustness against subtle, scenario-based manipulation in future AI systems.

[33] Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts

Millicent Ochieng,Anja Thieme,Ignatius Ezeani,Risa Ueno,Samuel Maina,Keshet Ronen,Javier Gonzalez,Jacki O'Neill

Main category: cs.CL

TL;DR: This paper proposes a culturally sensitive framework for sentiment analysis in low-resource contexts, evaluating how well large language models handle nuanced, informal communication from Nairobi youth groups, highlighting the need for improved AI evaluation methods that account for cultural and contextual complexity.

Details

Motivation: The motivation is to address challenges in conventional NLP approaches that assume fixed sentiment labels and universal affective expressions, particularly in low-resource, culturally nuanced contexts like informal, code-mixed WhatsApp messages from Nairobi youth health groups. Method: The study uses a diagnostic framework treating sentiment as context-dependent and culturally embedded. It combines human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation to assess LLMs' interpretability, robustness, and alignment with human reasoning. Result: The findings show significant variation in model reasoning quality, with top-tier LLMs demonstrating interpretive stability, while open models struggle with ambiguity or sentiment shifts. Conclusion: The study concludes that sentiment analysis in culturally nuanced, low-resource settings requires culturally sensitive and reasoning-aware AI evaluation, as top-tier LLMs perform better in interpretive stability compared to open models. Abstract: Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using a combination of human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social-science measurement lens, we operationalize and interrogate LLM outputs as an instrument for measuring the abstract concept of sentiment. 
Our findings reveal significant variation in model reasoning quality, with top-tier LLMs demonstrating interpretive stability, while open models often falter under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.

[34] ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

Yuquan Wang,Mi Zhang,Yining Wang,Geng Hong,Xiaoyu You,Min Yang

Main category: cs.CL

TL;DR: ReasoningGuard is a new, efficient safeguard for Large Reasoning Models that identifies critical points in reasoning and injects safety checks to prevent harmful outputs, outperforming existing methods.

Details

Motivation: Large Reasoning Models are vulnerable to harmful content generation, especially in mid-to-late reasoning steps, and current defense mechanisms are limited due to their cost and reliance on expert knowledge. Method: ReasoningGuard uses the model's internal attention behavior to identify critical points in the reasoning path and injects safety 'aha moments' to steer the reasoning process. It also employs a scaling sampling strategy during decoding to choose the optimal reasoning path. Result: ReasoningGuard successfully mitigates three types of jailbreak attacks, outperforms seven existing safeguards, and achieves state-of-the-art safety defenses while avoiding exaggerated safety issues. Conclusion: ReasoningGuard is an effective and scalable inference-time safeguard for Large Reasoning Models that mitigates harmful content generation without requiring additional fine-tuning or expert knowledge. Abstract: Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Existing defense mechanisms, however, rely on costly fine-tuning and additional expert knowledge, which restricts their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs, which injects timely safety aha moments to steer the reasoning process to be harmless while remaining helpful. Leveraging the model's internal attention behavior, our approach accurately identifies critical points in the reasoning path, and triggers spontaneous, safety-oriented reflection. To safeguard both the subsequent reasoning steps and the final answers, we further implement a scaling sampling strategy during the decoding phase, selecting the optimal reasoning path. 
Inducing minimal extra inference cost, ReasoningGuard effectively mitigates three types of jailbreak attacks, including the latest ones targeting the reasoning process of LRMs. Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defenses while effectively avoiding the common exaggerated safety issues.

[35] Hierarchical Text Classification Using Black Box Large Language Models

Kosuke Yoshimura,Hisashi Kashima

Main category: cs.CL

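TL;DR: This study examines the feasibility of hierarchical text classification (HTC) with large language models (LLMs). Although LLMs are costly on deep hierarchies, in few-shot settings they outperform traditional machine learning methods, particularly on deeper label hierarchies. The study stresses the importance of balancing performance against computational cost.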

Details

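Motivation: Hierarchical text classification (HTC) faces challenges from data scarcity and model complexity. This study explores whether black box large language models (LLMs) accessed via APIs are a feasible alternative that avoids the extensive labeled data and computational resources required by traditional machine learning methods. Method: The study evaluates the accuracy and cost-effectiveness of three prompting strategies, Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH), in zero-shot and few-shot settings. Experiments on two datasets compare the strategies. Result: Few-shot settings consistently improve classification accuracy over zero-shot settings. On a dataset with a shallow hierarchy, a traditional machine learning model achieves high accuracy, but on a dataset with a deeper hierarchy, LLMs (especially with the DH strategy) perform better. However, because the DH strategy requires more input tokens for deeper label hierarchies, API costs increase significantly. Conclusion: Although using LLMs for HTC incurs higher API costs, LLMs (especially with the DH strategy) tend to outperform traditional machine learning models on datasets with deeper hierarchies. Few-shot settings also generally yield higher classification accuracy than zero-shot settings. The study highlights the trade-off between performance and cost across prompting strategies and notes that this balance must be weighed carefully when choosing one. Abstract: Hierarchical Text Classification (HTC) aims to assign texts to structured label hierarchies; however, it faces challenges due to data scarcity and model complexity. This study explores the feasibility of using black box Large Language Models (LLMs) accessed via APIs for HTC, as an alternative to traditional machine learning methods that require extensive labeled data and computational resources. We evaluate three prompting strategies -- Direct Leaf Label Prediction (DL), Direct Hierarchical Label Prediction (DH), and Top-down Multi-step Hierarchical Label Prediction (TMH) -- in both zero-shot and few-shot settings, comparing the accuracy and cost-effectiveness of these strategies. Experiments on two datasets show that a few-shot setting consistently improves classification accuracy compared to a zero-shot setting. While a traditional machine learning model achieves high accuracy on a dataset with a shallow hierarchy, LLMs, especially with the DH strategy, tend to outperform the machine learning model on a dataset with a deeper hierarchy. API costs increase significantly due to the higher input tokens required for deeper label hierarchies under the DH strategy. These results emphasize the trade-off between accuracy improvement and the computational cost of prompt strategy. These findings highlight the potential of black box LLMs for HTC while underscoring the need to carefully select a prompt strategy to balance performance and cost.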
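
One way to picture the three strategies is by the label options each one places in the prompt. The sketch below is an interpretation based only on the strategy names in this entry: the toy `HIERARCHY`, the option formats, and the prompt wording are all assumptions rather than the paper's actual prompts.

```python
# Toy label hierarchy (hypothetical); empty dicts mark leaf labels.
HIERARCHY = {
    "Science": {"Physics": {}, "Biology": {}},
    "Sports": {"Football": {}, "Tennis": {}},
}


def leaf_paths(tree, prefix=()):
    """Enumerate every root-to-leaf path as a tuple of labels."""
    paths = []
    for label, children in tree.items():
        path = prefix + (label,)
        if children:
            paths.extend(leaf_paths(children, path))
        else:
            paths.append(path)
    return paths


def dl_options(tree):
    """Direct leaf prediction: offer only the leaf labels."""
    return [path[-1] for path in leaf_paths(tree)]


def dh_options(tree):
    """Direct hierarchical prediction: offer full root-to-leaf paths."""
    return [" > ".join(path) for path in leaf_paths(tree)]


def tmh_options(tree, chosen_so_far):
    """Top-down multi-step: offer only the children of the current node."""
    node = tree
    for label in chosen_so_far:
        node = node[label]
    return list(node)


def build_prompt(text, options):
    """Assemble a single classification prompt (wording is illustrative)."""
    listed = "\n".join(f"- {option}" for option in options)
    return (
        "Classify the text into exactly one of these labels:\n"
        f"{listed}\n\nText: {text}\nLabel:"
    )
```

Under this reading, the DH prompt repeats every ancestor label for every leaf, so its length grows with hierarchy depth, which is consistent with the higher input-token cost the entry reports for DH on deeper hierarchies.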

[36] DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting

Chanjuan Liu,Shengzhi Wang,Enqiang Zhu

Main category: cs.CL

TL;DR: The paper proposes DP-GPT4MTS, a dual-prompt large language model framework that effectively integrates textual and numerical time series data, significantly improving forecasting accuracy.

Details

Motivation: Traditional forecasting models overlook important textual information such as events and news, which can significantly affect forecasting accuracy. Existing single-prompt frameworks also struggle to effectively capture the semantics of timestamped text. Method: A dual-prompt large language model framework that combines explicit and textual prompts for integrating multimodal data in time series forecasting. Result: The DP-GPT4MTS approach outperforms state-of-the-art algorithms in time series forecasting on diverse textual-numerical time series datasets. Conclusion: Incorporating textual context through a dual-prompt mechanism significantly improves the accuracy of time series predictions. Abstract: Time series forecasting is crucial in strategic planning and decision-making across various industries. Traditional forecasting models mainly concentrate on numerical time series data, often overlooking important textual information such as events and news, which can significantly affect forecasting accuracy. While large language models offer promise for integrating multimodal data, existing single-prompt frameworks struggle to effectively capture the semantics of timestamped text, introducing redundant information that can hinder model performance. To address this limitation, we introduce DP-GPT4MTS (Dual-Prompt GPT2-base for Multimodal Time Series), a novel dual-prompt large language model framework that combines two complementary prompts: an explicit prompt for clear task instructions and a textual prompt for context-aware embeddings from time-stamped data. The tokenizer generates the explicit prompt while the embeddings from the textual prompt are refined through self-attention and feed-forward networks. Comprehensive experiments conducted on diverse textual-numerical time series datasets demonstrate that this approach outperforms state-of-the-art algorithms in time series forecasting. 
This highlights the significance of incorporating textual context via a dual-prompt mechanism to achieve more accurate time series predictions.

[37] TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening

Xi Wang,Anxo Perez,Javier Parapar,Fabio Crestani

Main category: cs.CL

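TL;DR: This paper proposes TalkDep, a language-model-based simulated patient system for improving the training and evaluation of automatic depression diagnosis models.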

Details

Motivation: Demand for mental health services has outpaced the availability of real training data, limiting support for depression diagnosis and motivating the development of clinically valid simulated patients. Method: TalkDep builds a simulated patient pipeline on advanced language models by conditioning on psychiatric diagnostic criteria, symptom severity scales, and contextual factors, and validates it through assessment by clinical professionals. Result: TalkDep generates diverse, natural, and clinically valid patient symptom presentations, and thorough assessment by clinical professionals shows good reliability. Conclusion: TalkDep offers a novel simulated patient pipeline built on advanced language models that produces clinically valid patient responses and helps improve the robustness and generalisability of automatic depression diagnosis systems. Abstract: The increasing demand for mental health services has outpaced the availability of real training data to develop clinical professionals, leading to limited support for the diagnosis of depression. This shortage has motivated the development of simulated or virtual patients to assist in training and evaluation, but existing approaches often fail to generate clinically valid, natural, and diverse symptom presentations. In this work, we embrace the recent advanced language models as the backbone and propose a novel clinician-in-the-loop patient simulation pipeline, TalkDep, with access to diversified patient profiles to develop simulated patients. By conditioning the model on psychiatric diagnostic criteria, symptom severity scales, and contextual factors, our goal is to create authentic patient responses that can better support diagnostic model training and evaluation. We verify the reliability of these simulated patients with thorough assessments conducted by clinical professionals. The availability of validated simulated patients offers a scalable and adaptable resource for improving the robustness and generalisability of automatic depression diagnosis systems.

[38] KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs

Zunhai Su,Kehong Yuan

Main category: cs.CL

TL;DR: This paper introduces KVSink, a method for better preserving attention sinks during KV cache quantization, leading to improved efficiency and performance in LLM inference.

Details

Motivation: KV cache quantization is crucial for efficient LLM inference, but current methods inadequately preserve attention sinks, leading to performance degradation. Method: The study analyzes the mechanisms of attention sinks and their interaction with KV cache quantization, introducing KVSink, a novel plug-and-play method for sink token prediction. Result: Experiments show that KVSink surpasses the PFN strategy in preserving attention sinks, enhances KVQuant performance, and reduces dependency on 16-bit outliers. Conclusion: KVSink is an effective method for predicting and preserving attention sinks during KV cache quantization, outperforming the PFN strategy and improving performance metrics like perplexity. Abstract: Key-Value (KV) cache quantization has become a widely adopted optimization technique for efficient large language model (LLM) inference by reducing KV cache memory usage and mitigating memory-bound constraints. Recent studies have emphasized the importance of preserving the original precision of KVs for the first few tokens to ensure the protection of attention sinks. While this approach has proven effective in mitigating performance degradation, its underlying principles remain insufficiently understood. Moreover, it fails to address the recent discovery that attention sinks can emerge beyond the initial token positions. In this work, we elucidate the underlying mechanisms of attention sinks during inference by examining their role in the cross-layer evolution of extreme activation outliers. Additionally, we provide a comprehensive analysis of the interplay between attention sinks and KV cache quantization. Based on our enhanced understanding, we introduce KVSink, a plug-and-play method that effectively predicts sink tokens with negligible overhead, enabling more thorough preservation. 
Extensive experiments demonstrate that KVSink outperforms the existing Preserve-First-N (PFN) strategy, offering more effective preservation of attention sinks during KV cache quantization. Moreover, when applied to the well-established KVQuant method, KVSink further improves perplexity (PPL) and reduces reliance on 16-bit numerical outliers.
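
The idea of keeping selected token positions at full precision while quantizing the rest can be shown with a toy example. This is not KVSink or KVQuant: the per-token min-max quantizer, the list-of-lists cache layout, and the function names are assumptions made for illustration. Preserve-First-N corresponds to passing the first N positions; a sink predictor would instead supply whichever positions it flags.

```python
def fake_quantize(vector, bits=2):
    """Uniform min-max quantization of one token's vector, then dequantization."""
    low, high = min(vector), max(vector)
    if high == low:
        return list(vector)  # constant vector: nothing to quantize
    levels = (1 << bits) - 1
    scale = (high - low) / levels
    return [low + round((value - low) / scale) * scale for value in vector]


def preserve_first_n(n):
    """Preserve-First-N: protect the first n token positions."""
    return set(range(n))


def quantize_cache(cache, preserved_positions, bits=2):
    """Quantize every token's vector except those at preserved positions."""
    return [
        list(token) if index in preserved_positions else fake_quantize(token, bits)
        for index, token in enumerate(cache)
    ]
```

Because `quantize_cache` takes an arbitrary set of positions, the same call covers sinks that appear beyond the initial tokens, which is the gap in first-N preservation that the abstract points out.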

[39] ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

Jiangyuan Wang,Kejun Xiao,Qi Sun,Huaipeng Zhao,Tao Luo,Jiandong Zhang,Xiaoyi Zeng

Main category: cs.CL

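TL;DR: ShoppingBench is a new end-to-end shopping benchmark for evaluating how well language agents handle complex user intents; the paper also proposes an effective training method that enables a smaller agent to achieve performance competitive with GPT-4.1.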

Details

Motivation: Existing e-commerce benchmarks focus mainly on basic user intents, such as finding or purchasing products, whereas real-world users often have more complex goals. ShoppingBench is proposed to close this gap. Method: The researchers design a scalable framework that simulates user instructions based on real products and diverse intents. They also propose a trajectory distillation strategy and combine supervised fine-tuning with reinforcement learning to train a smaller language agent. Result: Experiments show that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates below 50% on ShoppingBench tasks, underscoring how challenging the benchmark is. Conclusion: ShoppingBench highlights the shortcomings of current language agents on complex shopping tasks, and the paper proposes an effective training method that enables a smaller agent to achieve performance competitive with GPT-4.1. Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-product sellers. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by our ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

[40] A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models

Jiayi Wen,Tianxin Chen,Zhirun Zheng,Cheng Huang

Main category: cs.CL

TL;DR: This paper demonstrates that GraphRAG systems are vulnerable to knowledge poisoning attacks that can significantly mislead reasoning, with current defenses being largely ineffective.

Details

Motivation: GraphRAG enhances LLMs by converting text into knowledge graphs, but its vulnerability to attacks that implant misleading information remains understudied. Method: Two knowledge poisoning attacks (KPAs) were proposed and tested: Targeted KPA (TKPA) and Universal KPA (UKPA), both aiming to manipulate the knowledge graph construction process. Result: TKPA achieved a 93.1% success rate in manipulating QA outcomes, while UKPA reduced QA accuracy from 95% to 50% with minimal text modification. Conclusion: Securing GraphRAG pipelines against knowledge poisoning is largely unexplored and requires further research. Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, improving both accuracy and explainability. However, GraphRAG relies on LLMs to extract knowledge from raw text during graph construction, and this process can be maliciously manipulated to implant misleading information. Targeting this attack surface, we propose two knowledge poisoning attacks (KPAs) and demonstrate that modifying only a few words in the source text can significantly change the constructed graph, poison the GraphRAG, and severely mislead downstream reasoning. The first attack, named Targeted KPA (TKPA), utilizes graph-theoretic analysis to locate vulnerable nodes in the generated graphs and rewrites the corresponding narratives with LLMs, achieving precise control over specific question-answering (QA) outcomes with a success rate of 93.1%, while keeping the poisoned text fluent and natural. The second attack, named Universal KPA (UKPA), exploits linguistic cues such as pronouns and dependency relations to disrupt the structural integrity of the generated graph by altering globally influential words. With fewer than 0.05% of full text modified, the QA accuracy collapses from 95% to 50%. 
Furthermore, experiments show that state-of-the-art defense methods fail to detect these attacks, highlighting that securing GraphRAG pipelines against knowledge poisoning remains largely unexplored.

[41] Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

Zizhan Ma,Wenxuan Wang,Guo Yu,Yiu-Fai Cheung,Meidan Ding,Jie Liu,Wenting Chen,Linlin Shen

Main category: cs.CL

TL;DR: MedCheck is a new framework for evaluating medical AI benchmarks that identifies major flaws in current practices and offers actionable guidelines for improvement.

Details

Motivation: Current benchmarks for large language models in healthcare lack clinical fidelity, robust data management, and safety-oriented evaluation metrics, raising concerns about their reliability. Method: The study introduces MedCheck, a lifecycle-oriented assessment framework with 46 medically-tailored criteria, and applies it to evaluate 53 medical LLM benchmarks across five development stages. Result: The analysis uncovered systemic issues in existing benchmarks, including a disconnect from clinical practice, data integrity problems, and neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Conclusion: MedCheck serves as both a diagnostic tool for existing benchmarks and a guideline for more standardized, reliable, and transparent evaluation of AI in healthcare. Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark's development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. 
Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.

[42] Modelling and Classifying the Components of a Literature Review

Francisco Bolaños,Angelo Salatino,Francesco Osborne,Enrico Motta

Main category: cs.CL

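TL;DR: This paper proposes a new annotation schema to support literature review generation and evaluates 37 LLMs on it. Fine-tuned models perform strongly, some open-source models also do well, and semi-synthetic data augmentation significantly improves model performance.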

Details

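Motivation: Prior work shows that AI methods for analysing scientific literature improve significantly when sentences in papers are annotated with their rhetorical roles (such as research gaps, results, and limitations). Such representations could also support a new generation of systems for producing high-quality literature reviews. Achieving this, however, requires defining a relevant annotation schema and strategies for large-scale annotation. Method: The paper proposes a new annotation schema to support literature review generation and presents Sci-Sentence, a benchmark dataset of 700 sentences manually annotated by experts and 2,240 automatically labelled sentences. The authors comprehensively evaluate 37 LLMs spanning different model families and sizes, using both zero-shot learning and fine-tuning. Result: Experiments show that current-generation LLMs perform very well on this task when fine-tuned on high-quality data, with F1 scores above 96%. Large proprietary models such as GPT-4o perform best, but some lightweight open-source models also perform excellently. In addition, augmenting the training data with LLM-generated semi-synthetic examples significantly improves the performance of small encoder and decoder models. Conclusion: The paper concludes that introducing a new annotation schema and evaluating state-of-the-art large language models (LLMs) on classifying rhetorical roles can effectively advance literature review generation. It also highlights that some lightweight open-source models perform well and that augmenting training data with semi-synthetic examples significantly improves model performance. Abstract: Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges by 1) introducing a novel annotation schema specifically designed to support literature review generation and 2) conducting a comprehensive evaluation of a wide range of state-of-the-art large language models (LLMs) in classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments yield several novel insights that advance the state of the art in this challenging domain. 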
First, the current generation of LLMs performs remarkably well on this task when fine-tuned on high-quality data, achieving performance levels above 96% F1. Second, while large proprietary models like GPT-4o achieve the best results, some lightweight open-source alternatives also demonstrate excellent performance. Finally, enriching the training data with semi-synthetic examples generated by LLMs proves beneficial, enabling small encoders to achieve robust results and significantly enhancing the performance of several open decoder models.

[43] GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Hongze Tan,Jianfei Pan

Main category: cs.CL

TL;DR: This paper introduces Dynamic Entropy Weighting via GTPO and GRPO-S to enable fine-grained reward signals in reinforcement learning for LLMs, significantly improving reasoning performance compared to existing methods.

Details

Motivation: Reinforcement learning methods like GRPO suffer from coarse-grained credit assignment by applying uniform rewards to all tokens, which limits performance in long-chain reasoning tasks. This paper aims to address this limitation. Method: The paper introduces Dynamic Entropy Weighting through two methods: Group Token Policy Optimization (GTPO), which assigns entropy-weighted rewards to individual tokens, and Sequence-Level Group Relative Policy Optimization (GRPO-S), which applies entropy-weighted rewards at the sequence level based on average token entropy. Result: Experiments demonstrate that the proposed entropy-weighted methods significantly outperform the DAPO baseline, with the entropy-weighting mechanism identified as the key factor driving improved performance. Conclusion: The paper concludes that the entropy-weighting mechanism significantly enhances deep reasoning in models, offering a more effective approach compared to existing methods. Abstract: Reinforcement learning (RL) with algorithms like Group Relative Policy Optimization (GRPO) improves Large Language Model (LLM) reasoning, but is limited by a coarse-grained credit assignment that applies a uniform reward to all tokens in a sequence. This is a major flaw in long-chain reasoning tasks. This paper solves this with Dynamic Entropy Weighting. Our core idea is that high-entropy tokens in correct responses can guide the policy toward a higher performance ceiling. This allows us to create more fine-grained reward signals for precise policy updates in two ways: 1) Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token for fine-grained credit assignment; 2) Sequence-Level Group Relative Policy Optimization (GRPO-S), which assigns an entropy-weighted reward to each sequence based on its average token entropy. Experiments show our methods significantly outperform the strong DAPO baseline. 
The results confirm that our entropy-weighting mechanism is the key driver of this performance boost, offering a better path to enhance deep reasoning in models.
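
As a rough illustration of entropy-weighted credit assignment, the sketch below splits a sequence-level reward across tokens in proportion to each token's policy entropy, and also computes the average token entropy that a sequence-level variant could use. The abstract does not give the exact formulas, so the normalization, the uniform fallback, and the function names `token_entropy`, `entropy_weighted_token_rewards`, and `sequence_entropy_weight` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_token_rewards(seq_reward, token_dists):
    """Split a sequence-level reward across tokens in proportion to entropy.

    Assumption: weights are normalized entropies, so the shaped rewards sum
    to the original sequence reward. The paper's exact formula may differ.
    """
    entropies = [token_entropy(d) for d in token_dists]
    total = sum(entropies)
    n = len(token_dists)
    if total == 0.0:
        # Every distribution is deterministic: fall back to a uniform split.
        return [seq_reward / n] * n
    return [seq_reward * h / total for h in entropies]


def sequence_entropy_weight(token_dists):
    """Average token entropy of a sequence (a sequence-level weighting signal)."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


# Toy example: the first token is certain, the second is a coin flip.
dists = [[1.0, 0.0], [0.5, 0.5]]
rewards = entropy_weighted_token_rewards(1.0, dists)
avg_entropy = sequence_entropy_weight(dists)
```

In this toy case all of the reward goes to the uncertain token, which is the contrast with a uniform per-token reward that the abstract describes.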

[44] Chain of Questions: Guiding Multimodal Curiosity in Language Models

Nima Iji,Kia Dashtipour

Main category: cs.CL

TL;DR: This paper introduces the Chain of Questions (CoQ) framework, which improves reasoning in multimodal language models by enabling them to generate targeted questions and selectively use relevant sensory modalities, leading to better performance across diverse tasks.

Details

Motivation: While reasoning capabilities in large language models have advanced through methods like chain-of-thought, these improvements have not been fully extended to multimodal contexts, where models must proactively decide which sensory modalities to use in complex environments. Method: The authors introduced the Chain of Questions (CoQ) framework, which encourages multimodal language models to generate targeted questions to guide the selection and integration of relevant sensory modalities. They evaluated the framework on a novel multimodal benchmark dataset created by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Result: The experimental results showed that the CoQ method improves a foundation model's ability to identify and integrate relevant sensory information, resulting in enhanced accuracy, interpretability, and alignment of the reasoning process with multimodal tasks. Conclusion: The CoQ framework enhances the reasoning capabilities of multimodal language models by enabling them to dynamically generate targeted questions and selectively activate relevant sensory modalities, leading to improved accuracy, interpretability, and alignment with diverse multimodal tasks. Abstract: Reasoning capabilities in large language models (LLMs) have substantially advanced through methods such as chain-of-thought and explicit step-by-step explanations. However, these improvements have not yet fully transitioned to multimodal contexts, where models must proactively decide which sensory modalities such as vision, audio, or spatial perception to engage when interacting with complex real-world environments. In this paper, we introduce the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach that encourages multimodal language models to dynamically generate targeted questions regarding their surroundings. 
These generated questions guide the model to selectively activate relevant modalities, thereby gathering critical information necessary for accurate reasoning and response generation. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results demonstrate that our CoQ method improves a foundation model's ability to effectively identify and integrate pertinent sensory information. This leads to improved accuracy, interpretability, and alignment of the reasoning process with diverse multimodal tasks.

[45] AIC CTU@FEVER 8: On-premise fact checking through long context RAG

Herbert Ullrich,Jan Drchal

Main category: cs.CL

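TL;DR: The paper describes a simple two-step RAG fact-checking pipeline that took first place in the FEVER 8 shared task and shows how state-of-the-art fact-checking performance can be achieved with limited hardware resources.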

Details

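Motivation: After taking first place in the FEVER 8 shared task, the authors show how the pipeline can be redeployed on-premise. Method: A simple two-step RAG pipeline based on the authors' submission from the previous year. Result: The fact-checking system took first place in the FEVER 8 shared task. Conclusion: The fact-checking pipeline achieves state-of-the-art fact-checking performance (in terms of Ev2R test score) under the constraints of a single NVIDIA A10 GPU, 23GB of GPU memory, and 60 seconds of running time per claim. Abstract: In this paper, we present our fact-checking pipeline, which scored first in the FEVER 8 shared task. Our fact-checking system is a simple two-step RAG pipeline based on our last year's submission. We show how the pipeline can be redeployed on-premise, achieving state-of-the-art fact-checking performance (in the sense of Ev2R test score), even under the constraint of a single NVIDIA A10 GPU, 23GB of GPU memory and 60s running time per claim.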

[46] Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky

Xu Zhang,Mei Chen

Main category: cs.CL

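TL;DR: The study compares natural language processing models to assess how crash data quality can be improved. Fine-tuned transformer models strike the best balance between accuracy and efficiency, while large language models offer an advantage in recall at a higher computational cost.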

Details

Motivation: To improve crash data quality by mining crash narratives with natural language processing, using secondary crash identification in Kentucky as a case study. Method: Three model classes are compared: zero-shot open-source large language models (LLMs), fine-tuned transformers (such as BERT and RoBERTa), and traditional logistic regression as a baseline. Models were calibrated on 2015-2021 data and tested on 1,771 crash narratives from 2022. Result: Fine-tuned transformers performed best, with RoBERTa achieving the highest F1-score (0.90) and 95% accuracy. Zero-shot LLaMA3:70B reached an F1 of 0.86 but needed 139 minutes of inference. The logistic regression baseline performed worst (F1: 0.66). Some LLMs (e.g., GEMMA3:27B) excelled in recall (0.94) but at high computational cost (e.g., 723 minutes for DeepSeek-R1:70B). Conclusion: The results show that fine-tuned transformer models strike the best balance between accuracy and efficiency, while large language models (LLMs) excel in recall but carry a higher computational cost. The study stresses the importance of privacy protection, ensemble approaches, and incremental processing in practical deployment. Abstract: This study evaluates advanced natural language processing (NLP) techniques to enhance crash data quality by mining crash narratives, using secondary crash identification in Kentucky as a case study. Drawing from 16,656 manually reviewed narratives from 2015-2022, with 3,803 confirmed secondary crashes, we compare three model classes: zero-shot open-source large language models (LLMs) (LLaMA3:70B, DeepSeek-R1:70B, Qwen3:32B, Gemma3:27B); fine-tuned transformers (BERT, DistilBERT, RoBERTa, XLNet, Longformer); and traditional logistic regression as baseline. Models were calibrated on 2015-2021 data and tested on 1,771 narratives from 2022. Fine-tuned transformers achieved superior performance, with RoBERTa yielding the highest F1-score (0.90) and accuracy (95%). Zero-shot LLaMA3:70B reached a comparable F1 of 0.86 but required 139 minutes of inference; the logistic baseline lagged well behind (F1:0.66). LLMs excelled in recall for some variants (e.g., GEMMA3:27B at 0.94) but incurred high computational costs (up to 723 minutes for DeepSeek-R1:70B), while fine-tuned models processed the test set in seconds after brief training. Further analysis indicated that mid-sized LLMs (e.g., DeepSeek-R1:32B) can rival larger counterparts in performance while reducing runtime, suggesting opportunities for optimized deployments. 
Results highlight trade-offs between accuracy, efficiency, and data requirements, with fine-tuned transformer models balancing precision and recall effectively on Kentucky data. Practical deployment considerations emphasize privacy-preserving local deployment, ensemble approaches for improved accuracy, and incremental processing for scalability, providing a replicable scheme for enhancing crash-data quality with advanced NLP.
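
The precision/recall trade-off mentioned above can be made concrete with a small sketch of how precision, recall, and F1 follow from confusion counts. The counts below are invented purely for illustration and are not taken from the study; `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical classifier A: balanced errors.
balanced = precision_recall_f1(tp=90, fp=10, fn=10)

# Hypothetical classifier B: catches more positives but raises more false alarms.
high_recall = precision_recall_f1(tp=94, fp=40, fn=6)
```

With these made-up counts, classifier B has the higher recall but the lower F1, which is the kind of pattern that lets a model lead on recall without leading on F1.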

[47] Why are LLMs' abilities emergent?

Vladimír Havlík

Main category: cs.CL

TL;DR: This paper examines the emergent properties of Deep Neural Networks (DNNs) through theoretical analysis and empirical observation, arguing that understanding LLM capabilities requires recognizing DNNs as a new domain of complex dynamical systems governed by universal principles of emergence.

Details

Motivation: The motivation of the paper is to address the epistemological challenge of 'creation without understanding' that characterizes contemporary AI development and to understand the fundamental nature of emergence in DNNs. Method: The paper uses theoretical analysis and empirical observation to explore the emergent properties of DNNs, analyzing scaling laws, grokking phenomena, and phase transitions in model capabilities. Result: The paper demonstrates that emergent abilities in DNNs arise from the complex dynamics of highly sensitive nonlinear systems and exhibit genuine emergent properties analogous to those found in other complex natural phenomena. Conclusion: The paper concludes that understanding LLM capabilities requires recognizing DNNs as a new domain of complex dynamical systems governed by universal principles of emergence, similar to those operating in physics, chemistry, and biology. Abstract: The remarkable success of Large Language Models (LLMs) in generative tasks has raised fundamental questions about the nature of their acquired capabilities, which often appear to emerge unexpectedly without explicit training. This paper examines the emergent properties of Deep Neural Networks (DNNs) through both theoretical analysis and empirical observation, addressing the epistemological challenge of "creation without understanding" that characterises contemporary AI development. We explore how the neural approach's reliance on nonlinear, stochastic processes fundamentally differs from symbolic computational paradigms, creating systems whose macro-level behaviours cannot be analytically derived from micro-level neuron activities. Through analysis of scaling laws, grokking phenomena, and phase transitions in model capabilities, I demonstrate that emergent abilities arise from the complex dynamics of highly sensitive nonlinear systems rather than simply from parameter scaling alone. 
My investigation reveals that current debates over metrics, pre-training loss thresholds, and in-context learning miss the fundamental ontological nature of emergence in DNNs. I argue that these systems exhibit genuine emergent properties analogous to those found in other complex natural phenomena, where systemic capabilities emerge from cooperative interactions among simple components without being reducible to their individual behaviours. The paper concludes that understanding LLM capabilities requires recognising DNNs as a new domain of complex dynamical systems governed by universal principles of emergence, similar to those operating in physics, chemistry, and biology. This perspective shifts the focus from purely phenomenological definitions of emergence to understanding the internal dynamic transformations that enable these systems to acquire capabilities that transcend their individual components.

[48] What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems

Kiyotada Mori,Seiya Kawano,Chaoran Liu,Carlos Toshinori Ishi,Angel Fernando Garcia Contreras,Koichiro Yoshino

Main category: cs.CL

TL;DR: This paper explores how human selective listening in dialogue response generation can inform a new method for evaluating ASR systems in spoken dialogue contexts.

Details

Motivation: Understanding selective listening in humans can help identify and evaluate the necessary ASR capabilities for spoken dialogue systems (SDSs) more effectively. Method: The study compares human transcriptions for generating dialogue responses with reference transcriptions to experimentally confirm selective listening in humans. Result: The experimental results confirm that humans exhibit selective listening when generating dialogue responses, focusing on important parts of speech. Conclusion: Selective listening ability of humans can be leveraged to develop a new ASR evaluation method that identifies the gap between ASR systems and humans in focusing on important parts of speech. Abstract: Spoken dialogue systems (SDSs) utilize automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in SDSs is to appropriately recognize information in user speech that is relevant to response generation. Examining human selective listening, which refers to the ability to focus on and listen to important parts of a conversation during speech, will enable us to identify the ASR capabilities required for SDSs and evaluate them. In this study, we experimentally confirmed selective listening when humans generate dialogue responses by comparing human transcriptions for generating dialogue responses and reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening, which can identify the gap in transcription ability between ASR systems and humans.

[49] Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model

Kiyotada Mori,Seiya Kawano,Angel Fernando Garcia Contreras,Koichiro Yoshino

Main category: cs.CL

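TL;DR: This study proposes a prediction confidence model (PCM) that decides whether prefetching is possible by estimating semantic similarity, with the goal of reducing user-perceived latency (UPL).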

Details

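Motivation: Reducing user-perceived latency (UPL) is an important goal in spoken dialogue systems, typically pursued by using a language model to predict the complete user utterance so that dialogue responses can be prefetched. Method: The PCM estimates the semantic similarity between the predicted complete user utterance and the actual complete user utterance to decide whether prefetching is possible. Result: The proposed PCM was evaluated based on the differences between the predicted complete user utterance and the complete user utterance. Conclusion: The study proposes a prediction confidence model (PCM) that determines whether prefetching is possible, with the aim of reducing user-perceived latency (UPL). Abstract: Prefetching of dialogue responses has been investigated to reduce user-perceived latency (UPL), which refers to the user's waiting time before receiving the system's response, in spoken dialogue systems. To reduce the UPL, it is necessary to predict complete user utterances before the end of the user's speech, typically by language models, to prepare prefetched dialogue responses. In this study, we proposed a prediction confidence model (PCM) that determines whether prefetching is possible or not by estimating the semantic similarity between the predicted complete user utterance and the complete user utterance. We evaluated our PCM based on the differences between the predicted complete user utterance and the complete user utterance.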
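
A minimal sketch of the prefetch decision, assuming utterances are already embedded as vectors and that prefetching is allowed when the estimated similarity clears a threshold. The abstract does not describe how the PCM produces its estimate, so the cosine measure, the toy embeddings, the 0.8 threshold, and the names `cosine_similarity` and `should_prefetch` are illustrative assumptions only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def should_prefetch(estimated_similarity, threshold=0.8):
    """Prefetch only when the estimated similarity is high enough.

    Assumption: a fixed threshold; a real system would tune this value.
    """
    return estimated_similarity >= threshold


# Toy embeddings (hypothetical) for the predicted and the actual utterance.
predicted_utterance = [0.9, 0.1, 0.0]
actual_utterance = [1.0, 0.0, 0.0]

similarity = cosine_similarity(predicted_utterance, actual_utterance)
decision = should_prefetch(similarity)
```

At run time the actual utterance is not yet available, so a deployed model would have to estimate this similarity from the partial input; the direct comparison here corresponds only to an offline check.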

[50] Evaluating, Synthesizing, and Enhancing for Customer Support Conversation

Jie Zhu,Huaixia Dou,Junhui Li,Lifan Guo,Feng Chen,Chi Zhang,Fang Kong

Main category: cs.CL

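TL;DR: The paper proposes a structured customer support conversation framework and associated datasets; training LLMs on them significantly improves their ability to generate high-quality customer service responses.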

Details

Motivation: Existing dialogue datasets lack strategic guidance, and real-world service data is difficult to access and annotate, so a task is needed for training customer service agents to respond using well-defined support strategies. Method: The paper proposes a structured Customer Support Conversation (CSC) framework that defines five conversational stages and twelve strategies, constructs two datasets, CSConv for evaluation and RoleCS for training, and validates the approach by fine-tuning LLMs. Result: Experiments show that LLMs fine-tuned on RoleCS generate higher-quality, strategy-aligned responses on CSConv, and human evaluation confirms gains in problem resolution. Conclusion: By building a customer support conversation framework grounded in COPC guidelines, the paper improves the ability of LLMs to generate high-quality customer service responses and offers new ideas and methods for optimizing human-machine dialogue systems. Abstract: Effective customer support requires not only accurate problem solving but also structured and empathetic communication aligned with professional standards. However, existing dialogue datasets often lack strategic guidance, and real-world service data is difficult to access and annotate. To address this, we introduce the task of Customer Support Conversation (CSC), aimed at training customer service agents to respond using well-defined support strategies. We propose a structured CSC framework grounded in COPC guidelines, defining five conversational stages and twelve strategies to guide high-quality interactions. Based on this, we construct CSConv, an evaluation dataset of 1,855 real-world customer-agent conversations rewritten using LLMs to reflect deliberate strategy use, and annotated accordingly. Additionally, we develop a role-playing approach that simulates strategy-rich conversations using LLM-powered roles aligned with the CSC framework, resulting in the training dataset RoleCS. Experiments show that fine-tuning strong LLMs on RoleCS significantly improves their ability to generate high-quality, strategy-aligned responses on CSConv. Human evaluations further confirm gains in problem resolution. All code and data will be made publicly available at https://github.com/aliyun/qwen-dianjin.

[51] StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion

Yutong Wu,Di Huang,Ruosi Wan,Yue Peng,Shijie Shang,Chenrui Cao,Lei Qi,Rui Zhang,Zidong Du,Jie Yan,Xing Hu

Main category: cs.CL

TL;DR: This paper proposes ThinkingF, a method for improving autoformalization accuracy, and validates its effectiveness on multiple datasets.

Details

Motivation: Existing autoformalization methods still suffer from low accuracy, and models need stronger formal-language domain knowledge and better understanding of natural-language problems. Method: Two datasets are constructed, and SFT and RLVR are applied to improve the two key abilities behind autoformalization. Result: StepFun-Formalizer-32B achieves a BEq@1 score of 40.5% on FormalMATH-Lite and 26.7% on ProverBench. Conclusion: ThinkingF improves model performance on autoformalization, and StepFun-Formalizer-32B achieves state-of-the-art BEq@1 scores on FormalMATH-Lite and ProverBench. Abstract: Autoformalization aims to translate natural-language mathematical statements into a formal language. While LLMs have accelerated progress in this area, existing methods still suffer from low accuracy. We identify two key abilities for effective autoformalization: comprehensive mastery of formal-language domain knowledge, and reasoning capability of natural language problem understanding and informal-formal alignment. Without the former, a model cannot identify the correct formal objects; without the latter, it struggles to interpret real-world contexts and map them precisely into formal expressions. To address these gaps, we introduce ThinkingF, a data synthesis and training pipeline that improves both abilities. First, we construct two datasets: one by distilling and selecting large-scale examples rich in formal knowledge, and another by generating informal-to-formal reasoning trajectories guided by expert-designed templates. We then apply SFT and RLVR with these datasets to further fuse and refine the two abilities. The resulting 7B and 32B models exhibit both comprehensive formal knowledge and strong informal-to-formal reasoning. Notably, StepFun-Formalizer-32B achieves SOTA BEq@1 scores of 40.5% on FormalMATH-Lite and 26.7% on ProverBench, surpassing all prior general-purpose and specialized models.

[52] Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI

Rohaizah Abdul Wahid,Muhamad Said Nizamuddin Nadim,Suliana Sulaiman,Syahmi Akmal Shaharudin,Muhammad Danial Jupikil,Iqqwan Jasman Su Azlan Su

Main category: cs.CL

TL;DR: This paper explores the use of Generative AI to create high-quality educational assessments in Malaysia, particularly for low-resource languages like Bahasa Melayu. It compares four methods for generating Form 1 Mathematics MCQs using GPT-4o and finds that Retrieval-Augmented Generation-based methods perform best in ensuring curriculum alignment and factual accuracy.

Details

Motivation: There is a critical need for scalable and high-quality educational assessment tools in Malaysia, especially for low-resource languages like Bahasa Melayu. Generative AI has potential but faces challenges in factual accuracy and curriculum alignment. Method: Four incremental pipelines were used to generate Form 1 Mathematics MCQs in Bahasa Melayu using GPT-4o. The methods included non-grounded prompting (structured and basic) and Retrieval-Augmented Generation (RAG) approaches (one using LangChain, one manual). The system was grounded in official curriculum documents. Evaluation used a dual-pronged framework: Semantic Textual Similarity (STS) for curriculum alignment and a RAG-based QA method for factual validity. Result: RAG-based pipelines significantly outperformed non-grounded prompting methods in curriculum alignment and factual validity of generated questions. The study successfully introduced a validated methodology for curriculum-specific content generation and a novel RAG-QA evaluation technique. Conclusion: RAG-based pipelines are more effective in generating curriculum-aligned and factually valid questions compared to non-grounded prompting methods. Framework-based RAG offers ease of implementation, while manual pipelines allow fine-grained control. The study contributes a validated methodology for educational content generation in low-resource languages and provides insights for EdTech development in Malaysia and similar regions. Abstract: This paper addresses the critical need for scalable and high-quality educational assessment tools within the Malaysian education system. It highlights the potential of Generative AI (GenAI) while acknowledging the significant challenges of ensuring factual accuracy and curriculum alignment, especially for low-resource languages like Bahasa Melayu. 
This research introduces and compares four incremental pipelines for generating Form 1 Mathematics multiple-choice questions (MCQs) in Bahasa Melayu using OpenAI's GPT-4o. The methods range from non-grounded prompting (structured and basic) to Retrieval-Augmented Generation (RAG) approaches (one using the LangChain framework, one implemented manually). The system is grounded in official curriculum documents, including teacher-prepared notes and the yearly teaching plan (RPT). A dual-pronged automated evaluation framework is employed to assess the generated questions. Curriculum alignment is measured using Semantic Textual Similarity (STS) against the RPT, while contextual validity is verified through a novel RAG-based Question-Answering (RAG-QA) method. The results demonstrate that RAG-based pipelines significantly outperform non-grounded prompting methods, producing questions with higher curriculum alignment and factual validity. The study further analyzes the trade-offs between the ease of implementation of framework-based RAG and the fine-grained control offered by a manual pipeline. This work presents a validated methodology for generating curriculum-specific educational content in a low-resource language, introduces a symbiotic RAG-QA evaluation technique, and provides actionable insights for the development and deployment of practical EdTech solutions in Malaysia and similar regions.
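
To make the retrieval and alignment-scoring idea concrete, here is a minimal sketch that ranks curriculum chunks against a generated question using bag-of-words cosine similarity. This is a deliberately simple stand-in: the study's pipelines use GPT-4o, LangChain, and a Semantic Textual Similarity measure whose details are not given in the abstract, and the English chunk texts, the sample question, and the helper names `bag_of_words`, `cosine`, and `retrieve` are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude substitute for a sentence embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k (score, chunk) pairs most similar to the query."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical curriculum chunks and generated question.
curriculum_chunks = [
    "rational numbers integers fractions decimals",
    "ratios rates and proportions",
    "perimeter and area of triangles",
]
question = "What is the area of a triangle with base 4 and height 3?"

top_score, top_chunk = retrieve(question, curriculum_chunks, k=1)[0]
```

The retrieved chunk could be passed to the generator as grounding context, and the top score could serve as a rough alignment signal; an embedding-based similarity would be a closer analogue of the STS metric named above.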

[53] CALE : Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation

Bastien Liétard,Gabriel Loiseau

Main category: cs.CL

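TL;DR: This paper proposes an extended approach to studying lexical semantics (Concept Differentiation); by building a new dataset and fine-tuning models (CALE), it significantly improves how well models represent lexical meaning.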

Details

Motivation: The existing Word-in-Context task only compares different contexts of the same word, which limits the lexical semantic information it can capture, so the paper proposes an extended task that covers semantic relations between different words. Method: Building on Contextualized Language Models, the paper proposes the Concept Differentiation task and constructs a dataset for it, fine-tunes models on this dataset to produce Concept-Aligned Embeddings (CALE), and evaluates the models on several lexical semantic tasks. Result: The proposed Concept-Aligned Embeddings (CALE) achieve the best performance on several lexical semantic tasks, and fine-tuning improves the spatial organization of the embeddings. Conclusion: The paper proposes a new task, Concept Differentiation, and builds a corresponding dataset; the Concept-Aligned Embeddings (CALE) obtained by fine-tuning perform strongly on several lexical semantic tasks, demonstrating their effectiveness as multi-purpose representations of lexical meaning. Abstract: Lexical semantics is concerned with both the multiple senses a word can adopt in different contexts, and the semantic relations that exist between meanings of different words. To investigate them, Contextualized Language Models are a valuable tool that provides context-sensitive representations that can be used to investigate lexical meaning. Recent works like XL-LEXEME have leveraged the task of Word-in-Context to fine-tune them to get more semantically accurate representations, but Word-in-Context only compares occurrences of the same lemma, limiting the range of captured information. In this paper, we propose an extension, Concept Differentiation, to include inter-word scenarios. We provide a dataset for this task, derived from SemCor data. Then we fine-tune several representation models on this dataset. We call these models Concept-Aligned Embeddings (CALE). By challenging our models and other models on various lexical semantic tasks, we demonstrate that the proposed models provide efficient multi-purpose representations of lexical meaning that achieve the best performance in our experiments. We also show that CALE's fine-tuning brings valuable changes to the spatial organization of embeddings.

[54] StyliTruth : Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering

Chenglei Shen,Zhongxiang Sun,Teng Shi,Xiao Zhang,Jun Xu

Main category: cs.CL

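TL;DR: This paper proposes StyliTruth, which separates style and truth subspaces and designs steering vectors to address stylization-induced truthfulness collapse.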

Details

Motivation: Existing representation editing methods often overlook the impact on factuality when imposing a distinctive style, which reduces answer correctness; this paper aims to address that problem. Method: An orthogonal deflation process separates the style-relevant and truth-relevant subspaces in the model's representation space, and adaptive, token-level steering vectors are designed to control generation dynamically and precisely. Result: Experiments show that StyliTruth significantly reduces stylization-induced truthfulness collapse, and its effectiveness is validated across multiple styles and languages. Conclusion: StyliTruth effectively reduces stylization-induced truthfulness collapse, preserving truthfulness while maintaining stylization, and outperforms existing inference-time intervention methods. Abstract: Generating stylized large language model (LLM) responses via representation editing is a promising way for fine-grained output control. However, there exists an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model's core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose StyliTruth, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model's representation space via an orthogonal deflation process. This decomposition enables independent control of style and truth in their own subspaces, minimizing interference. By designing adaptive, token-level steering vectors within each subspace, we dynamically and precisely control the generation process to maintain both stylistic fidelity and truthfulness. We validate our method on multiple styles and languages. Extensive experiments and analyses show that StyliTruth significantly reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods in balancing style adherence with truthfulness.
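
The general idea of deflation can be illustrated with a small sketch that removes from a style direction any component lying in a truth subspace, so that steering along the result leaves the truth directions untouched. The abstract does not specify the actual procedure, which operates on attention-head representations, so the orthonormal-basis assumption, the three-dimensional toy vectors, and the helper names `dot` and `deflate` are illustrative only.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def deflate(vector, basis):
    """Remove from `vector` its projection onto each basis direction.

    Assumption: `basis` is orthonormal, so subtracting the projections one
    by one yields a residual orthogonal to the whole subspace.
    """
    residual = list(vector)
    for direction in basis:
        coeff = dot(residual, direction)
        residual = [r - coeff * d for r, d in zip(residual, direction)]
    return residual


# Toy example: a one-dimensional truth subspace and a style direction
# that partly overlaps with it (both hypothetical).
truth_basis = [[1.0, 0.0, 0.0]]
style_direction = [0.6, 0.8, 0.0]

clean_style = deflate(style_direction, truth_basis)
overlap = dot(clean_style, truth_basis[0])
```

After deflation the style direction has no component along the truth direction, which is the sense in which edits along it would not interfere with the truth subspace in this simplified setting.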

[55] Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning

Zhuang Chen,Guanqun Bi,Wen Zhang,Jiawei Hu,Aoyun Wang,Xiyao Xiao,Kun Feng,Minlie Huang

Main category: cs.CL

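TL;DR: This paper presents the C-MIND dataset, which combines multimodal information with clinical expertise to improve LLM performance in depression diagnosis.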

Details

Motivation: Research on automated depression assessment often relies on limited or non-clinically validated data and prioritizes complex model design over real-world effectiveness, so real-world data and more effective models are needed. Method: The paper introduces the C-MIND dataset, uses a range of models to analyze how tasks and modalities affect diagnostic performance, and explores the limitations of LLMs in clinical settings along with ways to improve them. Result: Behavioral signatures are analyzed using C-MIND, and LLMs guided by clinical expertise improve Macro-F1 score by up to 10%. Conclusion: C-MIND is a real-world clinical dataset that can be used for multimodal depression diagnosis, and LLMs guided by clinical expertise achieve better diagnostic performance. Abstract: Depression is a widespread mental disorder that affects millions worldwide. While automated depression assessment shows promise, most studies rely on limited or non-clinically validated data, and often prioritize complex model design over real-world effectiveness. In this paper, we aim to unveil the landscape of clinical depression assessment. We introduce C-MIND, a clinical neuropsychiatric multimodal diagnosis dataset collected over two years from real hospital visits. Each participant completes three structured psychiatric tasks and receives a final diagnosis from expert clinicians, with informative audio, video, transcript, and functional near-infrared spectroscopy (fNIRS) signals recorded. Using C-MIND, we first analyze behavioral signatures relevant to diagnosis. We train a range of classical models to quantify how different tasks and modalities contribute to diagnostic performance, and dissect the effectiveness of their combinations. We then explore whether LLMs can perform psychiatric reasoning like clinicians and identify their clear limitations in realistic clinical settings. In response, we propose to guide the reasoning process with clinical expertise, which consistently improves LLM diagnostic performance by up to 10% in Macro-F1 score. We aim to build an infrastructure for clinical depression assessment from both data and algorithmic perspectives, enabling C-MIND to facilitate grounded and reliable research for mental healthcare.

[56] Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration

Nuo Chen,Yicheng Tong,Jiaying Wu,Minh Duc Duong,Qian Wang,Qingyun Zou,Bryan Hooi,Bingsheng He

Main category: cs.CL

TL;DR: Structured multi-agent collaboration improves AI-driven research ideation, with leadership and cognitive diversity playing critical roles in generating high-quality proposals.

Details

Motivation: The motivation stems from the limitations of single-agent ideation in AI, which restricts creativity due to bounded knowledge and perspective. The study aims to explore whether collaborative, multi-agent approaches can enhance ideation quality. Method: The researchers proposed a cooperative multi-agent framework and systematically compared different configurations like group size, leadership structures, and team compositions. They assessed idea quality using agent-based scoring and human review across multiple dimensions. Result: Multi-agent discussions were found to significantly outperform solitary approaches. Teams with a designated leader produced more integrated and visionary proposals. Cognitive diversity was identified as a key driver of quality, but senior expertise was essential for achieving superior outcomes. Conclusion: The study concludes that structured multi-agent discussions outperform solitary ideation in generating high-quality research proposals, with cognitive diversity and a foundation of senior expertise being crucial for success. Abstract: While AI agents show potential in scientific ideation, most existing frameworks rely on single-agent refinement, limiting creativity due to bounded knowledge and perspective. Inspired by real-world research dynamics, this paper investigates whether structured multi-agent discussions can surpass solitary ideation. We propose a cooperative multi-agent framework for generating research proposals and systematically compare configurations including group size, leader-led versus leaderless structures, and team compositions varying in interdisciplinarity and seniority. To assess idea quality, we employ a comprehensive protocol with agent-based scoring and human review across dimensions such as novelty, strategic vision, and integration depth. Our results show that multi-agent discussions substantially outperform solitary baselines.
A designated leader acts as a catalyst, transforming discussion into more integrated and visionary proposals. Notably, we find that cognitive diversity is a primary driver of quality, yet expertise is a non-negotiable prerequisite, as teams lacking a foundation of senior knowledge fail to surpass even a single competent agent. These findings offer actionable insights for designing collaborative AI ideation systems and shed light on how team structure influences creative outcomes.

Magauiya Zhussip,Dmitriy Shopkhoev,Ammar Ali,Stamatios Lefkimmiatis

Main category: cs.CL

TL;DR: The paper introduces MASA, a structured weight sharing framework for transformer layers, which significantly reduces the attention module's parameters while maintaining performance. MASA operates as a drop-in replacement and offers a scalable blueprint for parameter-efficient models without sacrificing performance.

Details

Motivation: The high computational and memory demands of large language models (LLMs) hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations, while the repetitive layered structure of transformers implies significant inter-block redundancy, a dimension largely unexplored beyond key-value (KV) caching. Method: The study proposes MASA (Matrix Atom Sharing in Attention), a framework for structured weight sharing across transformer layers, inspired by dictionary learning in CNNs. It decomposes attention projection matrices into shared dictionary atoms and represents each layer's weights as linear combinations of shared matrix atoms. Result: MASA reduces the attention module's parameters by 66.7% while achieving on-par performance. It achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines, and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. In Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. Conclusion: MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance, and it can be employed on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance. Abstract: Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g. low-rank approximation, attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in CNNs, we propose a framework for structured weight sharing across transformer layers. 
Our approach decomposes attention projection matrices into shared dictionary atoms, reducing the attention module's parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer's weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance.
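The shared-atom idea described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the sizes (12 layers, 4 atoms, 8x8 matrices), the scalar per-layer coefficients, and the helper names `make_atoms`, `layer_weight`, and `param_counts` are all assumptions made for the example.

```python
import random

def make_atoms(num_atoms, dim, seed=0):
    """Create `num_atoms` shared dim x dim matrices (hypothetical random init)."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_atoms)]

def layer_weight(atoms, coeffs):
    """Build one layer's weight as a linear combination of the shared atoms."""
    dim = len(atoms[0])
    weight = [[0.0] * dim for _ in range(dim)]
    for c, atom in zip(coeffs, atoms):
        for i in range(dim):
            for j in range(dim):
                weight[i][j] += c * atom[i][j]
    return weight

def param_counts(num_layers, num_atoms, dim):
    """Parameters for independent per-layer matrices vs. shared atoms + coefficients."""
    dense = num_layers * dim * dim
    shared = num_atoms * dim * dim + num_layers * num_atoms
    return dense, shared

# Assumed toy configuration: 12 layers share 4 atoms.
atoms = make_atoms(num_atoms=4, dim=8)
coeffs_per_layer = [[1.0 if k == layer % 4 else 0.1 for k in range(4)]
                    for layer in range(12)]
weights = [layer_weight(atoms, c) for c in coeffs_per_layer]
dense, shared = param_counts(num_layers=12, num_atoms=4, dim=8)
print(f"dense={dense} shared={shared} reduction={1 - shared / dense:.1%}")
```

With one atom per three layers, the matrix parameters drop by roughly two thirds before counting the small coefficient overhead. Whether the paper's 66.7% figure arises from this exact ratio is an assumption here, not something the abstract states.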

[58] TURA: Tool-Augmented Unified Retrieval Agent for AI Search

Zhejun Zhao,Yuehu Dong,Alley Liu,Lixue Zheng,Pingsheng Liu,Dongdong Shen,Long Xia,Jiashu Zhao,Dawei Yin

Main category: cs.CL

TL;DR: TURA is a novel framework that combines RAG with agentic tool-use to handle both static and dynamic data, enabling real-time, robust AI search at scale.

Details

Motivation: Traditional RAG approaches struggle with real-time and structured queries involving dynamic data such as ticket availability or inventory. Current search engines are limited to indexing static pages, and academic research overlooks the need for dynamic sources like databases and APIs. This creates a gap in providing robust, real-time AI search solutions at scale. Method: The paper introduces TURA, a three-stage framework combining Retrieval-Augmented Generation (RAG) with agentic tool-use. It includes an Intent-Aware Retrieval module, a DAG-based Task Planner, and a Distilled Agent Executor to handle both static and dynamic data efficiently. Result: TURA enables systematic integration of static and dynamic information sources, allowing for real-time and interactive queries while meeting low-latency demands. It serves tens of millions of users and delivers robust, real-time answers in a large-scale industrial system. Conclusion: TURA effectively bridges the gap between static RAG and dynamic information sources, providing a robust and real-time AI search solution for large-scale industrial applications. Abstract: The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availability or inventory. Limited to indexing static pages, search engines cannot perform the interactive queries needed for such time-sensitive data. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources like databases and real-time APIs. 
To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information. TURA has three key components: an Intent-Aware Retrieval module to decompose queries and retrieve information sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph (DAG) for optimal parallel execution, and a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it leverages an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system.
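The DAG-based planning step described above can be illustrated with a short sketch that groups tasks into waves that could run in parallel. This is an assumption-laden toy, not TURA's planner: the function `parallel_waves`, the task names, and the choice of a Kahn-style level-by-level traversal are all hypothetical.

```python
def parallel_waves(deps):
    """Group tasks into waves; every task in a wave has all dependencies satisfied.

    `deps` maps each task name to the set of task names it depends on.
    Every dependency must itself appear as a key.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    waves = []
    while remaining:
        # Tasks with no unmet dependencies can run concurrently.
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        waves.append(ready)
        for task in ready:
            del remaining[task]
        for d in remaining.values():
            d.difference_update(ready)
    return waves

# Hypothetical query plan: three independent lookups feed one final step.
plan = {
    "search_flights": set(),
    "search_hotels": set(),
    "check_weather": set(),
    "compose_answer": {"search_flights", "search_hotels", "check_weather"},
}
waves = parallel_waves(plan)
for i, wave in enumerate(waves):
    print(i, wave)
```

Independent tool calls land in the same wave and can be dispatched together, which is one plausible way a DAG representation helps with latency.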

[59] Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider

Chirag Seth,Utkarsh Singh

Main category: cs.CL

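TL;DR: This study evaluates T5-Small, BART-Small, and GPT-2 on natural-language-to-SQL translation and finds that T5-Small performs best in resource-constrained settings, indicating that compact transformer models hold promise for low-resource scenarios.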

Details

Motivation: Natural-language-to-SQL translation lets non-expert users query relational databases in natural language, with broad applications in education and business intelligence. The study is motivated by evaluating lightweight transformer models in low-resource settings to explore their potential in resource-constrained scenarios. Method: The study runs experiments on the Spider dataset with three lightweight transformer models (T5-Small, BART-Small, and GPT-2), develops a reusable, model-agnostic pipeline that adapts schema formatting to each model architecture, and trains the models for 1000 to 5000 iterations. Result: The fine-tuned T5-Small achieves the best Logical Form Accuracy (LFAcc) at 27.8%, ahead of BART-Small (23.98%) and GPT-2 (20.1%). The study also shows that, although resource constraints limit performance, the modular pipeline supports future improvements. Conclusion: The study concludes that compact transformer models such as T5-Small can offer viable natural-language-to-SQL solutions in resource-constrained environments, and it highlights the advantage of encoder-decoder models in schema-aware SQL generation. Abstract: Text-to-SQL translation enables non-expert users to query relational databases using natural language, with applications in education and business intelligence. This study evaluates three lightweight transformer models - T5-Small, BART-Small, and GPT-2 - on the Spider dataset, focusing on low-resource settings. We developed a reusable, model-agnostic pipeline that tailors schema formatting to each model's architecture, training them across 1000 to 5000 iterations and evaluating on 1000 test samples using Logical Form Accuracy (LFAcc), BLEU, and Exact Match (EM) metrics. Fine-tuned T5-Small achieves the highest LFAcc (27.8%), outperforming BART-Small (23.98%) and GPT-2 (20.1%), highlighting encoder-decoder models' superiority in schema-aware SQL generation. Despite resource constraints limiting performance, our pipeline's modularity supports future enhancements, such as advanced schema linking or alternative base models. This work underscores the potential of compact transformers for accessible text-to-SQL solutions in resource-scarce environments.

[60] P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis

Feifan Song,Bofei Gao,Yifan Song,Yi Liu,Weimin Xiong,Yuyang Song,Tianyu Liu,Guoyin Wang,Houfeng Wang

Main category: cs.CL

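TL;DR: This paper proposes P-Aligner, an efficient instruction pre-alignment module that significantly improves the ability of large language models to generate content aligned with human preferences.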

Details

Motivation: Large language models often fail to generate content aligned with human preferences when given flawed instructions, so an efficient and effective way to refine instructions is needed. Method: P-Aligner is a lightweight module trained on UltraPrompt, a dataset synthesized with a principle-guided pipeline and Monte-Carlo Tree Search. Result: P-Aligner outperforms baselines across multiple models and benchmarks, including average win-rate gains of 28.35% and 8.69% on GPT-4-turbo and Gemma-2-SimPO, respectively. Conclusion: By pre-aligning instructions, P-Aligner significantly improves the ability of large language models to generate safe, helpful, and honest content, and it outperforms existing methods in both efficiency and effectiveness. Abstract: Large Language Models (LLMs) are expected to produce safe, helpful, and honest content during interaction with human users, but they frequently fail to align with such values when given flawed instructions, e.g., missing context, ambiguous directives, or inappropriate tone, leaving substantial room for improvement along multiple dimensions. A cost-effective yet high-impact remedy is to pre-align instructions before the model begins decoding. Existing approaches either rely on prohibitive test-time search costs or end-to-end model rewrite, which is powered by a customized training corpus with unclear objectives. In this work, we demonstrate that the goal of efficient and effective preference alignment can be achieved by P-Aligner, a lightweight module generating instructions that preserve the original intents while being expressed in a more human-preferred form. P-Aligner is trained on UltraPrompt, a new dataset synthesized via a proposed principle-guided pipeline using Monte-Carlo Tree Search, which systematically explores the space of candidate instructions that are closely tied to human preference. Experiments across different methods show that P-Aligner generally outperforms strong baselines across various models and benchmarks, including average win-rate gains of 28.35% and 8.69% on GPT-4-turbo and Gemma-2-SimPO, respectively. Further analyses validate its effectiveness and efficiency through multiple perspectives, including data quality, search strategies, iterative deployment, and time overhead.

[61] IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

Xu Guo,Tianyi Liang,Tong Jian,Xiaogui Yang,Ling-I Wu,Chenhui Li,Zhihui Lu,Qipeng Guo,Kai Chen

Main category: cs.CL

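TL;DR: IFDecorator is an improved RLVR training framework that effectively strengthens the instruction-following ability of large language models and prevents reward hacking.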

Details

Motivation: Existing RLVR methods fall short on training efficiency and over-optimization, so a more effective approach is needed to improve LLMs' instruction following and prevent reward hacking. Method: IFDecorator uses a cooperative-adversarial data flywheel to generate more challenging instruction-verification pairs, an IntentCheck module to enforce intent alignment, and a trip-wire mechanism to detect reward hacking. Result: Qwen2.5-32B-Instruct-IFDecorator reaches 87.43% accuracy on IFEval, outperforming larger models such as GPT-4o, and shows substantial improvements on FollowBench. Conclusion: IFDecorator significantly improves large language models on instruction-following tasks while effectively reducing reward hacking, with good robustness and sample efficiency. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator), a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.

[62] Can NLP Tackle Hate Speech in the Real World? Stakeholder-Informed Feedback and Survey on Counterspeech

Tanvi Dinkar,Aiqi Jiang,Simona Frenda,Poppy Gerrard-Abbott,Nancie Gunson,Gavin Abercrombie,Ioannis Konstas

Main category: cs.CL

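TL;DR: This paper systematically reviews 74 NLP studies, finds that current counterspeech research has drifted away from the needs of affected communities, and offers recommendations for bringing stakeholders back into the research.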

Details

Motivation: Although counterspeech has gained traction in NLP as a promising intervention, recent research trends have shifted toward automated pipelines that overlook input from affected communities. Method: The paper conducts a systematic review of 74 NLP studies on counterspeech and a participatory case study with five NGOs specialising in online Gender-Based Violence (oGBV) to identify stakeholder-informed practices for counterspeech generation. Result: The findings reveal how stakeholder participation influences dataset creation, model development, and evaluation, and identify stakeholder-informed practices for counterspeech generation. Conclusion: There is a growing disconnect between current NLP research and the needs of the communities most affected by toxic online content; the paper offers concrete recommendations for re-centring stakeholder expertise in counterspeech research. Abstract: Counterspeech, i.e. the practice of responding to online hate speech, has gained traction in NLP as a promising intervention. While early work emphasised collaboration with non-governmental organisation stakeholders, recent research trends have shifted toward automated pipelines that reuse a small set of legacy datasets, often without input from affected communities. This paper presents a systematic review of 74 NLP studies on counterspeech, analysing the extent to which stakeholder participation influences dataset creation, model development, and evaluation. To complement this analysis, we conducted a participatory case study with five NGOs specialising in online Gender-Based Violence (oGBV), identifying stakeholder-informed practices for counterspeech generation. Our findings reveal a growing disconnect between current NLP research and the needs of communities most impacted by toxic online content. We conclude with concrete recommendations for re-centring stakeholder expertise in counterspeech research.

[63] Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

Noah Ziems,Dilara Soylu,Lakshya A Agrawal,Isaac Miller,Liheng Lai,Chen Qian,Kaiqiang Song,Meng Jiang,Dan Klein,Matei Zaharia,Karel D'Oosterlinck,Christopher Potts,Omar Khattab

Main category: cs.CL

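TL;DR: This paper introduces mmGRPO, a generalization of GRPO for modular AI systems that, combined with automatic prompt optimization, significantly improves accuracy.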

Details

Motivation: AI systems are increasingly complex, composed of multiple LM calls and other tools, and it is unclear how best to use GRPO to improve them. Method: Defines mmGRPO, a simple multi-module generalization of GRPO, and composes it with automatic prompt optimization. Result: Across classification, many-hop search, and privacy-preserving delegation tasks, accuracy improves by 11% on average over the post-trained LM and by 5% over prompt optimization alone. Conclusion: mmGRPO composed with automatic prompt optimization significantly improves accuracy and provides an effective means of optimizing modular AI systems. Abstract: Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the dspy.GRPO optimizer.
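To make the phrase "groups LM calls by module across rollouts" concrete, here is a rough sketch of group-relative advantages computed per module. It is not the `dspy.GRPO` implementation: the rollout dictionaries, the function `module_advantages`, and the assumption that every LM call simply inherits its rollout's final reward are all hypothetical simplifications.

```python
from collections import defaultdict
from statistics import mean, pstdev

def module_advantages(rollouts, eps=1e-8):
    """Normalize rewards within each module's group of LM calls.

    Each rollout is {"modules": [names of modules called], "reward": float}.
    Assumption: a call's reward is the final reward of the rollout it belongs to.
    """
    groups = defaultdict(list)
    for rollout_idx, rollout in enumerate(rollouts):
        for module in rollout["modules"]:
            groups[module].append((rollout_idx, rollout["reward"]))

    advantages = {}
    for module, entries in groups.items():
        rewards = [reward for _, reward in entries]
        mu, sigma = mean(rewards), pstdev(rewards)
        # GRPO-style group-relative advantage: (reward - group mean) / group std.
        advantages[module] = [(idx, (reward - mu) / (sigma + eps))
                              for idx, reward in entries]
    return advantages

# Hypothetical rollouts with variable length; the second stops before "answer".
rollouts = [
    {"modules": ["classify", "search", "answer"], "reward": 1.0},
    {"modules": ["classify", "search"], "reward": 0.0},
    {"modules": ["classify", "search", "search", "answer"], "reward": 0.5},
]
adv = module_advantages(rollouts)
for module, values in adv.items():
    print(module, [(idx, round(a, 3)) for idx, a in values])
```

Grouping by module means an interrupted rollout still contributes to the groups of the modules it did reach, and a module called twice in one rollout contributes two entries to its group.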

[64] Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management

Mo Li,L. H. Xu,Qitai Tan,Ting Cao,Yunxin Liu

Main category: cs.CL

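TL;DR: Sculptor uses active context management (fragmentation, summarization, intelligent search) to mitigate proactive interference when large language models process long contexts, improving reasoning reliability across diverse long-context tasks.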

Details

Motivation: Large language models degrade significantly on long contexts because of proactive interference, where earlier irrelevant information disrupts reasoning and memory recall; current research mostly focuses on external memory systems and lacks methods for actively managing the model's internal working memory. Method: The Sculptor framework implements active context management through three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) intelligent search. Result: Experimental evaluation shows that Sculptor significantly improves performance on information-sparse benchmarks (such as PI-LLM and NeedleBench Multi-Needle Reasoning) without specific training, drawing on the inherent tool-calling generalization of large language models. Conclusion: Sculptor offers an effective approach to active context management that mitigates proactive interference in long-context processing, and it underscores the importance of explicit context-control strategies for reliability at scale. Abstract: Large Language Models (LLMs) suffer from significant performance degradation when processing long contexts due to proactive interference, where irrelevant information in earlier parts of the context disrupts reasoning and memory recall. While most research focuses on external memory systems to augment LLMs' capabilities, we propose a complementary approach: empowering LLMs with Active Context Management (ACM) tools to actively sculpt their internal working memory. We introduce Sculptor, a framework that equips LLMs with three categories of tools: (1) context fragmentation, (2) summary, hide, and restore, and (3) intelligent search. Our approach enables LLMs to proactively manage their attention and working memory, analogous to how humans selectively focus on relevant information while filtering out distractions. Experimental evaluation on information-sparse benchmarks, namely PI-LLM (proactive interference) and NeedleBench Multi-Needle Reasoning, demonstrates that Sculptor significantly improves performance even without specific training, leveraging LLMs' inherent tool calling generalization capabilities. By enabling Active Context Management, Sculptor not only mitigates proactive interference but also provides a cognitive foundation for more reliable reasoning across diverse long-context tasks, highlighting that explicit context-control strategies, rather than merely larger token windows, are key to robustness at scale.

[65] GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay

Yunan Zhang,Shuoran Jiang,Mengchen Zhao,Yuefeng Li,Yang Fan,Xiangping Wu,Qingcai Chen

Main category: cs.CL

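TL;DR: This paper proposes GeRe, a framework that uses pretraining texts and a neural-state-based optimization method to address forgetting in continual learning of large language models.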

Details

Motivation: Large language models suffer from catastrophic forgetting during continual fine-tuning, so a simple and stable remedy is needed. Method: Introduces an enhanced activation-state-constrained optimization method based on neural states, using a threshold-based margin (TM) loss to keep activation states consistent during replay learning. Result: Experiments show that, compared with different replay strategies, TM consistently improves performance and robustness. Conclusion: The GeRe framework is shown to effectively address forgetting in continual learning while improving overall performance across sequential tasks. Abstract: The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continually fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines in previously learned tasks. To simultaneously address both issues in a simple yet stable manner, we propose General Sample Replay (GeRe), a framework that uses ordinary pretraining texts for efficient anti-forgetting. Beyond revisiting the most prevalent replay-based practices under GeRe, we further leverage neural states to introduce an enhanced activation states constrained optimization method using threshold-based margin (TM) loss, which maintains activation state consistency during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns: retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former can inherently facilitate the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence and feature imitation via L1/L2 losses. Results demonstrate that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay of LLMs for the future. Our code and data are available at https://github.com/Qznan/GeRe.

[66] FaST: Feature-aware Sampling and Tuning for Personalized Preference Alignment with Limited Data

Thibaut Thonet,Germán Kruszewski,Jos Rozen,Pierre Erbacher,Marc Dymetman

Main category: cs.CL

TL;DR: This paper studies how to tune LLM parameters efficiently for personalized preference alignment when data is limited, and proposes a new method, FaST.

Details

Motivation: Existing LLM conversational assistants are usually deployed in a one-size-fits-all manner and cannot accommodate individual user preferences. Method: Proposes FaST, which leverages high-level features automatically discovered from the data to achieve parameter-efficient tuning. Result: On the two newly introduced datasets, DnD and ELIP, FaST achieves the best overall performance. Conclusion: The paper proposes FaST, an efficient method for personalized preference alignment, and introduces two datasets, DnD and ELIP, to support research in this area. Abstract: LLM-powered conversational assistants are often deployed in a one-size-fits-all manner, which fails to accommodate individual user preferences. Recently, LLM personalization -- tailoring models to align with specific user preferences -- has gained increasing attention as a way to bridge this gap. In this work, we specifically focus on a practical yet challenging setting where only a small set of preference annotations can be collected per user -- a problem we define as Personalized Preference Alignment with Limited Data (PPALLI). To support research in this area, we introduce two datasets -- DnD and ELIP -- and benchmark a variety of alignment techniques on them. We further propose FaST, a highly parameter-efficient approach that leverages high-level features automatically discovered from the data, achieving the best overall performance.

[67] Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

Anushka Yadav,Isha Nalawade,Srujana Pillarichety,Yashwanth Babu,Reshmi Ghosh,Samyadeep Basu,Wenlong Zhao,Ali Nasaeh,Sriram Balasubramanian,Soundararajan Srinivasan

Main category: cs.CL

TL;DR: This paper investigates why reasoning models hallucinate more than general language models by introducing a new framework for analyzing errors in multi-hop question answering tasks, revealing insights into their cognitive limitations.

Details

Motivation: The motivation is to understand why reasoning models hallucinate more than general-purpose language models, particularly in tasks requiring complex, multi-step thought processes like multi-hop question answering. Method: The authors used a mixed-method approach involving rigorous human annotation and complementary automated metrics to analyze reasoning failures in contemporary language models on multi-hop question answering tasks. Result: A novel error categorization framework was developed, uncovering intricate error patterns across three dimensions: diversity of source documents (hops), completeness of information capture (coverage), and cognitive inefficiency (overthinking). Conclusion: The study concludes that reasoning models have cognitive limitations that lead to hallucinations, and the proposed error categorization framework can provide deeper insights into these limitations, guiding future improvements in reasoning fidelity, transparency, and robustness. Abstract: The emergence of reasoning models and their integration into practical AI chat bots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that require a complex and multi-step thought process. Yet, a complete understanding of why these models hallucinate more than general purpose language models is missing. In this investigative study, we systematically explore reasoning failures of contemporary language models on multi-hop question answering tasks. We introduce a novel, nuanced error categorization framework that examines failures across three critical dimensions: the diversity and uniqueness of source documents involved ("hops"), completeness in capturing relevant information ("coverage"), and cognitive inefficiency ("overthinking"). Through rigorous human annotation, supported by complementary automated metrics, our exploration uncovers intricate error patterns often hidden by accuracy-centric evaluations.
This investigative approach provides deeper insights into the cognitive limitations of current models and offers actionable guidance toward enhancing reasoning fidelity, transparency, and robustness in future language modeling efforts.

cs.CV [Back]

[68] Text2VR: Automated instruction Generation in Virtual Reality using Large language Models for Assembly Task

Subin Raj Peter

Main category: cs.CV

TL;DR: This paper introduces a system that uses Large Language Models (LLMs) to automate the creation of VR training content, making development faster, easier, and more adaptable to changing industrial needs.

Details

Motivation: VR training is resource-intensive and time-consuming to develop; automation through LLMs can reduce overhead and improve scalability. Method: The system uses an LLM module to extract task-relevant information from textual input, which is then interpreted by an intelligent module to generate animated demonstrations and visual cues within a VR environment. Result: The approach successfully transforms textual input into interactive VR training content, enhancing training effectiveness by using animations and visual cues. Conclusion: The proposed approach effectively enhances VR-based training by automating content generation through the integration of LLMs and intelligent modules, making training more scalable and adaptable. Abstract: Virtual Reality (VR) has emerged as a powerful tool for workforce training, offering immersive, interactive, and risk-free environments that enhance skill acquisition, decision-making, and confidence. Despite its advantages, developing VR applications for training remains a significant challenge due to the time, expertise, and resources required to create accurate and engaging instructional content. To address these limitations, this paper proposes a novel approach that leverages Large Language Models (LLMs) to automate the generation of virtual instructions from textual input. The system comprises two core components: an LLM module that extracts task-relevant information from the text, and an intelligent module that transforms this information into animated demonstrations and visual cues within a VR environment. The intelligent module receives input from the LLM module and interprets the extracted information. Based on this, an instruction generator creates training content using relevant data from a database. The instruction generator generates the instruction by changing the color of virtual objects and creating animations to illustrate tasks. 
This approach enhances training effectiveness and reduces development overhead, making VR-based training more scalable and adaptable to evolving industrial needs.

[69] Outlier Detection Algorithm for Circle Fitting

Ahmet Gökhan Poyraz

Main category: cs.CV

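TL;DR: This paper proposes PCOD, a new polar-coordinate-based outlier detection algorithm for improving circle-fitting accuracy in industrial settings.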

Details

Motivation: Noisy data can severely degrade circle fitting algorithms, so effective outlier detection and removal is needed. Method: PCOD transforms the point set into polar coordinates, computes local and global standard deviations, and identifies outliers. Result: PCOD delivers the best performance on the dataset, with the highest accuracy. Conclusion: PCOD performs well for circle fitting in industrial environments and improves accuracy. Abstract: Circle fitting methods are extensively utilized in various industries, particularly in quality control processes and design applications. The effectiveness of these algorithms can be significantly compromised when the point sets to be predicted are noisy. To mitigate this issue, outlier detection and removal algorithms are often applied before the circle fitting procedure. This study introduces the Polar Coordinate-Based Outlier Detection (PCOD) algorithm, which can be effectively employed in circle fitting applications. In the proposed approach, the point set is first transformed into polar coordinates, followed by the calculation of both local and global standard deviations. Outliers are then identified by comparing local mean values with the global standard deviation. The practicality and efficiency of the proposed method are demonstrated by focusing on the high-precision diameter measurement of industrial washer parts. Images from a machine vision system are processed through preprocessing steps, including sub-pixel edge detection. The resulting sub-pixel edge points are then cleaned using the proposed outlier detection and removal algorithm, after which circle fitting is performed. A comparison is made using ten different circle fitting algorithms and five distinct outlier detection methods. The results indicate that the proposed method outperforms the other approaches, delivering the best performance in terms of accuracy within the dataset, thereby demonstrating its potential for enhancing circle fitting applications in industrial environments.
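The abstract gives only the outline of PCOD, so the following is a loose sketch of one way a polar-coordinate outlier test could work, not the published algorithm. The centroid as pole, the neighbour window of three points on each side, the threshold factor `k`, and the rule "flag a point whose radius differs from its local mean radius by more than `k` global standard deviations" are all assumptions, as is the function name `polar_outliers`.

```python
import math
from statistics import mean, pstdev

def polar_outliers(points, half_window=3, k=2.0, floor=1e-9):
    """Return sorted indices of points whose radius deviates from the local mean.

    Assumes len(points) > 2 * half_window and points roughly on one circle.
    """
    # Use the centroid as the pole of the polar coordinate system (assumption).
    cx = mean(p[0] for p in points)
    cy = mean(p[1] for p in points)
    # (angle, radius, original index), sorted by angle around the pole.
    polar = sorted(
        (math.atan2(y - cy, x - cx), math.hypot(x - cx, y - cy), i)
        for i, (x, y) in enumerate(points)
    )
    radii = [r for _, r, _ in polar]
    threshold = k * max(pstdev(radii), floor)  # global spread of radii
    n = len(polar)
    flagged = []
    for pos, (_, r, idx) in enumerate(polar):
        # Local mean over angular neighbours, excluding the point itself.
        neighbours = [radii[(pos + off) % n]
                      for off in range(-half_window, half_window + 1) if off != 0]
        if abs(r - mean(neighbours)) > threshold:
            flagged.append(idx)
    return sorted(flagged)

# Synthetic edge points: 60 samples on a radius-10 circle, one pushed out to 15.
points = [(10 * math.cos(2 * math.pi * i / 60), 10 * math.sin(2 * math.pi * i / 60))
          for i in range(60)]
points[7] = (15 * math.cos(2 * math.pi * 7 / 60), 15 * math.sin(2 * math.pi * 7 / 60))
print(polar_outliers(points))
```

The flagged points would then be removed before running whichever circle fitting method is in use. The synthetic data is invented for illustration and does not come from the paper.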

[70] Enhancing Diameter Measurement Accuracy in Machine Vision Applications

Ahmet Gokhan Poyraz,Ahmet Emir Dirik,Hakan Gurkan,Mehmet Kacmaz

Main category: cs.CV

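TL;DR: The study proposes two new methods that use multiple known reference parts to improve the accuracy of camera measurement systems, significantly reducing measurement error.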

Details

Motivation: Even with specialized equipment such as telecentric lenses, camera measurement systems can still exhibit measurement errors due to mechanical and software-related factors, especially when parts of different diameters are measured. Method: The study proposes two approaches to improve measurement accuracy: a conversion factor-based method and a pixel-based method. Result: Experiments show that measurement errors were reduced from 13-114 micrometers to 1-2 micrometers, a significant accuracy improvement. Conclusion: Using only a few known reference parts, the proposed methods enable high-accuracy measurement of all parts within the camera's field of view while significantly reducing error rates and improving measurement reliability. Abstract: In camera measurement systems, specialized equipment such as telecentric lenses is often employed to measure parts with narrow tolerances. However, despite the use of such equipment, measurement errors can occur due to mechanical and software-related factors within the system. These errors are particularly evident in applications where parts of different diameters are measured using the same setup. This study proposes two innovative approaches to enhance measurement accuracy using multiple known reference parts: a conversion factor-based method and a pixel-based method. In the first approach, the conversion factor is estimated from known references to calculate the diameter (mm) of the unknown part. In the second approach, the diameter (mm) is directly estimated using pixel-based diameter information from the references. The experimental setup includes an industrial-grade camera and telecentric lenses. Tests conducted on glass samples (1-12 mm) and metal workpieces (3-24 mm) show that measurement errors, which originally ranged from 13-114 micrometers, were reduced to 1-2 micrometers using the proposed methods. By utilizing only a few known reference parts, the proposed approach enables high-accuracy measurement of all parts within the camera's field of view. Additionally, this method enhances the existing diameter measurement literature by significantly reducing error rates and improving measurement reliability.
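The two calibration ideas in this abstract can be sketched as follows. The abstract does not say how the estimates are computed, so the linear least-squares fits, the function names, and the synthetic reference values below are all assumptions chosen for illustration rather than the authors' procedure or data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def diameter_via_conversion_factor(pixels, refs):
    """Approach 1 (assumed form): model the mm-per-pixel factor as a function
    of pixel diameter, then multiply it by the measured pixel diameter."""
    xs = [px for px, _ in refs]
    factors = [mm / px for px, mm in refs]
    slope, intercept = fit_line(xs, factors)
    return (slope * pixels + intercept) * pixels

def diameter_via_pixels(pixels, refs):
    """Approach 2 (assumed form): map pixel diameter directly to millimetres."""
    slope, intercept = fit_line([px for px, _ in refs], [mm for _, mm in refs])
    return slope * pixels + intercept

# Synthetic references as (pixel diameter, true diameter in mm). They follow
# mm = 0.01 * px + 0.02, i.e. a constant offset that one fixed factor would miss.
refs = [(300.0, 3.02), (600.0, 6.02), (1200.0, 12.02)]
unknown_px = 900.0
print(diameter_via_conversion_factor(unknown_px, refs))
print(diameter_via_pixels(unknown_px, refs))
```

On this synthetic data the direct pixel-to-mm fit recovers the offset exactly, while the factor-based estimate is only approximate because the factor varies nonlinearly with pixel diameter. That behaviour is a property of the toy data, not a result reported in the paper.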

[71] Multimodal Video Emotion Recognition with Reliable Reasoning Priors

Zhepeng Wang,Yingjian Zhu,Guanghao Dong,Hongzhu Yi,Feng Chen,Xinming Wang,Jun Xie

Main category: cs.CV

TL;DR: This study improves multimodal emotion recognition by integrating prior reasoning knowledge from MLLMs and addressing class imbalance through a novel loss formulation.

Details

Motivation: The motivation behind the study is to improve multimodal emotion recognition by incorporating reliable prior reasoning knowledge from MLLMs and addressing the issue of class imbalance in such tasks. Method: The researchers used Gemini to generate fine-grained, modality-separable reasoning traces, which were used as priors during the fusion stage. They also introduced Balanced Dual-Contrastive Learning to address class imbalance. Result: The proposed framework, when applied to the MER2024 benchmark, achieved significant performance improvements in emotion recognition. Conclusion: The study concludes that integrating trustworthy prior reasoning knowledge from MLLMs with lightweight fusion networks enhances the performance of multimodal emotion recognition, making it more robust and scalable. Abstract: This study investigates the integration of trustworthy prior reasoning knowledge from MLLMs into multimodal emotion recognition. We employ Gemini to generate fine-grained, modality-separable reasoning traces, which are injected as priors during the fusion stage to enrich cross-modal interactions. To mitigate the pronounced class-imbalance in multimodal emotion recognition, we introduce Balanced Dual-Contrastive Learning, a loss formulation that jointly balances inter-class and intra-class distributions. Applied to the MER2024 benchmark, our prior-enhanced framework yields substantial performance gains, demonstrating that the reliability of MLLM-derived reasoning can be synergistically combined with the domain adaptability of lightweight fusion networks for robust, scalable emotion recognition.

[72] From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

Jia Li,Yapeng Tian

Main category: cs.CV

TL;DR: This survey reviews research progress in audio-visual segmentation, analyzes existing methods and challenges, and proposes future directions.

Details

Motivation: AVS is an important research area in multimodal perception that enables fine-grained, object-level understanding, but it faces challenges such as limited temporal modeling, modality bias toward vision, insufficient robustness in complex environments, and high computational demands. Method: The paper analyzes methods in the AVS field, including unimodal and multimodal encoding architectures, audio-visual fusion strategies, decoder designs, and major training paradigms. Result: The paper provides an extensive comparison of AVS methods on standard benchmarks, highlighting the impact of architectural choices, fusion strategies, and training paradigms on performance. Conclusion: The survey reviews the current state, challenges, and future directions of audio-visual segmentation (AVS), and proposes directions for improving AVS systems, including better temporal reasoning and multimodal fusion, leveraging foundation models, and reducing reliance on labeled data. Abstract: Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained object-level understanding. In this survey, we present a comprehensive overview of the AVS field, covering its problem formulation, benchmark datasets, evaluation metrics, and the progression of methodologies. We analyze a wide range of approaches, including architectures for unimodal and multimodal encoding, key strategies for audio-visual fusion, and various decoder designs. Furthermore, we examine major training paradigms, from fully supervised learning to weakly supervised and training-free methods. Notably, we provide an extensive comparison of AVS methods across standard benchmarks, highlighting the impact of different architectural choices, fusion strategies, and training paradigms on performance. Finally, we outline the current challenges, such as limited temporal modeling, modality bias toward vision, lack of robustness in complex environments, and high computational demands, and propose promising future directions, including improving temporal reasoning and multimodal fusion, leveraging foundation models for better generalization and few-shot learning, reducing reliance on labeled data through self- and weakly supervised learning, and incorporating higher-level reasoning for more intelligent AVS systems.

[73] A Large Language Model Powered Integrated Circuit Footprint Geometry Understanding

Yida Wang,Taiting Lu,Runze Liu,Lanqing Yang,Yifan Yang,Zhe Chen,Yuehai Wang,Yixin Liu,Kaiyuan Lin,Xiaomeng Chen,Dian Ding,Yijie Li,Yi-Chao Chen,Yincheng Jin,Mahanth Gowda

Main category: cs.CV

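TL;DR: The paper proposes LLM4-IC8K, a new framework for IC footprint geometry labeling. It is built on large language models and trained on synthetic and real data to improve performance.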

Details

Motivation: IC footprint geometry labeling is essential for defining the physical interface between components and the PCB layout, but because footprint drawings are unstructured and their diagram annotations abstract, automated parsing and accurate modeling remain highly challenging. No method currently exists for automated footprint geometry labeling directly from IC mechanical drawings. Method: LLM4-IC8K treats IC mechanical drawings as images and uses large language models for structured geometric interpretation. Training has two stages: a multimodal model is first trained on synthetically generated IC footprint diagrams to learn basic geometric reasoning, then fine-tuned on real datasheet drawings to improve robustness and accuracy in practice. The authors also introduce ICGeo8K, a multimodal dataset with 8,608 labeled samples. Result: Experiments show that LLM4-IC8K outperforms existing multimodal models on the proposed benchmark. Conclusion: The proposed LLM4-IC8K framework performs strongly on IC footprint geometry labeling and surpasses existing multimodal models. Abstract: Printed-Circuit-board (PCB) footprint geometry labeling of integrated circuits (IC) is essential in defining the physical interface between components and the PCB layout, requiring exceptional visual perception proficiency. However, due to the unstructured footprint drawing and abstract diagram annotations, automated parsing and accurate footprint geometry modeling remain highly challenging. Despite its importance, no methods currently exist for automated package geometry labeling directly from IC mechanical drawings. In this paper, we first investigate the visual perception performance of Large Multimodal Models (LMMs) when solving IC footprint geometry understanding. Our findings reveal that current LMMs severely suffer from inaccurate geometric perception, which hinders their performance in solving the footprint geometry labeling problem. To address these limitations, we propose LLM4-IC8K, a novel framework that treats IC mechanical drawings as images and leverages LLMs for structured geometric interpretation. To mimic the step-by-step reasoning approach used by human engineers, LLM4-IC8K addresses three sub-tasks: perceiving the number of pins, computing the center coordinates of each pin, and estimating the dimensions of individual pins. We present a two-stage framework that first trains LMMs on synthetically generated IC footprint diagrams to learn fundamental geometric reasoning and then fine-tunes them on real-world datasheet drawings to enhance robustness and accuracy in practical scenarios. 
To support this, we introduce ICGeo8K, a multi-modal dataset with 8,608 labeled samples, including 4138 hand-crafted IC footprint samples and 4470 synthetically generated samples. Extensive experiments demonstrate that our model outperforms state-of-the-art LMMs on the proposed benchmark.

[74] TIR-Diffusion: Diffusion-based Thermal Infrared Image Denoising via Latent and Wavelet Domain Optimization

Tai Hyoung Rhee,Dong-guw Lee,Ayoung Kim

Main category: cs.CV

TL;DR: This paper proposes a diffusion-based framework for denoising thermal infrared (TIR) images using latent-space and wavelet-domain optimization, achieving high-quality results and strong generalization for robotic perception tasks.

Details

Motivation: Thermal infrared (TIR) imaging is valuable for robotic perception in poor visibility or challenging lighting conditions, but TIR images suffer from heavy non-uniform fixed-pattern noise, complicating tasks like object detection, localization, and mapping. Method: A diffusion-based TIR image denoising framework leveraging latent-space representations and wavelet-domain optimization, with a cascaded refinement stage for enhancing fine details. Result: Experiments show that the proposed approach outperforms state-of-the-art denoising methods and generalizes well to diverse and challenging real-world TIR datasets. Conclusion: The proposed diffusion-based TIR image denoising framework demonstrates superior performance and robust zero-shot generalization, making it effective for practical robotic deployment. Abstract: Thermal infrared imaging exhibits considerable potentials for robotic perception tasks, especially in environments with poor visibility or challenging lighting conditions. However, TIR images typically suffer from heavy non-uniform fixed-pattern noise, complicating tasks such as object detection, localization, and mapping. To address this, we propose a diffusion-based TIR image denoising framework leveraging latent-space representations and wavelet-domain optimization. Utilizing a pretrained stable diffusion model, our method fine-tunes the model via a novel loss function combining latent-space and discrete wavelet transform (DWT) / dual-tree complex wavelet transform (DTCWT) losses. Additionally, we implement a cascaded refinement stage to enhance fine details, ensuring high-fidelity denoising results. Experiments on benchmark datasets demonstrate superior performance of our approach compared to state-of-the-art denoising methods. Furthermore, our method exhibits robust zero-shot generalization to diverse and challenging real-world TIR datasets, underscoring its effectiveness for practical robotic deployment.

[75] What is Beneath Misogyny: Misogynous Memes Classification and Explanation

Kushal Kanwar,Dushyant Singh Chauhan,Gopendra Vikram Singh,Asif Ekbal

Main category: cs.CV

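TL;DR: The paper proposes MM-Misogyny, a new multimodal approach for detecting, categorizing, and explaining misogynistic content in memes, and demonstrates its superior performance on a newly built dataset.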

Details

Motivation: Misogynistic content is often hidden in seemingly harmless memes, and its multimodal nature and nuanced manifestations across social contexts make detecting and understanding misogynous memes a research challenge. Method: Text and image modalities are unified into a multimodal context through a cross-attention mechanism, which is then passed to a classifier and a large language model (LLM) for labeling, categorization, and explanation. Result: The model was evaluated on the newly curated WBMS dataset, which collects misogynous memes from cyberspace and groups them into four categories: Kitchen, Leadership, Working, and Shopping. Conclusion: MM-Misogyny not only detects and classifies misogynistic content but also provides a granular understanding of how misogyny operates across domains of life, and it outperforms existing methods. Abstract: Memes are popular in the modern world and are distributed primarily for entertainment. However, harmful ideologies such as misogyny can be propagated through innocent-looking memes. The detection and understanding of why a meme is misogynous is a research challenge due to its multimodal nature (image and text) and its nuanced manifestations across different societal contexts. We introduce a novel multimodal approach, namely MM-Misogyny, to detect, categorize, and explain misogynistic content in memes. MM-Misogyny processes text and image modalities separately and unifies them into a multimodal context through a cross-attention mechanism. The resulting multimodal context is then easily processed for labeling, categorization, and explanation via a classifier and Large Language Model (LLM). The evaluation of the proposed model is performed on a newly curated dataset, What's Beneath Misogynous Stereotyping (WBMS), created by collecting misogynous memes from cyberspace and categorizing them into four categories, namely Kitchen, Leadership, Working, and Shopping. The model not only detects and classifies misogyny, but also provides a granular understanding of how misogyny operates in domains of life. The results demonstrate the superiority of our approach compared to existing methods. The code and dataset are available at https://github.com/kushalkanwarNS/WhatisBeneathMisogyny/tree/main.

[76] StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization

Gopalji Gaur,Mohammadreza Zolfaghari,Thomas Brox

Main category: cs.CV

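TL;DR: StorySync is a training-free method that improves subject consistency across images generated by pre-trained text-to-image diffusion models, using masked cross-image attention sharing and Regional Feature Harmonization.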

Details

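Motivation: Maintaining subject consistency across the scenes of a visual story is a key challenge for text-to-image diffusion models, and existing approaches that rely on fine-tuning or retraining are computationally expensive, time-consuming, and often interfere with the model's pre-existing capabilities. Method: A training-free approach that works with pre-trained diffusion models, using masked cross-image attention sharing to dynamically align subject features across a batch of images and Regional Feature Harmonization to refine visually similar details. Result: Experiments show that the approach generates visually consistent subjects across a variety of scenarios while maintaining the creative abilities of the diffusion model. Conclusion: Subject consistency can be improved without training while preserving the diffusion model's creative abilities. Abstract: Generating a coherent sequence of images that tells a visual story, using text-to-image diffusion models, often faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model's pre-existing capabilities. In this paper, we follow a training-free approach and propose an efficient consistent-subject-generation method. This approach works seamlessly with pre-trained diffusion models by introducing masked cross-image attention sharing to dynamically align subject features across a batch of images, and Regional Feature Harmonization to refine visually similar details for improved subject consistency. Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios while maintaining the creative abilities of the diffusion model.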

[77] Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities

Rafayel Mkrtchyan,Armen Manukyan,Hrant Khachatrian,Theofanis P. Raptis

Main category: cs.CV

TL;DR: The paper proposes a deep learning approach using DINOv2 and RF data to improve building mapping accuracy, outperforming traditional methods.

Details

Motivation: The motivation is to overcome the limitations of conventional mapping techniques like satellite imagery and LiDAR by using a deep learning approach that integrates RF data to enhance mapping accuracy for smart city applications. Method: The method involves a vision transformer-based architecture that processes RF and map modalities together, using aggregated path loss information, and was evaluated using a synthetic dataset from Huawei with metrics like Jaccard index, Hausdorff distance, and Chamfer distance. Result: The model achieved a macro IoU of 65.3%, outperforming erroneous maps baseline (40.1%), an RF-only method (37.3%), and a non-AI fusion baseline (42.2%). Conclusion: The paper concludes that their deep learning-based approach using DINOv2 architecture significantly improves building mapping accuracy compared to existing methods by integrating RF data and open-source maps. Abstract: Environment mapping is an important computing task for a wide range of smart city applications, including autonomous navigation, wireless network operations and extended reality environments. Conventional smart city mapping techniques, such as satellite imagery, LiDAR scans, and manual annotations, often suffer from limitations related to cost, accessibility and accuracy. Open-source mapping platforms have been widely utilized in artificial intelligence applications for environment mapping, serving as a source of ground truth. However, human errors and the evolving nature of real-world environments introduce biases that can negatively impact the performance of neural networks trained on such data. In this paper, we present a deep learning-based approach that integrates the DINOv2 architecture to improve building mapping by combining maps from open-source platforms with radio frequency (RF) data collected from multiple wireless user equipments and base stations. 
Our approach leverages a vision transformer-based architecture to jointly process both RF and map modalities within a unified framework, effectively capturing spatial dependencies and structural priors for enhanced mapping accuracy. For the evaluation purposes, we employ a synthetic dataset co-produced by Huawei. We develop and train a model that leverages only aggregated path loss information to tackle the mapping problem. We measure the results according to three performance metrics which capture different qualities: (i) The Jaccard index, also known as intersection over union (IoU), (ii) the Hausdorff distance, and (iii) the Chamfer distance. Our design achieves a macro IoU of 65.3%, significantly surpassing (i) the erroneous maps baseline, which yields 40.1%, (ii) an RF-only method from the literature, which yields 37.3%, and (iii) a non-AI fusion baseline that we designed which yields 42.2%.
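The three metrics named in this abstract can be illustrated on small sets of building pixels. This is a plain-Python sketch, not the authors' evaluation code. Chamfer distance has several conventions in the literature; the version here (sum of the two mean nearest-neighbour distances, unsquared) is an assumption, and the macro averaging over classes or samples that the paper reports is not shown.

```python
import math

def iou(a, b):
    """Jaccard index of two pixel sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def _nearest(p, pts):
    """Distance from point p to the closest point in pts."""
    return min(math.dist(p, q) for q in pts)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the worst-case nearest-neighbour distance."""
    return max(max(_nearest(p, b) for p in a), max(_nearest(q, a) for q in b))

def chamfer(a, b):
    """Assumed convention: mean nearest-neighbour distance, summed over both directions."""
    return (sum(_nearest(p, b) for p in a) / len(a)
            + sum(_nearest(q, a) for q in b) / len(b))

# Usage: a 2x2 predicted footprint against a ground truth shifted by one pixel.
predicted = {(0, 0), (0, 1), (1, 0), (1, 1)}
truth = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(iou(predicted, truth), hausdorff(predicted, truth), chamfer(predicted, truth))
```

IoU rewards area overlap, Hausdorff penalizes the single worst boundary error, and Chamfer averages boundary error, which is why the three capture different qualities of a predicted map.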

[78] VQ-DeepISC: Vector Quantized-Enabled Digital Semantic Communication with Channel Adaptive Image Transmission

Jianqiao Chen,Tingting Zhu,Huishi Song,Nan Ma,Xiaodong Xu

Main category: cs.CV

TL;DR: This paper proposes VQ-DeepISC, a digital semantic communication system using vector quantization and deep learning, which improves transmission efficiency and reconstruction quality by addressing codebook collapse and channel adaptation issues.

Details

Motivation: The motivation is to enable efficient and robust semantic communication by digitizing semantic features while preserving continuity, context, and resistance to channel degradation. Method: The method involves designing a Swin Transformer backbone for semantic feature extraction, followed by vector quantization modules and an attention mechanism-driven channel adaptation module. It also incorporates distributional regularization using KLD minimization and EMA for training stability. Result: The system achieves superior reconstruction fidelity compared to existing methods by enabling efficient index-based transmission and avoiding codebook collapse through regularization and training stabilization techniques. Conclusion: The proposed VQ-DeepISC system outperforms benchmark methods in reconstruction fidelity by combining deep learning with digital communication techniques. Abstract: Discretization of semantic features enables interoperability between semantic and digital communication systems, showing significant potential for practical applications. The fundamental difficulty in digitizing semantic features stems from the need to preserve continuity and context in inherently analog representations during their compression into discrete symbols while ensuring robustness to channel degradation. In this paper, we propose a vector quantized (VQ)-enabled digital semantic communication system with channel adaptive image transmission, named VQ-DeepISC. Guided by deep joint source-channel coding (DJSCC), we first design a Swin Transformer backbone for hierarchical semantic feature extraction, followed by VQ modules projecting features into discrete latent spaces. Consequently, it enables efficient index-based transmission instead of raw feature transmission. To further optimize this process, we develop an attention mechanism-driven channel adaptation module to dynamically optimize index transmission. 
Secondly, to counteract codebook collapse during training process, we impose a distributional regularization by minimizing the Kullback-Leibler divergence (KLD) between codeword usage frequencies and a uniform prior. Meanwhile, exponential moving average (EMA) is employed to stabilize training and ensure balanced feature coverage during codebook updates. Finally, digital communication is implemented using quadrature phase shift keying (QPSK) modulation alongside orthogonal frequency division multiplexing (OFDM), adhering to the IEEE 802.11a standard. Experimental results demonstrate superior reconstruction fidelity of the proposed system over benchmark methods.
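The regularizer described here can be made concrete with a small numeric sketch. It computes the KL divergence between empirical codeword usage and a uniform prior, plus a generic EMA update. The direction KL(usage || uniform), the hard index counts, and the decay value are assumptions; in actual training the usage estimate would need to be differentiable (for example from soft assignments), which this sketch does not attempt.

```python
import math
from collections import Counter

def usage_kl_to_uniform(indices, codebook_size):
    """KL(p || u) where p is the empirical usage of codewords and u is uniform."""
    counts = Counter(indices)
    total = len(indices)
    kl = 0.0
    for count in counts.values():
        p = count / total
        # p * log(p / (1 / K)); unused codewords contribute zero.
        kl += p * math.log(p * codebook_size)
    return kl

def ema_update(old, new, decay=0.99):
    """Exponential moving average of a codeword vector (hypothetical decay value)."""
    return [decay * o + (1.0 - decay) * n for o, n in zip(old, new)]

# Usage: balanced usage gives zero penalty; full collapse gives log(K).
print(usage_kl_to_uniform([0, 1, 2, 3], codebook_size=4))
print(usage_kl_to_uniform([2, 2, 2, 2], codebook_size=4))
print(ema_update([1.0, 0.0], [3.0, 2.0], decay=0.5))
```

The penalty equals log K minus the entropy of the usage distribution, so minimizing it pushes the encoder toward using all codewords evenly.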

[79] Tobler's First Law in GeoAI: A Spatially Explicit Deep Learning Model for Terrain Feature Detection Under Weak Supervision

Wenwen Li,Chia-Yu Hsu,Maosheng Hu

Main category: cs.CV

TL;DR: This paper introduces a deep learning model for weakly supervised object detection in GeoAI, leveraging spatial principles and attention mechanisms to address data scarcity and improve performance across geospatial applications.

Details

Motivation: The motivation stems from the challenges in GeoAI, such as the lack of training data and neglect of spatial principles, which hinder the integration of AI with geospatial research. Method: The paper develops a spatially explicit deep learning model based on Tobler's first law of geography, incorporates attention maps, and utilizes a multistage training strategy for weakly supervised object detection. Result: The model successfully detects natural and human-made features, including impact craters on Mars, reducing the need for extensive manual effort and demonstrating generalization across planetary surfaces. Conclusion: This paper concludes that the proposed deep learning model significantly advances the theoretical and methodological foundations of GeoAI by enabling weakly supervised object detection that adheres to spatial principles. Abstract: Recent interest in geospatial artificial intelligence (GeoAI) has fostered a wide range of applications using artificial intelligence (AI), especially deep learning, for geospatial problem solving. However, major challenges such as a lack of training data and the neglect of spatial principles and spatial effects in AI model design remain, significantly hindering the in-depth integration of AI with geospatial research. This paper reports our work in developing a deep learning model that enables object detection, particularly of natural features, in a weakly supervised manner. Our work makes three contributions: First, we present a method of object detection using only weak labels. This is achieved by developing a spatially explicit model based on Tobler's first law of geography. Second, we incorporate attention maps into the object detection pipeline and develop a multistage training strategy to improve performance. Third, we apply this model to detect impact craters on Mars, a task that previously required extensive manual effort. 
The model generalizes to both natural and human-made features on the surfaces of Earth and other planets. This research advances the theoretical and methodological foundations of GeoAI.

[80] Closed-Circuit Television Data as an Emergent Data Source for Urban Rail Platform Crowding Estimation

Riccardo Fiorista,Awad Abdelhalim,Anson F. Stewart,Gabriel L. Pincus,Ian Thistle,Jinhua Zhao

Main category: cs.CV

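TL;DR: This study explores the potential of CCTV footage and computer vision for estimating crowding on urban rail platforms, proposes an efficient linear-optimization-based method, and validates its effectiveness for real-time crowding estimation.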

Details

Motivation: Accurately estimating crowding on urban rail platforms is essential for improving operational safety, efficiency, and passenger experience, but conventional crowd-sensing approaches rely on indirect proxy data and have limitations. Method: The study compares three state-of-the-art computer vision approaches (object detection and counting, Vision Transformer-based crowd-level classification, and semantic segmentation) and proposes a linear-optimization-based method to extract counts from segmentation maps while accounting for image object depth and passenger distribution. Result: Tested on a privacy-preserving dataset with more than 600 hours of video, the results show that computer vision approaches provide substantive value for crowd estimation and can deliver accurate real-time estimates. Conclusion: CCTV data combined with computer vision can provide more precise real-time crowding estimates, supporting operational decisions for urban rail platforms. Abstract: Accurately estimating urban rail platform occupancy can enhance transit agencies' ability to make informed operational decisions, thereby improving safety, operational efficiency, and customer experience, particularly in the context of crowding. However, sensing real-time crowding remains challenging and often depends on indirect proxies such as automatic fare collection data or staff observations. Recently, Closed-Circuit Television (CCTV) footage has emerged as a promising data source with the potential to yield accurate, real-time occupancy estimates. The presented study investigates this potential by comparing three state-of-the-art computer vision approaches for extracting crowd-related features from platform CCTV imagery: (a) object detection and counting using YOLOv11, RT-DETRv2, and APGCC; (b) crowd-level classification via a custom-trained Vision Transformer, Crowd-ViT; and (c) semantic segmentation using DeepLabV3. Additionally, we present a novel, highly efficient linear-optimization-based approach to extract counts from the generated segmentation maps while accounting for image object depth and, thus, for passenger dispersion along a platform. Tested on a privacy-preserving dataset created in collaboration with the Washington Metropolitan Area Transit Authority (WMATA) that encompasses more than 600 hours of video material, our results demonstrate that computer vision approaches can provide substantive value for crowd estimation. 
This work demonstrates that CCTV image data, independent of other data sources available to a transit agency, can enable more precise real-time crowding estimation and, eventually, timely operational responses for platform crowding mitigation.

[81] Modular Transformer Architecture for Precision Agriculture Imaging

Brian Gopalan,Nathalia Nascimento,Vishal Monga

Main category: cs.CV

TL;DR: This paper proposes a quality-aware modular deep-learning framework for weed segmentation from drone video in precision agriculture. By analyzing image quality and dynamically routing data to specialized vision transformer models, the system surpasses existing CNN-based methods in segmentation quality and efficiency.

Details

Motivation: The motivation is to address the critical need for efficient and accurate weed segmentation from drone video in precision agriculture, especially considering common image degradation issues like blur and noise. Method: The method involves a quality-aware modular deep-learning framework that analyzes drone images for noise and blur using Mean Absolute Deviation and the Laplacian. Based on the degradation type, data is dynamically routed to one of three vision transformer models: a baseline for clean images, a modified transformer with Fisher Vector encoding for noise reduction, or another with an unrolled Lucy-Robinson decoder to correct blur. Result: The result demonstrates that the system's novel routing strategy allows it to outperform existing CNN-based methods in both segmentation quality and computational efficiency. Conclusion: The paper concludes that the proposed quality-aware modular deep-learning framework outperforms existing CNN-based methods in segmentation quality and computational efficiency, marking a significant advancement in deep-learning applications for agriculture. Abstract: This paper addresses the critical need for efficient and accurate weed segmentation from drone video in precision agriculture. A quality-aware modular deep-learning framework is proposed that addresses common image degradation by analyzing quality conditions, such as blur and noise, and routing inputs through specialized pre-processing and transformer models optimized for each degradation type. The system first analyzes drone images for noise and blur using Mean Absolute Deviation and the Laplacian. Data is then dynamically routed to one of three vision transformer models: a baseline for clean images, a modified transformer with Fisher Vector encoding for noise reduction, or another with an unrolled Lucy-Robinson decoder to correct blur. 
This novel routing strategy allows the system to outperform existing CNN-based methods in both segmentation quality and computational efficiency, demonstrating a significant advancement in deep-learning applications for agriculture.
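The quality-aware routing step can be sketched on a grayscale image stored as a list of rows. The abstract names the Laplacian and Mean Absolute Deviation but not how they are applied, so everything specific here is an assumption: variance of a 4-neighbour Laplacian as the sharpness score, mean absolute deviation from the 3x3 local mean as the noise score, both thresholds, and the three route names.

```python
import statistics

def laplacian_response(img):
    """4-neighbour Laplacian at every interior pixel."""
    h, w = len(img), len(img[0])
    return [img[r - 1][c] + img[r + 1][c] + img[r][c - 1] + img[r][c + 1] - 4 * img[r][c]
            for r in range(1, h - 1) for c in range(1, w - 1)]

def local_mean_abs_deviation(img):
    """Mean absolute deviation of each interior pixel from its 3x3 neighbourhood mean."""
    h, w = len(img), len(img[0])
    deviations = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = [img[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            deviations.append(abs(img[r][c] - statistics.fmean(window)))
    return statistics.fmean(deviations)

def route(img, blur_thr=50.0, noise_thr=50.0):
    """Pick a model name; thresholds are placeholders that would need tuning on real data."""
    if statistics.pvariance(laplacian_response(img)) < blur_thr:
        return "deblur_model"        # little high-frequency content: treat as blurred
    if local_mean_abs_deviation(img) > noise_thr:
        return "denoise_model"       # strong pixel-level fluctuation: treat as noisy
    return "baseline_model"

# Usage on three synthetic 8x8 images.
flat = [[100] * 8 for _ in range(8)]
step_edge = [[0] * 4 + [200] * 4 for _ in range(8)]
checkerboard = [[200 * ((r + c) % 2) for c in range(8)] for r in range(8)]
print(route(flat), route(step_edge), route(checkerboard))
```

Checking blur before noise matters in this sketch, because a heavily blurred frame has low scores on both measures and would otherwise fall through to the baseline model.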

[82] Generating Synthetic Invoices via Layout-Preserving Content Replacement

Bevin V,Ananthakrishnan P V,Ragesh KR,Sanjay M,Vineeth S,Bibin Wilson

Main category: cs.CV

TL;DR: This paper proposes a novel pipeline for generating realistic synthetic invoices and structured data to overcome dataset limitations in automated invoice processing due to privacy and cost constraints.

Details

Motivation: The motivation stems from the dependency of machine learning models on large-scale, diverse datasets for automated invoice processing, which is often hindered by privacy regulations and the high cost of manual annotation. Method: The method involves three steps: (1) using Optical Character Recognition (OCR) to extract text and layout from a source invoice, (2) replacing select data fields with synthetic content generated by a large language model (LLM), and (3) applying an inpainting technique to erase and replace the original text while preserving layout and font characteristics. Result: The result is a pipeline that generates both a visually realistic synthetic invoice image and a perfectly aligned structured JSON file, enabling the creation of large, varied datasets for training document intelligence models. Conclusion: The study concludes that the proposed method can effectively generate high-fidelity, synthetic invoice documents and structured data, providing a scalable solution to amplify small, private datasets for training robust document intelligence models. Abstract: The performance of machine learning models for automated invoice processing is critically dependent on large-scale, diverse datasets. However, the acquisition of such datasets is often constrained by privacy regulations and the high cost of manual annotation. To address this, we present a novel pipeline for generating high-fidelity, synthetic invoice documents and their corresponding structured data. Our method first utilizes Optical Character Recognition (OCR) to extract the text content and precise spatial layout from a source invoice. Select data fields are then replaced with contextually realistic, synthetic content generated by a large language model (LLM). Finally, we employ an inpainting technique to erase the original text from the image and render the new, synthetic text in its place, preserving the exact layout and font characteristics. 
This process yields a pair of outputs: a visually realistic new invoice image and a perfectly aligned structured data file (JSON) reflecting the synthetic content. Our approach provides a scalable and automated solution to amplify small, private datasets, enabling the creation of large, varied corpora for training more robust and accurate document intelligence models.

[83] Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment

Ziheng Jia,Jiaying Qian,Zicheng Zhang,Zijian Chen,Xiongkuo Min

Main category: cs.CV

TL;DR: The paper introduces Refine-IQA, a two-stage reinforcement fine-tuning framework for image quality assessment that improves visual perception and 'think' process supervision, leading to superior performance on IQA tasks.

Details

Motivation: Existing RFT-based IQA methods lack reward supervision for the model's 'think' process and do not explicitly enhance low-level visual perception, limiting their performance. This work addresses these gaps. Method: The framework involves a two-stage approach: (1) building a dataset with multi-task reward functions to enhance visual quality perception, and (2) introducing a probability difference reward strategy for 'think' process supervision. Result: Refine-IQA Series Models achieve outstanding performance on both perception and scoring tasks, with a robust 'think' capability that performs exceptionally on quality interpreting benchmarks. Conclusion: The proposed Refine-IQA framework enhances the model's visual quality perception and 'think' process supervision, achieving robust performance on IQA tasks. Abstract: Reinforcement fine-tuning (RFT) is a proliferating paradigm for LMM training. Analogous to high-level reasoning tasks, RFT is similarly applicable to low-level vision domains, including image quality assessment (IQA). Existing RFT-based IQA methods typically use rule-based output rewards to verify the model's rollouts but provide no reward supervision for the "think" process, leaving its correctness and efficacy uncontrolled. Furthermore, these methods typically fine-tune directly on downstream IQA tasks without explicitly enhancing the model's native low-level visual quality perception, which may constrain its performance upper bound. In response to these gaps, we propose the multi-stage RFT IQA framework (Refine-IQA). In Stage-1, we build the Refine-Perception-20K dataset (with 12 main distortions, 20,907 locally-distorted images, and over 55K RFT samples) and design multi-task reward functions to strengthen the model's visual quality perception. In Stage-2, targeting the quality scoring task, we introduce a probability difference reward involved strategy for "think" process supervision. 
The resulting Refine-IQA Series Models achieve outstanding performance on both perception and scoring tasks. Notably, our paradigm activates a robust "think" (quality interpreting) capability that also attains exceptional results on the corresponding quality interpreting benchmark.

[84] 4D-PreNet: A Unified Preprocessing Framework for 4D-STEM Data Analysis

Mingyu Liu,Zian Mao,Zhu Liu,Haoran Zhang,Jintao Guo,Xiaoya He,Xi Huang,Shufen Chu,Chun Cheng,Jun Ding,Yujun Xie

Main category: cs.CV

TL;DR: This paper introduces 4D-PreNet, an end-to-end deep-learning framework that effectively addresses noise, drift, and distortion in 4D-STEM data, significantly improving preprocessing accuracy and enabling efficient real-time analysis.

Details

Motivation: The motivation stems from the limitations of conventional correction algorithms in addressing pervasive noise, beam center drift, and elliptical distortions in 4D-STEM data preprocessing, which systematically bias quantitative measurements and hinder high-throughput analysis. Method: 4D-PreNet integrates attention-enhanced U-Net and ResNet architectures to simultaneously perform denoising, center correction, and elliptical distortion calibration, trained on large, simulated datasets for generalization to experimental data. Result: The 4D-PreNet pipeline reduces mean squared error by up to 50% during denoising and achieves sub-pixel center localization with average errors below 0.04 pixels, outperforming traditional algorithms in noise suppression and diffraction pattern restoration. Conclusion: The 4D-PreNet pipeline presents a robust and generalizable deep-learning solution for addressing the challenges of noise, beam center drift, and elliptical distortions in high-throughput 4D-STEM data preprocessing, enabling reliable real-time analysis. Abstract: Automated experimentation with real-time data analysis in scanning transmission electron microscopy (STEM) often requires an end-to-end framework. Four-dimensional scanning transmission electron microscopy (4D-STEM) with high-throughput data acquisition has been constrained by a critical bottleneck in data preprocessing. Pervasive noise, beam center drift, and elliptical distortions during high-throughput acquisition inevitably corrupt diffraction patterns, systematically biasing quantitative measurements. Yet, conventional correction algorithms are often material-specific and fail to provide a robust, generalizable solution. In this work, we present 4D-PreNet, an end-to-end deep-learning pipeline that integrates attention-enhanced U-Net and ResNet architectures to simultaneously perform denoising, center correction, and elliptical distortion calibration. 
The network is trained on large, simulated datasets encompassing a wide range of noise levels, drift magnitudes, and distortion types, enabling it to generalize effectively to experimental data acquired under varying conditions. Quantitative evaluations demonstrate that our pipeline reduces mean squared error by up to 50% during denoising and achieves sub-pixel center localization in the center detection task, with average errors below 0.04 pixels. The outputs are bench-marked against traditional algorithms, highlighting improvements in both noise suppression and restoration of diffraction patterns, thereby facilitating high-throughput, reliable 4D-STEM real-time analysis for automated characterization.
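As a rough illustration of the two quantities quoted above, the sketch below computes a relative MSE reduction and a center-localization error in pixels. The toy patterns, the coordinates, and the helper names `mse` and `center_error` are assumptions made for illustration and do not come from the paper.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized 2D patterns (lists of rows)."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def center_error(pred, true):
    """Euclidean distance in pixels between predicted and true beam centers."""
    return math.hypot(pred[0] - true[0], pred[1] - true[1])

# Toy 2x2 "diffraction patterns" (hypothetical values).
clean = [[0.0, 1.0], [1.0, 0.0]]
noisy = [[0.2, 0.8], [1.2, -0.2]]
denoised = [[0.1, 0.9], [1.1, -0.1]]

# Fraction of the noisy-input MSE removed by denoising.
reduction = 1.0 - mse(denoised, clean) / mse(noisy, clean)

# Sub-pixel error for a hypothetical predicted center.
err = center_error((64.03, 63.98), (64.0, 64.0))
```

With these toy numbers the reduction is about 0.75 and the center error is about 0.036 pixels; real evaluations would average both quantities over a test set.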

[85] HPSv3: Towards Wide-Spectrum Human Preference Score

Yuhang Ma,Xiaoshi Wu,Keqiang Sun,Hongsheng Li

Main category: cs.CV

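TL;DR: This work proposes HPSv3 and CoHP, a robust text-to-image evaluation metric and an efficient image refinement method respectively, addressing the limitations of existing evaluation approaches.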

Details

Motivation: Evaluation of text-to-image generation models is constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions, so a new approach aligned with human perception is needed. Method: The authors develop HPSv3, a VLM-based preference model trained with an uncertainty-aware ranking loss, and propose CoHP, an iterative image refinement method that uses HPSv3 to select the best image at each step. Result: HPSv3 shows wide-spectrum image evaluation capability, and CoHP improves image generation quality without extra data. Conclusion: HPSv3 serves as a robust image evaluation metric, and CoHP offers an efficient, human-aligned way to improve image generation quality. The code and dataset are available at the HPSv3 homepage. Abstract: Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions. To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons from state-of-the-art generative models and low to high-quality real-world images. (2) We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking. In addition, we propose Chain-of-Human-Preference (CoHP), an iterative image refinement method that enhances quality without extra data, using HPSv3 to select the best image at each step. Extensive experiments demonstrate that HPSv3 serves as a robust metric for wide-spectrum image evaluation, and CoHP offers an efficient and human-aligned approach to improve image generation quality. The code and dataset are available at the HPSv3 Homepage.
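The select-the-best-at-each-step loop that the abstract attributes to CoHP can be sketched as below. This is a minimal sketch under stated assumptions: `refine`, `propose`, and `score` are hypothetical names, the "images" are plain numbers, and the scorer is a toy stand-in for a preference model such as HPSv3, not its actual interface.

```python
def refine(initial, propose, score, steps=3, n_candidates=4):
    """Iteratively keep the highest-scoring candidate (hypothetical CoHP-style loop).

    propose(current, n) returns n candidate outputs derived from `current`;
    score(x) returns a preference score where higher is better.
    """
    best, best_score = initial, score(initial)
    for _ in range(steps):
        for candidate in propose(best, n_candidates):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
    return best, best_score

# Toy stand-ins: an "image" is a number and the scorer prefers values near 5.
def toy_propose(x, n):
    return [x + 1.0] * n

def toy_score(x):
    return -(x - 5.0) ** 2

best, best_score = refine(0.0, toy_propose, toy_score)
```

Because the current best is only replaced by a strictly better candidate, the score never decreases across steps, and no additional training data is involved.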

[86] Deep learning framework for crater detection and identification on the Moon and Mars

Yihan Ma,Zeyang Yu,Rohitash Chandra

Main category: cs.CV

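TL;DR: This paper applies deep learning models (including CNN, ResNet, and YOLO) to automated crater detection and identification, and proposes a two-stage framework in which the first stage uses a simple classic CNN, ResNet-50, and YOLO for crater identification and the second stage uses YOLO for crater localisation.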

Details

Motivation: Impact craters are among the most prominent geomorphological features on planetary surfaces and are of great significance to planetary science. The rapid recent progress of deep learning models has driven interest in automated crater detection. Method: Models including convolutional neural networks (CNNs), ResNet, and YOLO are used in a two-stage approach for crater identification and localisation. Result: Different types of craters are detected and identified, and a summary report for a selected region is provided based on remote sensing data. Conclusion: YOLO gives the most balanced crater detection performance, while ResNet-50 identifies large craters with high precision. Abstract: Impact craters are among the most prominent geomorphological features on planetary surfaces and are of substantial significance in planetary science research. Their spatial distribution and morphological characteristics provide critical information on planetary surface composition, geological history, and impact processes. In recent years, the rapid advancement of deep learning models has fostered significant interest in automated crater detection. In this paper, we apply advancements in deep learning models for impact crater detection and identification. We use novel models, including Convolutional Neural Networks (CNNs) and variants such as YOLO and ResNet. We present a framework that features a two-stage approach where the first stage features crater identification using simple classic CNN, ResNet-50 and YOLO. In the second stage, our framework employs YOLO-based detection for crater localisation. Therefore, we detect and identify different types of craters and present a summary report with remote sensing data for a selected region. We consider selected regions of Mars and the Moon for crater detection and identification, based on remote sensing data. Our results indicate that YOLO demonstrates the most balanced crater detection performance, while ResNet-50 excels in identifying large craters with high precision.

[87] Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model

Shen Zhu,Yinzhu Jin,Ifrah Zawar,P. Thomas Fletcher

Main category: cs.CV

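TL;DR: This paper proposes a diffusion-model-based generative method that produces realistic shape representations with point correspondences and applies it successfully to medical image analysis tasks.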

Details

Motivation: Traditional statistical shape models have considered point correspondences extensively, whereas current deep learning methods ignore them and focus mainly on unordered point clouds. This study aims to close that gap by generating realistic shape representations with point correspondences. Method: The study proposes a new diffusion model, trained on shape representation data with correspondences derived from OASIS-3, that generates shapes preserving the point correspondences present in the training data. Result: Experiments show that the model generates highly realistic point-based hippocampal shape representations and outperforms existing methods; it also performs well on conditional generation of healthy and Alzheimer's disease (AD) subjects and on counterfactual generation of disease progression. Conclusion: The study develops a diffusion model that generates point-based shape representations with point correspondences and demonstrates its potential in downstream tasks. Abstract: We propose a diffusion model designed to generate point-based shape representations with correspondences. Traditional statistical shape models have considered point correspondences extensively, but current deep learning methods do not take them into account, focusing on unordered point clouds instead. Current deep generative models for point clouds do not address generating shapes with point correspondences between generated shapes. This work aims to formulate a diffusion model that is capable of generating realistic point-based shape representations, which preserve point correspondences that are present in the training data. Using shape representation data with correspondences derived from Open Access Series of Imaging Studies 3 (OASIS-3), we demonstrate that our correspondence-preserving model effectively generates point-based hippocampal shape representations that are highly realistic compared to existing methods. We further demonstrate the applications of our generative model by downstream tasks, such as conditional generation of healthy and AD subjects and predicting morphological changes of disease progression by counterfactual generation.

[88] Policy to Assist Iteratively Local Segmentation: Optimising Modality and Location Selection for Prostate Cancer Localisation

Xiangcen Wu,Shaheer U. Saeed,Yipei Wang,Ester Bonmati Coll,Yipeng Hu

Main category: cs.CV

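TL;DR: This paper proposes a policy-network-based recommender system that suggests the best image modality and region to improve prostate cancer segmentation; it can surpass standard segmentation networks and may help optimise radiologists' workflows.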

Details

Motivation: Radiologists often mix medical image reading strategies, including inspection of individual modalities and local image regions, using information from different locations in different images independently as well as concurrently. This paper aims to optimise that process with a recommender system. Method: A policy network is trained to assist tumour localisation by recommending the optimal imaging modality and the specific sections of interest, while a pre-trained segmentation network mimics radiologist inspection on individual or variable combinations of these modalities and their sections. Result: Experiments show that the approach can surpass standard segmentation networks, especially when challenging pathology is present, and show its potential to improve annotation efficiency and segmentation accuracy. Conclusion: The paper proposes a recommender system that assists machine learning-based segmentation models by suggesting the best image portions and modality so that prostate cancer segmentation performance is maximised, and shows its potential to improve annotation efficiency and segmentation accuracy. Abstract: Radiologists often mix medical image reading strategies, including inspection of individual modalities and local image regions, using information at different locations from different images independently as well as concurrently. In this paper, we propose a recommender system to assist machine learning-based segmentation models, by suggesting appropriate image portions along with the best modality, such that prostate cancer segmentation performance can be maximised. Our approach trains a policy network that assists tumor localisation, by recommending both the optimal imaging modality and the specific sections of interest for review. During training, a pre-trained segmentation network mimics radiologist inspection on individual or variable combinations of these imaging modalities and their sections - selected by the policy network. Taking the locally segmented regions as an input for the next step, this dynamic decision-making process iterates until all cancers are best localised. We validate our method using a data set of 1325 labelled multiparametric MRI images from prostate cancer patients, demonstrating its potential to improve annotation efficiency and segmentation accuracy, especially when challenging pathology is present. Experimental results show that our approach can surpass standard segmentation networks. Perhaps more interestingly, our trained agent independently developed its own optimal strategy, which may or may not be consistent with current radiologist guidelines such as PI-RADS. 
This observation also suggests a promising interactive application, in which the proposed policy networks assist human radiologists.

[89] Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm

Lin Zhang,Zefan Cai,Yufan Zhou,Shentong Mo,Jinhong Lin,Cheng-En Wu,Yibing Wei,Yijing Zhang,Ruiyi Zhang,Wen Xiao,Tong Sun,Junjie Hu,Pedro Morgado

Main category: cs.CV

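TL;DR: This paper proposes an efficient two-stage training method for scaling up audio-synchronized visual animation. It reduces reliance on large amounts of manually curated video and, by leveraging pretrained models with only a small number of additional parameters, learns audio-conditioned generation that generalizes well across diverse open classes.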

Details

Motivation: Existing audio-synchronized visual animation methods rely heavily on expensive manual curation of high-quality, class-specific training videos, which is hard to scale to the diverse audio-video classes of the open world. Method: In stage one, large-scale videos are curated automatically for pretraining, so the model learns diverse but imperfect audio-video alignments. In stage two, the model is fine-tuned on a small number of manually curated high-quality examples. In addition, multi-feature conditioning and window attention give each frame access to rich audio context to strengthen synchronization. Result: The paper introduces the AVSync48 benchmark, with videos from 48 classes, which is 3× more diverse than previous benchmarks. Experiments show that the method reduces reliance on manual curation by more than 10× while generalizing to many open classes. Conclusion: The paper proposes an efficient two-stage training paradigm for scaling up audio-synchronized visual animation. It reduces reliance on expensive manually curated high-quality training videos and, by leveraging a pretrained text-to-video generator and audio encoders, learns audio-conditioned generation with only a small number of trainable parameters. Abstract: Recent advances in audio-synchronized visual animation enable control of video content using audios from specific classes. However, existing methods rely heavily on expensive manual curation of high-quality, class-specific training videos, posing challenges to scaling up to diverse audio-video classes in the open world. In this work, we propose an efficient two-stage training paradigm to scale up audio-synchronized visual animation using abundant but noisy videos. In stage one, we automatically curate large-scale videos for pretraining, allowing the model to learn diverse but imperfect audio-video alignments. In stage two, we finetune the model on manually curated high-quality examples, but only at a small scale, significantly reducing the required human effort. We further enhance synchronization by allowing each frame to access rich audio context via multi-feature conditioning and window attention. To efficiently train the model, we leverage a pretrained text-to-video generator and audio encoders, introducing only 1.9% additional trainable parameters to learn audio-conditioning capability without compromising the generator's prior knowledge. For evaluation, we introduce AVSync48, a benchmark with videos from 48 classes, which is 3× more diverse than previous benchmarks. Extensive experiments show that our method significantly reduces reliance on manual curation by over 10×, while generalizing to many open classes.

[90] RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification

Mamadou Keita,Wassim Hamidouche,Hessen Bougueffa Eutamene,Abdelmalik Taleb-Ahmed,Abdenour Hadid

Main category: cs.CV

TL;DR: RAVID is a new framework for AI-generated image detection based on visual retrieval-augmented generation, with strong accuracy and robustness.

Details

Motivation: Existing AI-generated image detection methods typically rely on low-level artifacts and model-specific features, which limits generalization and robustness. RAVID aims to address this by introducing visual retrieval-augmented generation (RAG) to improve detection performance. Method: RAVID uses a fine-tuned CLIP image encoder (RAVID CLIP) with category-related prompts to generate image embeddings and dynamically retrieves the most relevant images from a database. The retrieved images are then fed together with the query image into a vision-language model (VLM), enriching the input and improving detection accuracy. Result: Experiments show that RAVID achieves an average accuracy of 93.85% on the UniversalFakeDetect benchmark, reaching state-of-the-art performance. Under degradations such as Gaussian blur and JPEG compression, RAVID attains an average accuracy of 80.27%, well above the 63.44% of the existing model C2P-CLIP. Conclusion: RAVID shows strong detection performance across generative models and image degradation conditions, demonstrating the effectiveness of visual retrieval-augmented generation for AI-generated image detection. Abstract: In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or OpenFlamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in terms of robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. 
Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, demonstrating consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.
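The retrieval step described in the abstract (embed the query, fetch the most relevant database images, pass them on with the query) can be sketched as a nearest-neighbour search. This is only a sketch: cosine similarity is a common choice for embedding retrieval but is an assumption here, and the database layout, labels, and function names are hypothetical rather than RAVID's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_top_k(query_embedding, database, k=2):
    """Return the k database entries whose embeddings are most similar to the query."""
    ranked = sorted(
        database,
        key=lambda item: cosine(query_embedding, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical 2-D embeddings; a real encoder produces much higher dimensions.
database = [
    {"id": "a", "label": "real", "embedding": [0.9, 0.1]},
    {"id": "b", "label": "fake", "embedding": [0.0, 1.0]},
    {"id": "c", "label": "fake", "embedding": [0.7, 0.7]},
]
neighbours = retrieve_top_k([1.0, 0.0], database, k=2)
retrieved_ids = [item["id"] for item in neighbours]
```

In a pipeline of this kind, the retrieved entries and their labels would be combined with the query image as context for the VLM; that fusion step is not shown.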

[91] Investigating the Impact of Large-Scale Pre-training on Nutritional Content Estimation from 2D Images

Michele Andrade,Guilherme A. L. Silva,Valéria Santos,Gladston Moreira,Eduardo Luz

Main category: cs.CV

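TL;DR: This paper studies how large-scale pre-training datasets affect deep learning models for nutritional estimation from 2D images alone, and finds that the characteristics of the pre-training dataset are critical to transfer learning performance.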

Details

Motivation: Estimating nutritional content from food images is challenging, especially when relying only on 2D images, and current state-of-the-art methods depend on proprietary datasets for large-scale pre-training, which limits reproducibility in this domain. Method: Vision Transformer (ViT) models pre-trained on two large public datasets, ImageNet and COYO, are fine-tuned and evaluated, and compared against baseline CNN models (InceptionV2 and ResNet-50) and a state-of-the-art method pre-trained on the proprietary JFT-300M dataset. Result: Evaluation with Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAE%) shows that models pre-trained on JFT-300M significantly outperform those pre-trained on public datasets. Unexpectedly, the model pre-trained on the massive COYO dataset performs worse on this regression task than the model pre-trained on ImageNet. Conclusion: The analysis provides quantitative evidence for the critical role of pre-training dataset characteristics, including scale, domain relevance, and curation quality, in transfer learning for 2D nutritional estimation. Abstract: Estimating the nutritional content of food from images is a critical task with significant implications for health and dietary monitoring. This is challenging, especially when relying solely on 2D images, due to the variability in food presentation, lighting, and the inherent difficulty in inferring volume and mass without depth information. Furthermore, reproducibility in this domain is hampered by the reliance of state-of-the-art methods on proprietary datasets for large-scale pre-training. In this paper, we investigate the impact of large-scale pre-training datasets on the performance of deep learning models for nutritional estimation using only 2D images. We fine-tune and evaluate Vision Transformer (ViT) models pre-trained on two large public datasets, ImageNet and COYO, comparing their performance against baseline CNN models (InceptionV2 and ResNet-50) and a state-of-the-art method pre-trained on the proprietary JFT-300M dataset. We conduct extensive experiments on the Nutrition5k dataset, a large-scale collection of real-world food plates with high-precision nutritional annotations. Our evaluation using Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAE%) reveals that models pre-trained on JFT-300M significantly outperform those pre-trained on public datasets. Unexpectedly, the model pre-trained on the massive COYO dataset performs worse than the model pre-trained on ImageNet for this specific regression task, refuting our initial hypothesis. 
Our analysis provides quantitative evidence highlighting the critical role of pre-training dataset characteristics, including scale, domain relevance, and curation quality, for effective transfer learning in 2D nutritional estimation.
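For readers unfamiliar with the two metrics, the sketch below computes MAE and a percentage variant on toy calorie values. The abstract does not define MAE% precisely; the sketch assumes one common reading, MAE divided by the mean ground-truth value, and the numbers and function names are hypothetical.

```python
def mae(pred, true):
    """Mean absolute error between predictions and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mae_percent(pred, true):
    """MAE as a percentage of the mean ground-truth value (assumed definition)."""
    mean_true = sum(true) / len(true)
    return 100.0 * mae(pred, true) / mean_true

# Hypothetical calorie values for three plates.
true_kcal = [200.0, 400.0, 600.0]
pred_kcal = [220.0, 380.0, 630.0]

error = mae(pred_kcal, true_kcal)              # (20 + 20 + 30) / 3
error_pct = mae_percent(pred_kcal, true_kcal)  # error relative to the 400 kcal mean
```

Normalising by the mean target makes errors comparable across fields with different units, such as calories and grams of protein.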

[92] JanusNet: Hierarchical Slice-Block Shuffle and Displacement for Semi-Supervised 3D Multi-Organ Segmentation

Zheng Zhang,Tianzhuzi Tan,Guanchun Yin,Bo Zhang,Xiuzhuang Zhou

Main category: cs.CV

TL;DR: This paper proposes JanusNet, a novel data augmentation framework for 3D medical images that maintains anatomical continuity while focusing on difficult-to-segment regions, leading to significant improvements in segmentation accuracy.

Details

Motivation: Traditional data augmentation disrupts anatomical continuity in 3D medical images, leading to structural inconsistencies and insufficient training in challenging regions like small organs. Method: Slice-Block Shuffle and Confidence-Guided Displacement steps are introduced to preserve anatomical context and amplify signals from difficult areas. Result: JanusNet achieves a 4% DSC gain on the Synapse dataset with only 20% labeled data, demonstrating superior performance over existing methods. Conclusion: JanusNet, a data augmentation framework for 3D medical data, outperforms state-of-the-art methods by globally modeling anatomical continuity and locally focusing on hard-to-segment regions. Abstract: Limited by the scarcity of training samples and annotations, weakly supervised medical image segmentation often employs data augmentation to increase data diversity, while randomly mixing volumetric blocks has demonstrated strong performance. However, this approach disrupts the inherent anatomical continuity of 3D medical images along orthogonal axes, leading to severe structural inconsistencies and insufficient training in challenging regions, such as small organs. To better comply with and utilize human anatomical information, we propose JanusNet, a data augmentation framework for 3D medical data that globally models anatomical continuity while locally focusing on hard-to-segment regions. Specifically, our Slice-Block Shuffle step performs aligned shuffling of same-index slice blocks across volumes along a random axis, while preserving the anatomical context on planes perpendicular to the perturbation axis. Concurrently, the Confidence-Guided Displacement step uses prediction reliability to replace blocks within each slice, amplifying signals from difficult areas. This dual-stage, axis-aligned framework is plug-and-play, requiring minimal code changes for most teacher-student schemes. 
Extensive experiments on the Synapse and AMOS datasets demonstrate that JanusNet significantly surpasses state-of-the-art methods, achieving, for instance, a 4% DSC gain on the Synapse dataset with only 20% labeled data.
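One way to read the Slice-Block Shuffle step is sketched below: volumes are cut into blocks of consecutive slices along one axis, and blocks with the same index are permuted across volumes, so every slice stays at its original depth. This is a simplified sketch under assumptions: the axis is fixed instead of random, volumes are plain lists of slices, the function name and `block_size` parameter are hypothetical, and the Confidence-Guided Displacement step is not covered.

```python
import random

def slice_block_shuffle(volumes, block_size, rng):
    """Swap same-index slice blocks across volumes along the first axis.

    volumes: list of volumes, each a list of slices with identical depth.
    Each output slice keeps its depth index, so the content of every
    slice plane (perpendicular to the shuffle axis) is left intact.
    """
    depth = len(volumes[0])
    shuffled = [list(v) for v in volumes]
    for start in range(0, depth, block_size):
        order = list(range(len(volumes)))
        rng.shuffle(order)  # which source volume feeds each destination volume
        for dst, src in enumerate(order):
            shuffled[dst][start:start + block_size] = volumes[src][start:start + block_size]
    return shuffled

# Toy volumes: slices are labelled by source volume and depth index.
vol_a = ["A0", "A1", "A2", "A3"]
vol_b = ["B0", "B1", "B2", "B3"]
vol_c = ["C0", "C1", "C2", "C3"]
mixed = slice_block_shuffle([vol_a, vol_b, vol_c], block_size=2, rng=random.Random(0))
```

In a semi-supervised setup, the same permutation would also need to be applied to the labels or pseudo-labels so that images and targets stay aligned.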

[93] CAD-Judge: Toward Efficient Morphological Grading and Verification for Text-to-CAD Generation

Zheyuan Zhou,Jiayi Han,Liang Du,Naiyu Fang,Lemiao Qiu,Shuyou Zhang

Main category: cs.CV

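TL;DR: This paper proposes CAD-Judge, a novel verifiable reward system for efficient CAD preference grading and grammatical validation, addressing problems in text-to-CAD systems.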

Details

Motivation: CAD models are widely used in industrial design, simulation, and manufacturing, but traditional CAD workflows are complex and have a high entry barrier. In addition, rendering CAD models can be slow, and using VLMs to review CAD models can be expensive and may lead to reward hacking. Method: The paper proposes CAD-Judge, which includes the Compiler-as-a-Judge Module (CJM) as a fast, direct reward signal and optimizes model alignment by maximizing generative utility through prospect theory. At test time, it introduces a simple yet effective agentic CAD generation approach and the Compiler-as-a-Review Module (CRM) to improve the robustness of text-to-CAD. Result: Extensive experiments on challenging CAD datasets show that the method achieves state-of-the-art performance while remaining highly efficient. Conclusion: CAD-Judge provides an efficient and effective solution for text-to-CAD systems, addressing key problems in CAD model generation and verification. Abstract: Computer-Aided Design (CAD) models are widely used across industrial design, simulation, and manufacturing processes. Text-to-CAD systems aim to generate editable, general-purpose CAD models from textual descriptions, significantly reducing the complexity and entry barrier associated with traditional CAD workflows. However, rendering CAD models can be slow, and deploying VLMs to review CAD models can be expensive and may introduce reward hacking that degrades the systems. To address these challenges, we propose CAD-Judge, a novel, verifiable reward system for efficient and effective CAD preference grading and grammatical validation. We adopt the Compiler-as-a-Judge Module (CJM) as a fast, direct reward signal, optimizing model alignment by maximizing generative utility through prospect theory. To further improve the robustness of Text-to-CAD in the testing phase, we introduce a simple yet effective agentic CAD generation approach and adopt the Compiler-as-a-Review Module (CRM), which efficiently verifies the generated CAD models, enabling the system to refine them accordingly. Extensive experiments on challenging CAD datasets demonstrate that our method achieves state-of-the-art performance while maintaining superior efficiency.

[94] $\text{S}^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

Weilun Feng,Haotong Qin,Chuanguang Yang,Xiangqi Li,Han Yang,Yuqi Li,Zhulin An,Libo Huang,Michele Magno,Yongjun Xu

Main category: cs.CV

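TL;DR: This paper proposes S²Q-VDiT, a post-training quantization framework for video diffusion models that uses salient data selection and sparse token distillation to substantially compress the model and accelerate inference while maintaining lossless performance.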

Details

Motivation: Diffusion transformers are widely used in video generation models, but their very large parameter counts incur significant computational cost. Quantization can reduce memory usage and accelerate inference, but the joint modeling of spatial and temporal information in video diffusion models produces extremely long token sequences, which introduces high calibration variance and learning challenges. Method: The paper proposes S²Q-VDiT, a post-training quantization framework that uses salient data and sparse token distillation to improve the quantization of video diffusion models. Result: S²Q-VDiT builds high-quality calibration datasets through Hessian-aware Salient Data Selection and uses Attention-guided Sparse Token Distillation to emphasize tokens that are more influential to the model's output, improving quantization performance. Conclusion: Under W4A6 quantization, S²Q-VDiT achieves lossless performance while delivering 3.9× model compression and 1.3× inference acceleration. Abstract: Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose $\text{S}^2$Q-VDiT, a post-training quantization framework for V-DMs that leverages Salient data and Sparse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce Hessian-aware Salient Data Selection, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose Attention-guided Sparse Token Distillation, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, $\text{S}^2$Q-VDiT achieves lossless performance while delivering 3.9× model compression and 1.3× inference acceleration. Code will be available at https://github.com/wlfeng0509/s2q-vdit.
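W4A6 denotes 4-bit weights and 6-bit activations. As background for what low-bit quantization does to a tensor, the sketch below applies generic symmetric uniform quantization to a few weights. It illustrates the general idea only and is not the paper's calibration or distillation procedure; the function name and the toy values are assumptions.

```python
def quantize_symmetric(values, bits):
    """Symmetric uniform quantization of a list of floats to signed integers.

    Returns the integer codes, the scale, and the dequantized values.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit signed codes
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    dequantized = [c * scale for c in codes]
    return codes, scale, dequantized

# Hypothetical weight values.
weights = [0.7, -0.3, 0.12, 0.0]
codes, scale, approx = quantize_symmetric(weights, bits=4)

# Worst-case rounding error is bounded by half a quantization step.
max_error = max(abs(w - a) for w, a in zip(weights, approx))
```

Storing 4-bit codes instead of 16-bit floats would give at most a 4× reduction for the weights themselves, assuming a 16-bit baseline (the abstract does not state the baseline); scales and any layers kept at higher precision reduce that figure.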

[95] Can Large Multimodal Models Actively Recognize Faulty Inputs? A Systematic Evaluation Framework of Their Input Scrutiny Ability

Haiqi Yang,Jinzhe Li,Gengxu Li,Yi Chang,Yuan Wu

Main category: cs.CV

TL;DR: This paper introduces the ISEval framework to evaluate LMMs' ability to detect flawed inputs, revealing that most models struggle without explicit prompts and perform unevenly across error types, highlighting the need for improved proactive input verification.

Details

Motivation: Recent studies have shown that large language models tend to passively accept defective inputs, leading to flawed reasoning. However, it remains unknown whether LMMs can actively detect and scrutinize erroneous inputs. This research addresses this gap. Method: The researchers introduced the Input Scrutiny Ability Evaluation Framework (ISEval), which includes seven categories of flawed premises and three evaluation metrics. They conducted extensive evaluations on ten advanced LMMs to analyze their performance in detecting errors. Result: Key findings indicate that most models struggle with detecting flawed textual premises without guidance, perform better in identifying logical fallacies than surface-level linguistic errors, and exhibit varying degrees of modality trust. Conclusion: The study concludes that most LMMs struggle to detect flawed textual premises without explicit guidance, highlighting a reliance on prompts for identifying premise errors. The research emphasizes the urgent need to enhance the proactive verification abilities of LMMs regarding input validity. Abstract: Large Multimodal Models (LMMs) have witnessed remarkable growth, showcasing formidable capabilities in handling intricate multimodal tasks with exceptional performance. Recent research has underscored the inclination of large language models to passively accept defective inputs, often resulting in futile reasoning on invalid prompts. However, the same critical question of whether LMMs can actively detect and scrutinize erroneous inputs still remains unexplored. To address this gap, we introduce the Input Scrutiny Ability Evaluation Framework (ISEval), which encompasses seven categories of flawed premises and three evaluation metrics. Our extensive evaluation of ten advanced LMMs has identified key findings. Most models struggle to actively detect flawed textual premises without guidance, which reflects a strong reliance on explicit prompts for premise error identification. 
Error type affects performance: models excel at identifying logical fallacies but struggle with surface-level linguistic errors and certain conditional flaws. Modality trust varies: Gemini 2.5 Pro and Claude Sonnet 4 balance visual and textual information, while aya-vision-8b over-relies on text when the two conflict. These insights underscore the urgent need to enhance LMMs' proactive verification of input validity and offer new insights into mitigating the problem. The code is available at https://github.com/MLGroupJLU/LMM_ISEval.

[96] Prototype-Driven Structure Synergy Network for Remote Sensing Images Segmentation

Junyi Wang,Jinjiang Li,Guodong Fan,Yakun Ju,Xiang Fang,Alex C. Kot

Main category: cs.CV

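TL;DR: This paper proposes PDSSNet, a new semantic segmentation method for remote sensing images that addresses high intra-class variance and high inter-class similarity through three key modules and shows superior performance in experiments.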

Details

Motivation: To overcome the shortcomings of traditional methods in unifying class representations and distinguishing similar features, as well as the limitations of emerging class-guided methods, such as coarse class prototype representations and neglect of target structural information. Method: The paper proposes a Prototype-Driven Structure Synergy Network (PDSSNet) comprising an Adaptive Prototype Extraction Module (APEM), a Semantic-Structure Coordination Module (SSCM), and a Channel Similarity Adjustment Module (CSAM). Result: Experiments show that PDSSNet surpasses state-of-the-art methods on semantic segmentation of remote sensing images. Conclusion: PDSSNet effectively addresses high intra-class variance and high inter-class similarity in semantic segmentation of remote sensing images and outperforms existing state-of-the-art methods. Abstract: In the semantic segmentation of remote sensing images, acquiring complete ground objects is critical for achieving precise analysis. However, this task is severely hindered by two major challenges: high intra-class variance and high inter-class similarity. Traditional methods often yield incomplete segmentation results due to their inability to effectively unify class representations and distinguish between similar features. Even emerging class-guided approaches are limited by coarse class prototype representations and a neglect of target structural information. Therefore, this paper proposes a Prototype-Driven Structure Synergy Network (PDSSNet). The design of this network is based on a core concept: a complete ground object is jointly defined by its invariant class semantics and its variant spatial structure. To implement this, we have designed three key modules. First, the Adaptive Prototype Extraction Module (APEM) ensures semantic accuracy from the source by encoding the ground truth to extract unbiased class prototypes. Subsequently, the designed Semantic-Structure Coordination Module (SSCM) follows a hierarchical semantics-first, structure-second principle. This involves first establishing a global semantic cognition, then leveraging structural information to constrain and refine the semantic representation, thereby ensuring the integrity of class information. Finally, the Channel Similarity Adjustment Module (CSAM) employs a dynamic step-size adjustment mechanism to focus on discriminative features between classes. Extensive experiments demonstrate that PDSSNet outperforms state-of-the-art methods. The source code is available at https://github.com/wangjunyi-1/PDSSNet.

[97] Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval

Yifan Wang,Tao Wang,Chenwei Tang,Caiyang Yu,Zhengqing Zang,Mengmi Zhang,Shudong Huang,Jiancheng Lv

Main category: cs.CV

TL;DR: DCAR is a novel framework for image-text retrieval that dynamically adjusts prompts to better capture fine-grained details, outperforming existing methods on a new challenging dataset.

Details

Motivation: Prompt learning faces challenges in the Image-Text Retrieval task, particularly in distinguishing fine-grained attributes and similar subcategories. Method: Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR), which dynamically adjusts prompt vectors from both semantic and visual dimensions to improve image-text matching. Result: Extensive experiments show that DCAR outperforms existing baselines and achieves superior performance on the newly constructed FDRD dataset. Conclusion: DCAR achieves state-of-the-art performance on the ITR task, proving the effectiveness of dual-prompt learning in enhancing fine-grained representation learning. Abstract: Recently, prompt learning has demonstrated remarkable success in adapting pre-trained Vision-Language Models (VLMs) to various downstream tasks such as image classification. However, its application to the downstream Image-Text Retrieval (ITR) task is more challenging. We find that the challenge lies in discriminating both fine-grained attributes and similar subcategories of the downstream data. To address this challenge, we propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR), a novel dual-prompt learning framework to achieve precise image-text matching. The framework dynamically adjusts prompt vectors from both semantic and visual dimensions to improve the performance of CLIP on the downstream ITR task. Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning. Specifically, (1) at the attribute level, it dynamically updates the weights of attribute descriptions based on text-image mutual information correlation; (2) at the category level, it introduces negative samples from multiple perspectives with category-matching weighting to learn subcategory distinctions. 
To validate our method, we construct the Fine-class Described Retrieval Dataset (FDRD), which serves as a challenging benchmark for ITR in downstream data domains. It covers over 1,500 downstream fine categories and 230,000 image-caption pairs with detailed attribute annotations. Extensive experiments on FDRD demonstrate that DCAR achieves state-of-the-art performance over existing baselines.

[98] Radar-Based NLoS Pedestrian Localization for Darting-Out Scenarios Near Parked Vehicles with Camera-Assisted Point Cloud Interpretation

Hee-Yeun Kim,Byeonggyu Park,Byonghyok Choi,Hansang Cho,Byungkwan Kim,Soomok Lee,Mingu Jeon,Seung-Woo Seo,Seong-Woo Kim

Main category: cs.CV

TL;DR: This paper proposes an NLoS pedestrian localization framework using monocular camera and 2D radar PCD to enhance early pedestrian detection in urban areas with parked vehicles.

Details

Motivation: The motivation is to address the road safety challenges caused by NLoS blind spots due to parked vehicles in urban environments. Method: The method involves detecting parked vehicles through image segmentation, estimating depth for spatial characteristics, and refining the information using 2D radar PCD. Result: Experimental evaluations showed that the approach enhances early pedestrian detection and contributes to improved road safety. Conclusion: The proposed NLoS pedestrian localization framework successfully integrates monocular camera and 2D radar PCD data to improve early pedestrian detection in urban environments. Abstract: The presence of Non-Line-of-Sight (NLoS) blind spots resulting from roadside parking in urban environments poses a significant challenge to road safety, particularly due to the sudden emergence of pedestrians. mmWave technology leverages diffraction and reflection to observe NLoS regions, and recent studies have demonstrated its potential for detecting obscured objects. However, existing approaches predominantly rely on predefined spatial information or assume simple wall reflections, thereby limiting their generalizability and practical applicability. A particular challenge arises in scenarios where pedestrians suddenly appear from between parked vehicles, as these parked vehicles act as temporary spatial obstructions. Furthermore, since parked vehicles are dynamic and may relocate over time, spatial information obtained from satellite maps or other predefined sources may not accurately reflect real-time road conditions, leading to erroneous sensor interpretations. To address this limitation, we propose an NLoS pedestrian localization framework that integrates monocular camera images with 2D radar point cloud (PCD) data. 
The proposed method initially detects parked vehicles through image segmentation, estimates depth to infer approximate spatial characteristics, and subsequently refines this information using 2D radar PCD to achieve precise spatial inference. Experimental evaluations conducted in real-world urban road environments demonstrate that the proposed approach enhances early pedestrian detection and contributes to improved road safety. Supplementary materials are available at https://hiyeun.github.io/NLoS/.

[99] CORE-ReID V2: Advancing the Domain Adaptation for Object Re-Identification with Optimized Training and Ensemble Fusion

Trinh Quoc Nguyen,Oky Dicky Ardiansyah Prima,Syahid Al Irfan,Hindriyanto Dwi Purnomo,Radius Tanone

Main category: cs.CV

TL;DR: CORE-ReID V2 improves unsupervised domain adaptation for ReID tasks using CycleGAN and an advanced ensemble fusion mechanism, achieving state-of-the-art results.

Details

Motivation: The motivation is to improve unsupervised domain adaptation for ReID tasks across different object categories, enabling better generalization and efficiency. Method: CORE-ReID V2 uses CycleGAN during pre-training to generate diverse data and an ensemble fusion mechanism (ECAB and SECAB) during fine-tuning to enhance feature representation and reduce pseudo-label ambiguity. Result: Experimental results show that CORE-ReID V2 achieves top performance on UDA Person ReID and Vehicle ReID datasets in terms of mAP and Rank-k Accuracy (Top-1, Top-5, Top-10). Conclusion: The proposed CORE-ReID V2 framework successfully addresses UDA challenges in Person ReID, Vehicle ReID, and Object ReID, outperforming state-of-the-art methods while supporting lightweight backbones for scalability and efficiency. Abstract: This study presents CORE-ReID V2, an enhanced framework building upon CORE-ReID. The new framework extends its predecessor by addressing Unsupervised Domain Adaptation (UDA) challenges in Person ReID and Vehicle ReID, with further applicability to Object ReID. During pre-training, CycleGAN is employed to synthesize diverse data, bridging image characteristic gaps across different domains. In the fine-tuning, an advanced ensemble fusion mechanism, consisting of the Efficient Channel Attention Block (ECAB) and the Simplified Efficient Channel Attention Block (SECAB), enhances both local and global feature representations while reducing ambiguity in pseudo-labels for target samples. Experimental results on widely used UDA Person ReID and Vehicle ReID datasets demonstrate that the proposed framework outperforms state-of-the-art methods, achieving top performance in Mean Average Precision (mAP) and Rank-k Accuracy (Top-1, Top-5, Top-10). Moreover, the framework supports lightweight backbones such as ResNet18 and ResNet34, ensuring both scalability and efficiency. 
Our work not only pushes the boundaries of UDA-based Object ReID but also provides a solid foundation for further research and advancements in this domain. Our code and models are available at https://github.com/TrinhQuocNguyen/CORE-ReID-V2.
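The abstract reports results in Mean Average Precision (mAP) and Rank-k Accuracy. As a rough illustration of what those two metrics measure for a single query, here is a minimal sketch. The function names and the toy ranking are hypothetical, and the sketch omits steps that real ReID evaluation protocols usually apply, such as excluding gallery images from the same camera as the query.

```python
from typing import List


def rank_k_hit(ranked_ids: List[int], query_id: int, k: int) -> bool:
    """Return True if any of the top-k ranked gallery items matches the query identity."""
    return query_id in ranked_ids[:k]


def average_precision(ranked_ids: List[int], query_id: int) -> float:
    """Average precision for one query over a ranked gallery list.

    Precision is sampled at every position where a correct match appears
    and then averaged over the number of correct matches.
    """
    hits = 0
    precision_sum = 0.0
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


# Toy gallery ranking (identity labels sorted by descending similarity to the query).
ranked = [7, 3, 3, 5, 3]
print(rank_k_hit(ranked, query_id=3, k=1))   # is the top-1 result correct?
print(rank_k_hit(ranked, query_id=3, k=5))   # is any of the top-5 results correct?
print(average_precision(ranked, query_id=3))
```

Rank-k Accuracy is the fraction of queries for which `rank_k_hit` is true, and mAP is the mean of `average_precision` over all queries.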

[100] SPJFNet: Self-Mining Prior-Guided Joint Frequency Enhancement for Ultra-Efficient Dark Image Restoration

Tongshun Zhang,Pingling Liu,Zijian Zhang,Qiuzhan Zhou

Main category: cs.CV

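TL;DR: This paper proposes SPJFNet, an efficient dark image restoration network that uses a self-mining guidance module and a dual-frequency optimization framework to improve performance while substantially increasing computational efficiency.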

Details

Motivation: Current dark image restoration methods face efficiency bottlenecks, including reliance on external priors, redundant operations in multi-stage enhancement pipelines, and the excessive global computational demands of frequency-domain methods. Method: The paper proposes a Self-Mining Guidance Module (SMGM) and a Dual-Frequency Guidance Framework (DFGF), combining joint frequency enhancement in the wavelet and Fourier domains to reduce redundant operations and model complexity. Result: SPJFNet surpasses existing state-of-the-art methods on multiple benchmarks while significantly reducing model complexity and computational overhead. Conclusion: SPJFNet is an efficient dark image restoration method that improves both performance and efficiency by eliminating reliance on external priors and simplifying the frequency-domain processing pipeline. Abstract: Current dark image restoration methods suffer from severe efficiency bottlenecks, primarily stemming from: (1) computational burden and error correction costs associated with reliance on external priors (manual or cross-modal); (2) redundant operations in complex multi-stage enhancement pipelines; and (3) indiscriminate processing across frequency components in frequency-domain methods, leading to excessive global computational demands. To address these challenges, we propose an Efficient Self-Mining Prior-Guided Joint Frequency Enhancement Network (SPJFNet). Specifically, we first introduce a Self-Mining Guidance Module (SMGM) that generates lightweight endogenous guidance directly from the network, eliminating dependence on external priors and thereby bypassing error correction overhead while improving inference speed. Second, through meticulous analysis of different frequency domain characteristics, we reconstruct and compress multi-level operation chains into a single efficient operation via lossless wavelet decomposition and joint Fourier-based advantageous frequency enhancement, significantly reducing parameters. Building upon this foundation, we propose a Dual-Frequency Guidance Framework (DFGF) that strategically deploys specialized high/low frequency branches (wavelet-domain high-frequency enhancement and Fourier-domain low-frequency restoration), decoupling frequency processing to substantially reduce computational complexity.
Rigorous evaluation across multiple benchmarks demonstrates that SPJFNet not only surpasses state-of-the-art performance but also achieves significant efficiency improvements, substantially reducing model complexity and computational overhead. Code is available at https://github.com/bywlzts/SPJFNet.
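The abstract relies on a lossless wavelet decomposition that separates low-frequency from high-frequency content. The abstract does not say which wavelet is used, so the sketch below assumes a one-level orthonormal Haar transform on a 1D signal, chosen only because it is the simplest case. The function names are hypothetical. It shows that the split into a low band and a high band can be inverted exactly, which is what "lossless" means here.

```python
import math
from typing import List, Tuple


def haar_forward(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One-level orthonormal Haar transform: returns (low band, high band)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    # Low band: scaled pairwise sums (smooth content).
    low = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    # High band: scaled pairwise differences (detail content).
    high = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return low, high


def haar_inverse(low: List[float], high: List[float]) -> List[float]:
    """Invert haar_forward, recovering the original samples."""
    scale = math.sqrt(2.0)
    out: List[float] = []
    for l, h in zip(low, high):
        out.append((l + h) / scale)
        out.append((l - h) / scale)
    return out


signal = [4.0, 2.0, 6.0, 6.0]
low, high = haar_forward(signal)
restored = haar_inverse(low, high)
print(low, high, restored)
```

In a two-branch design like the one described, the high band and the low band could then be processed by separate modules before the inverse transform recombines them.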

[101] VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning

Yuheng Ji,Yipu Wang,Yuyang Liu,Xiaoshuai Hao,Yue Liu,Yuting Zhao,Huaihai Lyu,Xiaolong Zheng

Main category: cs.CV

TL;DR: VisualTrans is a new benchmark for visual transformation reasoning that uses real-world scenarios to evaluate reasoning capabilities, revealing weaknesses in current models' dynamic reasoning and temporal modeling.

Details

Motivation: Existing benchmarks for visual transformation reasoning suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, which limits their practical use. VisualTrans aims to overcome these challenges. Method: VisualTrans uses a scalable data construction pipeline that integrates task selection, image pair extraction, metadata annotation with large multimodal models, and structured question generation. It includes human verification to ensure quality and interpretability. Result: The benchmark includes 12 manipulation tasks, evaluates three reasoning dimensions through 6 subtask types, and contains 472 question-answer pairs. Evaluations show that current models perform well in static spatial tasks but struggle with dynamic, multi-step reasoning. Conclusion: VisualTrans is a first-person manipulation video-based benchmark that addresses the limitations of existing benchmarks for visual transformation reasoning, revealing the weaknesses of current vision-language models in dynamic reasoning tasks. Abstract: Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, thereby guiding actions and laying the foundation for advanced intelligent systems. However, existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, limiting their practical use in real-world scenarios. To address these limitations, we introduce VisualTrans, the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios. VisualTrans encompasses 12 semantically diverse manipulation tasks and systematically evaluates three essential reasoning dimensions - spatial, procedural, and quantitative - through 6 well-defined subtask types.
The benchmark features 472 high-quality question-answer pairs in various formats, including multiple-choice, open-ended counting, and target enumeration. We introduce a scalable data construction pipeline built upon first-person manipulation videos, which integrates task selection, image pair extraction, automated metadata annotation with large multimodal models, and structured question generation. Human verification ensures the final benchmark is both high-quality and interpretable. Evaluations of various state-of-the-art vision-language models show strong performance in static spatial tasks. However, they reveal notable shortcomings in dynamic, multi-step reasoning scenarios, particularly in areas like intermediate state recognition and transformation sequence planning. These findings highlight fundamental weaknesses in temporal modeling and causal reasoning, providing clear directions for future research aimed at developing more capable and generalizable VTR systems. The dataset and code are available at https://github.com/WangYipu2002/VisualTrans.

[102] Iterative pseudo-labeling based adaptive copy-paste supervision for semi-supervised tumor segmentation

Qiangguo Jin,Hui Cui,Junbo Wang,Changming Sun,Yimiao He,Ping Xuan,Linlin Wang,Cong Cong,Leyi Wei,Ran Su

Main category: cs.CV

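TL;DR: This paper proposes IPA-CP, a new semi-supervised learning method for tumor segmentation in CT scans that addresses the shortcomings of existing methods in handling small-volume tumors and in exploiting data augmentation.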

Details

Motivation: Existing semi-supervised learning methods focus mainly on segmenting large organs and neglect challenging scenarios with numerous or small-volume tumors; the potential of data augmentation strategies also remains underexplored. Method: The paper introduces iterative pseudo-labeling based adaptive copy-paste supervision (IPA-CP), which combines an uncertainty-based adaptive augmentation mechanism with an iterative pseudo-label transition strategy. Result: Experiments on multiple datasets show that IPA-CP outperforms current state-of-the-art semi-supervised learning methods on tumor segmentation, and ablation studies confirm the effectiveness of its technical contributions. Conclusion: IPA-CP surpasses existing semi-supervised learning methods in medical image segmentation, performing especially well in challenging scenarios with numerous or small-volume tumors. Abstract: Semi-supervised learning (SSL) has attracted considerable attention in medical image processing. The latest SSL methods use a combination of consistency regularization and pseudo-labeling to achieve remarkable success. However, most existing SSL studies focus on segmenting large organs, neglecting the challenging scenarios where there are numerous tumors or tumors of small volume. Furthermore, the extensive capabilities of data augmentation strategies, particularly in the context of both labeled and unlabeled data, have yet to be thoroughly investigated. To tackle these challenges, we introduce a straightforward yet effective approach, termed iterative pseudo-labeling based adaptive copy-paste supervision (IPA-CP), for tumor segmentation in CT scans. IPA-CP incorporates a two-way uncertainty based adaptive augmentation mechanism, aiming to inject tumor uncertainties present in the mean teacher architecture into adaptive augmentation. Additionally, IPA-CP employs an iterative pseudo-label transition strategy to generate more robust and informative pseudo labels for the unlabeled samples. Extensive experiments on both in-house and public datasets show that our framework outperforms state-of-the-art SSL methods in medical image segmentation. Ablation study results demonstrate the effectiveness of our technical contributions.

[103] Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation

Jiayi He,Xu Wang,Shengeng Tang,Yaxiong Wang,Lechao Cheng,Dan Guo

Main category: cs.CV

TL;DR: This paper introduces a new paradigm for sign language video generation by decoupling motion semantics from signer identity, achieving high-quality synthesis and flexibility in signer personalization.

Details

Motivation: Sign language video generation faces challenges due to excessive signer-specific data requirements and poor generalization, which this approach aims to overcome. Method: A two-phase synthesis framework that first constructs a signer-independent multimodal motion lexicon and then uses a discrete-to-continuous motion synthesis stage followed by identity-aware neural rendering. Result: The proposed method enables high-quality synthesis and unprecedented flexibility in signer personalization by disentangling motion from identity. Conclusion: Decoupling motion semantics from signer identity in sign language video generation enables high-quality synthesis and flexibility in signer personalization. Abstract: Sign language video generation requires producing natural signing motions with realistic appearances under precise semantic control, yet faces two critical challenges: excessive signer-specific data requirements and poor generalization. We propose a new paradigm for sign language video generation that decouples motion semantics from signer identity through a two-phase synthesis framework. First, we construct a signer-independent multimodal motion lexicon, where each gloss is stored as identity-agnostic pose, gesture, and 3D mesh sequences, requiring only one recording per sign. This compact representation enables our second key innovation: a discrete-to-continuous motion synthesis stage that transforms retrieved gloss sequences into temporally coherent motion trajectories, followed by identity-aware neural rendering to produce photorealistic videos of arbitrary signers. Unlike prior work constrained by signer-specific datasets, our method treats motion as a first-class citizen: the learned latent pose dynamics serve as a portable "choreography layer" that can be visually realized through different human appearances. 
Extensive experiments demonstrate that disentangling motion from identity is not just viable but advantageous - enabling both high-quality synthesis and unprecedented flexibility in signer personalization.

[104] DOMR: Establishing Cross-View Segmentation via Dense Object Matching

Jitong Liao,Yulu Gao,Shaofei Huang,Jialin Gao,Jie Lei,Ronghua Liang,Si Liu

Main category: cs.CV

TL;DR: This paper proposes DOMR, a framework for cross-view object correspondence that leverages inter-object relationships to achieve dense matching, achieving state-of-the-art results on the Ego-Exo4D benchmark.

Details

Motivation: Cross-view object correspondence is critical yet challenging for visual understanding, and existing methods struggle with matching individual object masks effectively. Method: The DOMR framework uses a Dense Object Matcher (DOM) module that jointly models multiple objects leveraging positional and semantic relationships, combined with a mask refinement head. Result: DOMR achieved state-of-the-art performance on the Ego-Exo4D benchmark with a mean IoU of 49.7% on Ego→Exo and 55.2% on Exo→Ego, outperforming previous methods by 5.8% and 4.3% respectively. Conclusion: The DOMR framework is effective for cross-view object correspondence, achieving state-of-the-art results on the Ego-Exo4D benchmark. Abstract: Cross-view object correspondence involves matching objects between egocentric (first-person) and exocentric (third-person) views. It is a critical yet challenging task for visual understanding. In this work, we propose the Dense Object Matching and Refinement (DOMR) framework to establish dense object correspondences across views. The framework centers around the Dense Object Matcher (DOM) module, which jointly models multiple objects. Unlike methods that directly match individual object masks to image features, DOM leverages both positional and semantic relationships among objects to find correspondences. DOM integrates a proposal generation module with a dense matching module that jointly encodes visual, spatial, and semantic cues, explicitly constructing inter-object relationships to achieve dense matching among objects. Furthermore, we combine DOM with a mask refinement head designed to improve the completeness and accuracy of the predicted masks, forming the complete DOMR framework. Extensive evaluations on the Ego-Exo4D benchmark demonstrate that our approach achieves state-of-the-art performance with a mean IoU of 49.7% on Ego→Exo and 55.2% on Exo→Ego.
These results outperform those of previous methods by 5.8% and 4.3%, respectively, validating the effectiveness of our integrated approach for cross-view understanding.
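The headline numbers above are mean IoU scores over predicted object masks. For readers unfamiliar with the metric, here is a minimal sketch of intersection-over-union for one pair of binary masks. The function name and the toy masks are hypothetical, and returning 1.0 when both masks are empty is an assumed convention, since benchmarks differ on that case.

```python
from typing import List

Mask = List[List[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks with the same shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 shared pixels out of 4 covered pixels -> 0.5
```

A mean IoU is then the average of this value over all evaluated objects or frames, according to the benchmark's own protocol.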

[105] Towards Globally Predictable k-Space Interpolation: A White-box Transformer Approach

Chen Luo,Qiyu Jin,Taofeng Xie,Xuemei Wang,Huayu Wang,Congcong Liu,Liming Tang,Guoqing Chen,Zhuo-Xu Cui,Dong Liang

Main category: cs.CV

TL;DR: The paper proposes GPI-WT, a white-box Transformer for k-space interpolation, which improves accuracy and interpretability in accelerated MRI.

Details

Motivation: Existing methods overlook global dependencies in k-space, while Transformers, though effective, lack interpretability in interpolated data reliability. Method: GPI-WT, a white-box Transformer framework based on Globally Predictable Interpolation (GPI) for k-space, formulated as a structured low-rank model with learnable global annihilation filters and attention mechanism. Result: Experimental results show that GPI-WT significantly outperforms state-of-the-art approaches in k-space interpolation accuracy. Conclusion: GPI-WT provides superior k-space interpolation accuracy and interpretability compared to existing methods. Abstract: Interpolating missing data in k-space is essential for accelerating imaging. However, existing methods, including convolutional neural network-based deep learning, primarily exploit local predictability while overlooking the inherent global dependencies in k-space. Recently, Transformers have demonstrated remarkable success in natural language processing and image analysis due to their ability to capture long-range dependencies. This inspires the use of Transformers for k-space interpolation to better exploit its global structure. However, their lack of interpretability raises concerns regarding the reliability of interpolated data. To address this limitation, we propose GPI-WT, a white-box Transformer framework based on Globally Predictable Interpolation (GPI) for k-space. Specifically, we formulate GPI from the perspective of annihilation as a novel k-space structured low-rank (SLR) model. The global annihilation filters in the SLR model are treated as learnable parameters, and the subgradients of the SLR model naturally induce a learnable attention mechanism. By unfolding the subgradient-based optimization algorithm of SLR into a cascaded network, we construct the first white-box Transformer specifically designed for accelerated MRI. 
Experimental results demonstrate that the proposed method significantly outperforms state-of-the-art approaches in k-space interpolation accuracy while providing superior interpretability.

[106] Uni-DocDiff: A Unified Document Restoration Model Based on Diffusion

Fangmin Zhao,Weichao Zeng,Zhenhang Li,Dongbao Yang,Binbin Li,Xiaojun Bi,Yu Zhou

Main category: cs.CV

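TL;DR: This paper proposes Uni-DocDiff, a unified diffusion-based document restoration method that achieves efficient multi-task processing and scalability through learnable task prompts and a Prior Pool mechanism.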

Details

Motivation: Existing document restoration methods usually build a separate model for each task, which makes the overall system complex and hard to extend. Although some studies have tried to unify multiple tasks, their scalability is limited by handcrafted prompts and heavy preprocessing, so a more flexible and scalable approach is needed. Method: Uni-DocDiff is based on a diffusion model, adopts a learnable task prompt design, and introduces a Prior Pool and a Prior Fusion Module (PFM) to improve multi-task performance and adaptability. Result: Experiments show that Uni-DocDiff matches or exceeds task-specific models on multiple tasks while offering good task scalability and seamless adaptation to new tasks. Conclusion: Uni-DocDiff is a unified and highly scalable document restoration model that achieves multi-task capability through its learnable task prompt design and Prior Pool mechanism while retaining seamless adaptability to new tasks. Abstract: Removing various degradations from damaged documents greatly benefits digitization, downstream document analysis, and readability. Previous methods often treat each restoration task independently with dedicated models, leading to a cumbersome and highly complex document processing system. Although recent studies attempt to unify multiple tasks, they often suffer from limited scalability due to handcrafted prompts and heavy preprocessing, and fail to fully exploit inter-task synergy within a shared architecture. To address the aforementioned challenges, we propose Uni-DocDiff, a Unified and highly scalable Document restoration model based on Diffusion. Uni-DocDiff develops a learnable task prompt design, ensuring exceptional scalability across diverse tasks. To further enhance its multi-task capabilities and address potential task interference, we devise a novel Prior Pool, a simple yet comprehensive mechanism that combines both local high-frequency features and global low-frequency features. Additionally, we design the Prior Fusion Module (PFM), which enables the model to adaptively select the most relevant prior information for each specific task. Extensive experiments show that the versatile Uni-DocDiff achieves performance comparable or even superior to that of task-specific expert models, and simultaneously offers the task scalability needed for seamless adaptation to new tasks.

[107] TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation

Zunhui Xia,Hongxing Li,Libin Lan

Main category: cs.CV

TL;DR: TCSAFormer improves medical image segmentation efficiency and accuracy by using a novel attention mechanism and dual-branch network.

Details

Motivation: Transformer-based methods face limitations in computational complexity and the inability to capture local contextual information, which this study aims to resolve. Method: The TCSAFormer network incorporates a Compressed Attention module and a Dual-Branch Feed-Forward Network module to improve efficiency and capture local and multiscale features. Result: Experiments on ISIC-2018, CVC-ClinicDB, and Synapse datasets show that TCSAFormer outperforms state-of-the-art methods with lower computational overhead. Conclusion: TCSAFormer achieves superior performance in medical image segmentation by addressing computational complexity and enhancing feature representation. Abstract: In recent years, transformer-based methods have achieved remarkable progress in medical image segmentation due to their superior ability to capture long-range dependencies. However, these methods typically suffer from two major limitations. First, their computational complexity scales quadratically with the input sequences. Second, the feed-forward network (FFN) modules in vanilla Transformers typically rely on fully connected layers, which limits models' ability to capture local contextual information and multiscale features critical for precise semantic segmentation. To address these issues, we propose an efficient medical image segmentation network, named TCSAFormer. The proposed TCSAFormer adopts two key ideas. First, it incorporates a Compressed Attention (CA) module, which combines token compression and pixel-level sparse attention to dynamically focus on the most relevant key-value pairs for each query. This is achieved by pruning globally irrelevant tokens and merging redundant ones, significantly reducing computational complexity while enhancing the model's ability to capture relationships between tokens. 
Second, it introduces a Dual-Branch Feed-Forward Network (DBFFN) module as a replacement for the standard FFN to capture local contextual features and multiscale information, thereby strengthening the model's feature representation capability. We conduct extensive experiments on three publicly available medical image segmentation datasets: ISIC-2018, CVC-ClinicDB, and Synapse, to evaluate the segmentation performance of TCSAFormer. Experimental results demonstrate that TCSAFormer achieves superior performance compared to existing state-of-the-art (SOTA) methods, while maintaining lower computational overhead, thus achieving an optimal trade-off between efficiency and accuracy.
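The abstract describes sparse attention that focuses on the most relevant key-value pairs for each query. The sketch below illustrates that general idea with plain top-k selection for a single query vector. It is not the paper's Compressed Attention module: token pruning and merging are omitted, and the function name, the scaled dot-product scoring, and the toy vectors are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def topk_sparse_attention(query: Vector, keys: List[Vector],
                          values: List[Vector], k: int) -> Vector:
    """Attend only to the k keys with the highest scaled dot-product score."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    # Keep the indices of the k best-scoring keys; all other pairs are ignored.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected keys.
    max_score = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - max_score) for i in top]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) / total
            for d in range(dim)]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
print(topk_sparse_attention(query, keys, values, k=1))
print(topk_sparse_attention(query, keys, values, k=3))
```

With `k` equal to the number of keys this reduces to ordinary softmax attention. Smaller values of `k` shrink the softmax and the weighted sum, although this sketch still scores every key before selecting.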

[108] Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models

Zhaochen Liu,Kaiwen Gao,Shuyi Liang,Bin Xiao,Limeng Qiao,Lin Ma,Tingting Jiang

Main category: cs.CV

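TL;DR: This paper presents O-Bench, a visual question answering benchmark designed specifically for occlusion perception, which reveals a significant gap between current multimodal large language models and human performance.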

Details

Motivation: Although multimodal large language models (MLLMs) have shown remarkable capabilities, their performance on occlusion perception remains under-explored, which motivates a benchmark such as O-Bench. Method: Building on SA-1B, the authors construct 1,365 images featuring semantically coherent occlusion scenarios and annotate 4,588 question-answer pairs through a reliable semi-automatic workflow. Result: An evaluation of 22 representative MLLMs shows a significant performance gap between current models and the human baseline. Conclusion: O-Bench provides an important evaluation tool for occlusion perception and can inspire the development of MLLMs toward better visual intelligence. Abstract: Occlusion perception, a critical foundation for human-level spatial understanding, embodies the challenge of integrating visual recognition and reasoning. Though multimodal large language models (MLLMs) have demonstrated remarkable capabilities, their performance on occlusion perception remains under-explored. To address this gap, we introduce O-Bench, the first visual question answering (VQA) benchmark specifically designed for occlusion perception. Based on SA-1B, we construct 1,365 images featuring semantically coherent occlusion scenarios through a novel layered synthesis approach. Upon this foundation, we annotate 4,588 question-answer pairs in total across five tailored tasks, employing a reliable, semi-automatic workflow. Our extensive evaluation of 22 representative MLLMs against the human baseline reveals a significant performance gap between current MLLMs and humans, which, we find, cannot be sufficiently bridged by model scaling or thinking process. We further identify three typical failure patterns, including an overly conservative bias, a fragile gestalt prediction, and a struggle with quantitative tasks. We believe O-Bench can not only provide a vital evaluation tool for occlusion perception, but also inspire the development of MLLMs for better visual intelligence. Our benchmark will be made publicly available upon paper publication.

[109] TNet: Terrace Convolutional Decoder Network for Remote Sensing Image Semantic Segmentation

Chengqian Dai,Yonghong Guo,Hongzhao Xiang,Yigui Luo

Main category: cs.CV

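TL;DR: This paper proposes TNet, a remote sensing image segmentation architecture that progressively fuses features of different resolutions using convolution and addition operations, integrating global and local information efficiently.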

Details

Motivation: Existing segmentation networks focus mainly on feature interactions within a single scale and neglect global contextual dependencies across multiple scales. Method: The paper proposes TNet, which progressively fuses low-resolution and high-resolution features using convolution and addition operations. Result: The mIoU is 85.35% on ISPRS Vaihingen, 87.05% on ISPRS Potsdam, and 52.19% on LoveDA. Conclusion: TNet-R delivers efficient remote sensing image segmentation, performing well on multiple datasets while remaining computationally efficient. Abstract: In remote sensing, most segmentation networks adopt the UNet architecture, often incorporating modules such as Transformers or Mamba to enhance global-local feature interactions within decoder stages. However, these enhancements typically focus on intra-scale relationships and neglect the global contextual dependencies across multiple resolutions. To address this limitation, we introduce the Terrace Convolutional Decoder Network (TNet), a simple yet effective architecture that leverages only convolution and addition operations to progressively integrate low-resolution features (rich in global context) into higher-resolution features (rich in local details) across decoding stages. This progressive fusion enables the model to learn spatially-aware convolutional kernels that naturally blend global and local information in a stage-wise manner. We implement TNet with a ResNet-18 encoder (TNet-R) and evaluate it on three benchmark datasets. TNet-R achieves competitive performance with a mean Intersection-over-Union (mIoU) of 85.35% on ISPRS Vaihingen, 87.05% on ISPRS Potsdam, and 52.19% on LoveDA, while maintaining high computational efficiency. Code is publicly available.

[110] Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework

Yi-Ting Chen,Ting-Hsuan Liao,Pengsheng Guo,Alexander Schwing,Jia-Bin Huang

Main category: cs.CV

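TL;DR: 3DSR is a framework built on 3D Gaussian splatting and 2D super-resolution models that aims to improve the visual quality and spatial coherence of super-resolution reconstruction by explicitly maintaining 3D consistency across views.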

Details

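Motivation: Prior approaches, such as image upsampling or video super-resolution, either ignore 3D consistency or incorporate it only implicitly. 3DSR aims to improve the visual quality and spatial coherence of reconstructed scenes by handling 3D consistency explicitly. Method: 3DSR uses off-the-shelf diffusion-based 2D super-resolution models and achieves 3D consistency across views through an explicit 3D Gaussian-splatting scene representation. Result: 3DSR is evaluated on the MipNeRF360 and LLFF datasets; the results show that it produces visually compelling, high-resolution output while maintaining structural consistency in 3D reconstructions. Conclusion: 3DSR is a new 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models and strengthens 3D consistency across views through an explicit 3D Gaussian-splatting scene representation, improving visual quality without additional fine-tuning. Abstract: We propose 3D Super Resolution (3DSR), a novel 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models. 3DSR encourages 3D consistency across views via the use of an explicit 3D Gaussian-splatting-based scene representation. This makes the proposed 3DSR different from prior work, such as image upsampling or the use of video super-resolution, which either don't consider 3D consistency or aim to incorporate 3D consistency implicitly. Notably, our method enhances visual quality without additional fine-tuning, ensuring spatial coherence within the reconstructed scene. We evaluate 3DSR on MipNeRF360 and LLFF data, demonstrating that it produces high-resolution results that are visually compelling, while maintaining structural consistency in 3D reconstructions. Code will be released.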

[111] DET-GS: Depth- and Edge-Aware Regularization for High-Fidelity 3D Gaussian Splatting

Zexu Huang,Min Xu,Stuart Perry

Main category: cs.CV

TL;DR: The paper introduces DET-GS, a depth and edge-aware regularization framework for 3D Gaussian Splatting, which significantly improves geometric reconstruction quality and robustness in sparse-view novel view synthesis scenarios.

Details

Motivation: Achieving accurate geometric reconstruction under sparse-view conditions remains a challenge, as existing methods struggle with capturing fine-grained structures, handling depth estimation noise, and preserving important edges and textures. Method: The paper proposes DET-GS, which includes hierarchical geometric depth supervision, edge-aware depth regularization using semantic masks from Canny edge detection, and an RGB-guided edge-preserving Total Variation loss. Result: Extensive experiments show that DET-GS achieves better performance in both geometric accuracy and visual fidelity compared to state-of-the-art methods on sparse-view novel view synthesis benchmarks. Conclusion: DET-GS provides significant improvements in geometric accuracy and visual fidelity for sparse-view novel view synthesis, outperforming current state-of-the-art methods. Abstract: 3D Gaussian Splatting (3DGS) represents a significant advancement in the field of efficient and high-fidelity novel view synthesis. Despite recent progress, achieving accurate geometric reconstruction under sparse-view conditions remains a fundamental challenge. Existing methods often rely on non-local depth regularization, which fails to capture fine-grained structures and is highly sensitive to depth estimation noise. Furthermore, traditional smoothing methods neglect semantic boundaries and indiscriminately degrade essential edges and textures, consequently limiting the overall quality of reconstruction. In this work, we propose DET-GS, a unified depth and edge-aware regularization framework for 3D Gaussian Splatting. DET-GS introduces a hierarchical geometric depth supervision framework that adaptively enforces multi-level geometric consistency, significantly enhancing structural fidelity and robustness against depth estimation noise. To preserve scene boundaries, we design an edge-aware depth regularization guided by semantic masks derived from Canny edge detection. 
Furthermore, we introduce an RGB-guided edge-preserving Total Variation loss that selectively smooths homogeneous regions while rigorously retaining high-frequency details and textures. Extensive experiments demonstrate that DET-GS achieves substantial improvements in both geometric accuracy and visual fidelity, outperforming state-of-the-art (SOTA) methods on sparse-view novel view synthesis benchmarks.
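The abstract mentions an RGB-guided edge-preserving Total Variation loss that smooths homogeneous regions while keeping edges. The exact formulation is not given here, so the sketch below uses a common generic form as an assumption: each depth difference between neighbouring pixels is down-weighted by `exp(-sharpness * |image difference|)`. The function name, the `sharpness` parameter, and the toy inputs are hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]


def edge_aware_tv(depth: Grid, gray: Grid, sharpness: float = 10.0) -> float:
    """Mean edge-weighted total variation of a depth map.

    Depth changes are penalised less where the guiding image also changes,
    so depth discontinuities that coincide with image edges are preserved.
    """
    rows, cols = len(depth), len(depth[0])
    total = 0.0
    count = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:  # horizontal neighbour
                weight = math.exp(-sharpness * abs(gray[r][c + 1] - gray[r][c]))
                total += weight * abs(depth[r][c + 1] - depth[r][c])
                count += 1
            if r + 1 < rows:  # vertical neighbour
                weight = math.exp(-sharpness * abs(gray[r + 1][c] - gray[r][c]))
                total += weight * abs(depth[r + 1][c] - depth[r][c])
                count += 1
    return total / count if count else 0.0


depth = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
gray_with_edge = [[0.0, 0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]]
gray_flat = [[0.5, 0.5, 0.5, 0.5],
             [0.5, 0.5, 0.5, 0.5]]
print(edge_aware_tv(depth, gray_with_edge))  # depth step aligned with an image edge
print(edge_aware_tv(depth, gray_flat))       # same depth step, no image edge
```

The same depth step costs far less when the guiding image has an edge at that location, which is the behaviour an edge-preserving regularizer aims for.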

[112] NEARL-CLIP: Interacted Query Adaptation with Orthogonal Regularization for Medical Vision-Language Understanding

Zelin Peng,Yichen Zhao,Yu Huang,Piao Yang,Feilong Tang,Zhengqin Xu,Xiaokang Yang,Wei Shen

Main category: cs.CV

TL;DR: This paper introduces NEARL-CLIP, a cross-modality interaction framework that enhances the application of vision-language models (VLMs) in medical imaging by addressing domain gaps and modality misalignment, achieving performance improvements with only 1.46M additional parameters.

Details

Motivation: The motivation is to overcome the limitations of existing vision-language models (VLMs) when applied to medical imaging due to domain gaps and modality misalignment, aiming to fully unlock the potential of VLMs in the medical domain. Method: The paper proposes NEARL-CLIP, which includes two key components: Unified Synergy Embedding Transformer (USEformer) for dynamic cross-modality interaction and Orthogonal Cross-Attention Adapter (OCA) to decouple new knowledge into distinct components, enhancing modality interaction while introducing only 1.46M learnable parameters. Result: NEARL-CLIP is introduced as a framework that improves modality interaction through dynamic cross-modality queries and orthogonal decoupling, enhancing the application of VLMs to medical imaging with minimal added parameters. Conclusion: The paper concludes that NEARL-CLIP, a novel cross-modality interaction VLM-based framework, successfully bridges the domain gap in applying vision-language models to medical imaging analysis, achieving performance improvements in a parameter-efficient manner. Abstract: Computer-aided medical image analysis is crucial for disease diagnosis and treatment planning, yet limited annotated datasets restrict medical-specific model development. While vision-language models (VLMs) like CLIP offer strong generalization capabilities, their direct application to medical imaging analysis is impeded by a significant domain gap. Existing approaches to bridge this gap, including prompt learning and one-way modality interaction techniques, typically focus on introducing domain knowledge to a single modality. Although this may offer performance gains, it often causes modality misalignment, thereby failing to unlock the full potential of VLMs. 
In this paper, we propose NEARL-CLIP (iNteracted quEry Adaptation with oRthogonaL Regularization), a novel cross-modality interaction VLM-based framework that contains two contributions: (1) Unified Synergy Embedding Transformer (USEformer), which dynamically generates cross-modality queries to promote interaction between modalities, thus fostering the mutual enrichment and enhancement of multi-modal medical domain knowledge; (2) Orthogonal Cross-Attention Adapter (OCA). OCA introduces an orthogonality technique to decouple the new knowledge from USEformer into two distinct components: the truly novel information and the incremental knowledge. By isolating the learning process from the interference of incremental knowledge, OCA enables a more focused acquisition of new information, thereby further facilitating modality interaction and unleashing the capability of VLMs. Notably, NEARL-CLIP achieves these two contributions in a parameter-efficient style, which only introduces 1.46M learnable parameters.
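The abstract does not give OCA's formulation, so the following is only a generic illustration of what "decoupling by orthogonality" can mean: a new feature vector is split into the part that lies along an existing feature (treated here as the incremental knowledge) and the remainder orthogonal to it (treated here as the novel information). The function name `split_by_orthogonality` and the toy vectors are assumptions for illustration, not the paper's adapter.

```python
def split_by_orthogonality(new_feat, base_feat):
    """Split new_feat into a component parallel to base_feat and one orthogonal to it.

    Hypothetical helper: a plain vector projection, not NEARL-CLIP's OCA module.
    """
    dot = sum(n * b for n, b in zip(new_feat, base_feat))
    base_sq = sum(b * b for b in base_feat)
    if base_sq == 0.0:
        # Nothing to project onto: everything counts as novel.
        return [0.0] * len(new_feat), list(new_feat)
    scale = dot / base_sq
    incremental = [scale * b for b in base_feat]            # parallel part
    novel = [n - i for n, i in zip(new_feat, incremental)]  # orthogonal part
    return incremental, novel


# Toy example: the base feature points along the first axis.
incremental, novel = split_by_orthogonality([3.0, 4.0], [1.0, 0.0])
print(incremental, novel)
```

The two parts sum back to the original vector and have zero inner product, which is the property an orthogonality regularizer would encourage.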

[113] AR as an Evaluation Playground: Bridging Metrics and Visual Perception of Computer Vision Models

Ashkan Ganj,Yiqin Zhao,Tian Guo

Main category: cs.CV

TL;DR: This paper introduces ARCADE, an augmented reality platform that simplifies and enhances human perception studies for evaluating computer vision models, making it easier to understand model performance through interactive, context-rich user studies.

Details

Motivation: Human perception studies provide important insights into computer vision model performance, but are difficult to conduct due to complex system setups and scalability issues. Augmented reality offers a novel solution for simplifying and enhancing these studies. Method: Design and implementation of ARCADE, an augmented reality-based platform for conducting human perception studies in computer vision, supporting cross-platform data collection, pluggable model inference, and AR streaming for user studies. Result: The ARCADE platform was successfully demonstrated with depth and lighting estimation models, showing its effectiveness in eliciting human perceptual judgments and supporting various deployment and study settings. Conclusion: ARCADE is a flexible and effective platform for human-centered evaluation of computer vision models, enabling easier and more interactive perceptual studies through AR. Abstract: Human perception studies can provide complementary insights to qualitative evaluation for understanding computer vision (CV) model performance. However, conducting human perception studies remains a non-trivial task: it often requires complex, end-to-end system setups that are time-consuming and difficult to scale. In this paper, we explore the unique opportunity presented by augmented reality (AR) for helping CV researchers to conduct perceptual studies. We design ARCADE, an evaluation platform that allows researchers to easily leverage AR's rich context and interactivity for human-centered CV evaluation. Specifically, ARCADE supports cross-platform AR data collection, custom experiment protocols via pluggable model inference, and AR streaming for user studies. We demonstrate ARCADE using two types of CV models, depth and lighting estimation, and show that AR tasks can be effectively used to elicit human perceptual judgments of model quality.
We also evaluate the system's usability and performance across different deployment and study settings, highlighting its flexibility and effectiveness as a human-centered evaluation platform.

[114] Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

Jingchao Wang,Zhijian Wu,Dingjiang Huang,Yefeng Zheng,Hong Wang

Main category: cs.CV

TL;DR: MLLMSeg is a novel framework for Reference Expression Segmentation that effectively balances performance and cost by integrating visual and semantic features without relying on an additional visual encoder or heavy models like SAM.

Details

Motivation: To address the trade-off between performance and cost in Reference Expression Segmentation (RES), particularly to overcome the limitations of existing methods that either rely on the parameter-heavy SAM or sacrifice accuracy with SAM-free pipelines. Method: MLLMSeg utilizes the visual detail features from the MLLM vision encoder and semantic features from the LLM. It incorporates a detail-enhanced and semantic-consistent feature fusion module (DSFF) and a lightweight mask decoder with 34M parameters. Result: Extensive experiments show that MLLMSeg surpasses both SAM-based and SAM-free methods, achieving precise mask prediction while maintaining cost efficiency. Conclusion: The proposed MLLMSeg framework achieves a better balance between performance and cost compared to existing methods, effectively integrating visual and semantic features for precise mask prediction. Abstract: Reference Expression Segmentation (RES) aims to segment image regions specified by referring expressions and has become popular with the rise of multimodal large models (MLLMs). While MLLMs excel in semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction. Existing RES methods either couple MLLMs with the parameter-heavy Segment Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight pipelines that sacrifice accuracy. To address the trade-off between performance and cost, we specifically propose MLLMSeg, a novel framework that fully exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. Besides, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual feature with the semantic-related feature output by the large language model (LLM) of MLLM. 
Finally, we establish a light-weight mask decoder with only 34M network parameters that optimally leverages detailed spatial features from the visual encoder and semantic features from the LLM to achieve precise mask prediction. Extensive experiments demonstrate that our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost. Code is available at https://github.com/jcwang0602/MLLMSeg.

[115] CLIPVehicle: A Unified Framework for Vision-based Vehicle Search

Likai Wang,Ruize Han,Xiangqun Zhang,Wei Feng

Main category: cs.CV

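TL;DR: This paper proposes CLIPVehicle, a unified framework for joint detection and re-identification in vehicle search. It introduces a dual-granularity semantic-region alignment module and a multi-level vehicle identification learning strategy to resolve the conflicting objectives of the detection and re-identification tasks.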

Details

Motivation: Existing vehicle search methods must first pre-detect and store all vehicle images and then apply a vehicle re-identification model, which is resource-intensive and not very practical. This work therefore aims to achieve joint detection and re-identification for vehicle search. Method: A new unified framework, CLIPVehicle, is proposed. It contains a dual-granularity semantic-region alignment module that leverages vision-language models (VLMs) for vehicle discrimination modeling, and a multi-level vehicle identification learning strategy that learns identity representations at the global, instance, and feature levels. Result: Experiments show that the proposed method outperforms existing state-of-the-art methods on both vehicle re-identification and person search tasks. The work also constructs a new benchmark comprising the real-world dataset CityFlowVS and two synthetic datasets, SynVS-Day and SynVS-All. Conclusion: The proposed CLIPVehicle framework effectively resolves the conflict between the vehicle detection and re-identification tasks, achieves efficient vehicle search, and demonstrates superior performance on multiple datasets. Abstract: Vehicles are among the most common and significant objects in the real world, and computer vision research on them, such as vehicle detection and vehicle re-identification, has made remarkable progress. To search for a vehicle of interest in surveillance videos, existing methods first pre-detect and store all vehicle patches, and then apply vehicle re-identification models, which is resource-intensive and not very practical. In this work, we aim to achieve joint detection and re-identification for vehicle search. However, the conflicting objectives between detection, which focuses on shared vehicle commonness, and re-identification, which focuses on individual vehicle uniqueness, make it challenging for a model to learn in an end-to-end system. For this problem, we propose a new unified framework, namely CLIPVehicle, which contains a dual-granularity semantic-region alignment module to leverage the VLMs (Vision-Language Models) for vehicle discrimination modeling, and a multi-level vehicle identification learning strategy to learn the identity representation from global, instance and feature levels. We also construct a new benchmark, including a real-world dataset CityFlowVS, and two synthetic datasets SynVS-Day and SynVS-All, for vehicle search. Extensive experimental results demonstrate that our method outperforms the state-of-the-art methods of both vehicle Re-ID and person search tasks.

[116] Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation

Maximilian Ulmer,Wout Boerdijk,Rudolph Triebel,Maximilian Durner

Main category: cs.CV

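TL;DR: This study proposes OC-DiT, a new diffusion model that performs zero-shot instance segmentation through a conditional latent diffusion framework and achieves state-of-the-art performance on multiple benchmarks.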

Details

Motivation: To enable object-centric prediction, a new diffusion model (OC-DiT) is designed to address zero-shot instance segmentation. Method: A conditional latent diffusion framework generates instance masks by conditioning on object templates and image features within the diffusion model's latent space. It introduces a coarse model that generates initial object instance proposals and a refinement model that refines the proposals in parallel. Result: OC-DiT achieves state-of-the-art performance on multiple challenging real-world benchmarks, and comprehensive ablation studies validate the model's effectiveness. Conclusion: OC-DiT demonstrates the potential of diffusion models for instance segmentation, reaching state-of-the-art performance on multiple real-world benchmarks without retraining on target data. Abstract: This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks.

Zheng Cheng,Wenri Wang,Guangyong Chen,Yakun Ju,Yihua Cheng,Zhisong Liu,Yanda Meng,Jintao Song

Main category: cs.CV

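TL;DR: This paper proposes SSD-Net, an underwater image enhancement method based on single-scale feature extraction, which combines the strengths of CNNs and Transformers through an asymmetrical decomposition mechanism and reduces computational complexity while preserving enhancement quality.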

Details

Motivation: To explore the potential of single-scale features in underwater enhancement while addressing underwater image degradation caused by light absorption and scattering, such as color distortion, blurring, and low contrast. Method: An innovative Single-Scale Decomposition Network (SSD-Net) is proposed. It decomposes the input image into a clean layer and a degradation layer through an asymmetrical decomposition mechanism and combines the strengths of CNNs and Transformers via a Parallel Feature Decomposition Block (PFDB) and a Bidirectional Feature Communication Block (BFCB). Result: Experiments show that single-scale feature extraction can match or even surpass the performance of multi-scale methods while significantly reducing computational complexity. Conclusion: High-quality image reconstruction does not necessarily depend on multi-scale feature fusion; single-scale feature extraction can match or even exceed multi-scale methods in underwater image enhancement while significantly reducing complexity. Abstract: Underwater image enhancement (UIE) techniques aim to improve visual quality of images captured in aquatic environments by addressing degradation issues caused by light absorption and scattering effects, including color distortion, blurring, and low contrast. Current mainstream solutions predominantly employ multi-scale feature extraction (MSFE) mechanisms to enhance reconstruction quality through multi-resolution feature fusion. However, our extensive experiments demonstrate that high-quality image reconstruction does not necessarily rely on multi-scale feature fusion. Contrary to popular belief, our experiments show that single-scale feature extraction alone can match or surpass the performance of multi-scale methods, significantly reducing complexity. To comprehensively explore single-scale feature potential in underwater enhancement, we propose an innovative Single-Scale Decomposition Network (SSD-Net). This architecture introduces an asymmetrical decomposition mechanism that disentangles the input image into a clean layer and a degradation layer. The former contains scene-intrinsic information and the latter encodes medium-induced interference. It uniquely combines CNN's local feature extraction capabilities with Transformer's global modeling strengths through two core modules: 1) Parallel Feature Decomposition Block (PFDB), implementing dual-branch feature space decoupling via efficient attention operations and adaptive sparse transformer; 2) Bidirectional Feature Communication Block (BFCB), enabling cross-layer residual interactions for complementary feature mining and fusion.
This synergistic design preserves feature decomposition independence while establishing dynamic cross-layer information pathways, effectively enhancing degradation decoupling capacity.

[118] Learning Using Privileged Information for Litter Detection

Matthias Bartolo,Konstantinos Makantasis,Dylan Seychell

Main category: cs.CV

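TL;DR: This paper proposes a new method that combines privileged information with a binary-mask encoding of bounding boxes to improve the accuracy and efficiency of litter detection without increasing model complexity.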

Details

Motivation: As litter pollution continues to rise globally, developing efficient automated litter detection tools remains a major challenge. Existing litter detection methods are limited in detecting small litter, detecting partially occluded objects, and in model efficiency. Method: The paper combines privileged information with deep learning object detection for the first time and designs a way to encode bounding box information as a binary mask that refines the guidance given to the detection model. The method is evaluated on five widely used object detection models. Result: Experiments show performance gains on the SODA, BDW, and UAVVaste litter detection datasets: the method improves detection accuracy within the training sets and generalizes well to other litter detection scenarios. These improvements are achieved without increasing model complexity or adding extra layers, preserving computational efficiency and scalability. Conclusion: The paper presents a new method that combines privileged information with deep learning object detection to improve litter detection while maintaining computational efficiency. It shows consistent gains across multiple litter detection datasets without added model complexity, making it practical for real-world use. Abstract: As litter pollution continues to rise globally, developing automated tools capable of detecting litter effectively remains a significant challenge. This study presents a novel approach that combines, for the first time, privileged information with deep learning object detection to improve litter detection while maintaining model efficiency. We evaluate our method across five widely used object detection models, addressing challenges such as detecting small litter and objects partially obscured by grass or stones. In addition, a key contribution of our work is a means of encoding bounding box information as a binary mask, which can be fed to the detection model to refine detection guidance. Through experiments on both within-dataset evaluation on the renowned SODA dataset and cross-dataset evaluation on the BDW and UAVVaste litter detection datasets, we demonstrate consistent performance improvements across all models. Our approach not only bolsters detection accuracy within the training sets but also generalises well to other litter detection contexts. Crucially, these improvements are achieved without increasing model complexity or adding extra layers, ensuring computational efficiency and scalability. Our results suggest that this methodology offers a practical solution for litter detection, balancing accuracy and efficiency in real-world applications.
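To make the binary-mask encoding of bounding boxes concrete, here is a minimal sketch under stated assumptions: boxes are given as `(x_min, y_min, x_max, y_max)` in pixel coordinates with exclusive upper bounds, and the mask is a nested list of 0/1 values. The abstract does not specify the box format or how the mask is fed to the detector, so the function name `boxes_to_mask` and these conventions are illustrative only.

```python
def boxes_to_mask(boxes, height, width):
    """Rasterize bounding boxes into a height x width binary mask.

    Assumed box format: (x_min, y_min, x_max, y_max), upper bounds exclusive.
    """
    mask = [[0] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max in boxes:
        # Clip each box to the image so out-of-range boxes cannot raise errors.
        x0, y0 = max(0, x_min), max(0, y_min)
        x1, y1 = min(width, x_max), min(height, y_max)
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 1
    return mask


# One box covering columns 1-2 of row 1 in a 3 x 4 image.
mask = boxes_to_mask([(1, 1, 3, 2)], height=3, width=4)
for row in mask:
    print(row)
```

Overlapping boxes simply union into the same mask, so the encoding stays a single extra channel regardless of how many objects an image contains.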

[119] SVC 2025: the First Multimodal Deception Detection Challenge

Xun Lin,Xiaobao Guo,Taorui Wang,Yingjie Ma,Jiajian Huang,Jiayu Zhang,Junzhe Cao,Zitong Yu

Main category: cs.CV

TL;DR: The SVC 2025 challenge addresses cross-domain deception detection using multimodal data to improve real-world performance.

Details

Motivation: Current methods struggle with domain shifts, limiting their real-world applicability. Method: Introducing the SVC 2025 Multimodal Deception Detection Challenge as a benchmark for evaluating cross-domain generalization. Result: A new benchmark encourages developing models that generalize across multiple heterogeneous datasets using multimodal cues. Conclusion: The challenge attracted 21 teams, indicating strong interest and potential advancements in cross-domain deception detection. Abstract: Deception detection is a critical task in real-world applications such as security screening, fraud prevention, and credibility assessment. While deep learning methods have shown promise in surpassing human-level performance, their effectiveness often depends on the availability of high-quality and diverse deception samples. Existing research predominantly focuses on single-domain scenarios, overlooking the significant performance degradation caused by domain shifts. To address this gap, we present the SVC 2025 Multimodal Deception Detection Challenge, a new benchmark designed to evaluate cross-domain generalization in audio-visual deception detection. Participants are required to develop models that not only perform well within individual domains but also generalize across multiple heterogeneous datasets. By leveraging multimodal data, including audio, video, and text, this challenge encourages the design of models capable of capturing subtle and implicit deceptive cues. Through this benchmark, we aim to foster the development of more adaptable, explainable, and practically deployable deception detection systems, advancing the broader field of multimodal learning. By the conclusion of the workshop competition, a total of 21 teams had submitted their final results. See https://sites.google.com/view/svc-mm25 for more information.

[120] DS$^2$Net: Detail-Semantic Deep Supervision Network for Medical Image Segmentation

Zhaohong Huang,Yuxin Zhang,Mingbao Lin,Taojian Zhou,Guorong Cai,Rongrong Ji

Main category: cs.CV

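TL;DR: This paper proposes DS²Net, a new medical image segmentation method that significantly improves segmentation through multi-view deep supervision.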

Details

Motivation: Existing medical image segmentation methods supervise either coarse-grained semantic features or fine-grained detail features in isolation, ignoring the key relationship between the two. Method: A Detail-Semantic Deep Supervision Network (DS²Net) is proposed, combining a detail enhancement module and a semantic enhancement module with an uncertainty-based supervision loss. Result: On six medical image segmentation benchmarks, DS²Net consistently outperforms existing state-of-the-art methods. Conclusion: DS²Net achieves multi-view deep supervision through its detail enhancement and semantic enhancement modules, significantly improving medical image segmentation. Abstract: Deep Supervision Networks exhibit significant efficacy for the medical imaging community. Nevertheless, existing work merely supervises either the coarse-grained semantic features or fine-grained detailed features in isolation, which overlooks the fact that these two types of features hold vital relationships in medical image analysis. We advocate the power of complementary feature supervision for medical image segmentation, by proposing a Detail-Semantic Deep Supervision Network (DS$^2$Net). DS$^2$Net navigates both low-level detailed and high-level semantic feature supervision through Detail Enhance Module (DEM) and Semantic Enhance Module (SEM). DEM and SEM respectively harness low-level and high-level feature maps to create detail and semantic masks for enhancing feature supervision. This is a novel shift from single-view deep supervision to multi-view deep supervision. DS$^2$Net is also equipped with a novel uncertainty-based supervision loss that adaptively assigns the supervision strength of features within distinct scales based on their uncertainty, thus circumventing the sub-optimal heuristic design that typifies previous works. Through extensive experiments on six benchmarks captured under colonoscopy, ultrasound, or microscope, we demonstrate that DS$^2$Net consistently outperforms state-of-the-art methods for medical image analysis.

[121] UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval

Hongyu Guo,Kuan Zhu,Xiangzhao Hao,Haiyun Guo,Ming Tang,Jinqiao Wang

Main category: cs.CV

TL;DR: UniFGVC is a training-free framework for few-shot FGVC that reformulates the task as multimodal retrieval, leveraging MLLMs to generate descriptive captions and achieving superior performance across multiple benchmarks.

Details

Motivation: To overcome overfitting and weak generalization of existing methods in few-shot FGVC, the work introduces a training-free framework leveraging multimodal retrieval and large language models. Method: UniFGVC reformulates few-shot FGVC as multimodal retrieval, using a Category-Discriminative Visual Captioner (CDV-Captioner) to generate structured text descriptions that capture fine-grained features, followed by vision and text encoders to embed and retrieve the nearest template in joint space. Result: UniFGVC consistently outperforms previous few-shot CLIP-based methods and even some fully supervised MLLM-based approaches across 12 FGVC benchmarks. Conclusion: UniFGVC offers a training-free solution for few-shot FGVC scenarios, ensuring broad compatibility and reliable generalization across diverse MLLMs and encoders. Abstract: Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly finetuned the pre-trained visual language models to achieve performance gain, yet suffering from overfitting and weak generalization. To deal with this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner) to exploit the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance discrimination of generated captions. Using it we can convert each image into an image-description pair, enabling more comprehensive feature representation, and construct the multimodal category templates using few-shot samples for the subsequent retrieval pipeline. 
Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLMs-based approaches.
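The final retrieval step can be sketched as a nearest-neighbour lookup. The sketch below assumes that an image embedding and a caption embedding are concatenated into one joint vector and compared by cosine similarity; the abstract does not say how UniFGVC forms its joint space or which similarity it uses, so `joint_embedding`, `classify`, and the toy vectors (stand-ins for real encoder outputs) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def joint_embedding(image_vec, text_vec):
    """Assumed fusion: concatenate the image and caption embeddings."""
    return list(image_vec) + list(text_vec)


def classify(query_vec, templates):
    """Return the label of the template most similar to the query."""
    best_label, best_score = None, float("-inf")
    for label, template_vec in templates:
        score = cosine(query_vec, template_vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy few-shot templates: one joint vector per category.
templates = [
    ("sparrow", joint_embedding([1.0, 0.0], [1.0, 0.0])),
    ("finch", joint_embedding([0.0, 1.0], [0.0, 1.0])),
]
query = joint_embedding([0.9, 0.1], [0.8, 0.2])
predicted = classify(query, templates)
print(predicted)
```

Because classification reduces to a similarity lookup, adding a new category only requires embedding a few labelled examples, with no gradient updates.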

[122] IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control

Lijuan Liu,Wenfa Li,Dongbo Zhang,Shuo Wang,Shaohui Jiao

Main category: cs.CV

TL;DR: IDC-Net is a new framework that generates RGB-D video sequences with precise camera control and high geometric fidelity, outperforming existing methods and enabling direct use in 3D scene reconstruction tasks.

Details

Motivation: The motivation is to overcome the limitations of previous approaches that treat RGB and depth generation separately, leading to poor geometric alignment and camera control. The authors aim to improve the geometric consistency and visual quality of generated RGB-D video sequences. Method: IDC-Net uses a unified geometry-aware diffusion model with a geometry-aware transformer block to jointly synthesize RGB images and corresponding depth maps. It is trained on a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses. Result: Extensive experiments show that IDC-Net outperforms state-of-the-art approaches in terms of both visual quality and geometric consistency of generated scene sequences. The generated RGB-D sequences can be directly used for downstream 3D scene reconstruction tasks. Conclusion: IDC-Net is able to generate RGB-D video sequences with high geometric fidelity and enables precise camera control without requiring extra post-processing steps. Abstract: We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. 
Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control, enhancing control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be directly fed into downstream 3D scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at https://idcnet-scene.github.io.

[123] ICM-Fusion: In-Context Meta-Optimized LoRA Fusion for Multi-Task Adaptation

Yihua Shao,Xiaofeng Lin,Xinwei Long,Siyu Chen,Minxi Yan,Yang Liu,Ziyang Yan,Ao Ma,Hao Tang,Jingcai Guo

Main category: cs.CV

TL;DR: This paper proposes ICM-Fusion, a novel framework combining meta-learning and in-context adaptation to improve multi-task performance in pre-trained LoRA models, especially in few-shot scenarios.

Details

Motivation: Existing pre-trained LoRA fusion methods face inter-weight conflicts and catastrophic domain forgetting, especially in long-tailed weight distributions or few-shot scenarios. Incremental learning also struggles with generalization in such cases. This necessitates a more effective approach for multi-task adaptation. Method: The proposed ICM-Fusion framework combines meta-learning with in-context adaptation. It uses task vector arithmetic to dynamically balance conflicting optimization directions across domains through learned manifold projections. The optimal task vector orientation is determined in latent space, and a self-designed Fusion VAE (F-VAE) reconstructs the fused LoRA for multi-task generation. Result: Extensive experiments on visual and linguistic tasks demonstrate that ICM-Fusion outperforms current LoRA fusion techniques by significantly reducing multi-tasking loss and enabling task enhancement in few-shot scenarios. Conclusion: ICM-Fusion can be applied to a wide range of architectural models and various tasks, significantly reducing multi-tasking loss and achieving task enhancement in few-shot scenarios compared to existing LoRA fusion methods. Abstract: Enabling multi-task adaptation in pre-trained Low-Rank Adaptation (LoRA) models is crucial for enhancing their generalization capabilities. Most existing pre-trained LoRA fusion methods decompose weight matrices, sharing similar parameters while merging divergent ones. However, this paradigm inevitably induces inter-weight conflicts and leads to catastrophic domain forgetting. While incremental learning enables adaptation to multiple tasks, it struggles to achieve generalization in few-shot scenarios. Consequently, when the weight data follows a long-tailed distribution, it can lead to forgetting in the fused weights. To address this issue, we propose In-Context Meta LoRA Fusion (ICM-Fusion), a novel framework that synergizes meta-learning with in-context adaptation. 
The key innovation lies in our task vector arithmetic, which dynamically balances conflicting optimization directions across domains through learned manifold projections. ICM-Fusion obtains the optimal task vector orientation for the fused model in the latent space by adjusting the orientation of the task vectors. Subsequently, the fused LoRA is reconstructed by a self-designed Fusion VAE (F-VAE) to realize multi-task LoRA generation. We have conducted extensive experiments on visual and linguistic tasks, and the experimental results demonstrate that ICM-Fusion can be adapted to a wide range of architectural models and applied to various tasks. Compared to the current pre-trained LoRA fusion method, ICM-Fusion fused LoRA can significantly reduce the multi-tasking loss and can even achieve task enhancement in few-shot scenarios.
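For readers unfamiliar with task vector arithmetic, the sketch below shows only the commonly used baseline form: a task vector is the difference between adapted and base weights, and a fused model adds a weighted sum of task vectors back to the base. ICM-Fusion's learned manifold projections and its F-VAE reconstruction are not described in enough detail in the abstract to reproduce, so the fixed coefficients and the flat weight lists here are assumptions for illustration.

```python
def task_vector(adapted_weights, base_weights):
    """Task vector: element-wise difference between adapted and base weights."""
    return [a - b for a, b in zip(adapted_weights, base_weights)]


def merge(base_weights, task_vectors, coefficients):
    """Add a weighted sum of task vectors to the base weights.

    Fixed coefficients are a stand-in; the paper learns how to orient
    and balance the vectors rather than using hand-set scalars.
    """
    merged = list(base_weights)
    for vector, coeff in zip(task_vectors, coefficients):
        merged = [m + coeff * v for m, v in zip(merged, vector)]
    return merged


base = [1.0, 1.0]
tv_a = task_vector([2.0, 1.0], base)  # task A changes the first weight
tv_b = task_vector([1.0, 3.0], base)  # task B changes the second weight
merged = merge(base, [tv_a, tv_b], [0.5, 0.5])
print(merged)
```

In this toy case the two task vectors are orthogonal, so they do not interfere; the conflicts the abstract mentions arise when vectors for different tasks point in opposing directions along the same weights.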

[124] Audio-Assisted Face Video Restoration with Temporal and Identity Complementary Learning

Yuqin Cao,Yixuan Gao,Wei Sun,Xiaohong Liu,Yulun Zhang,Xiongkuo Min

Main category: cs.CV

TL;DR: This paper introduces GAVN, a novel audio-assisted network for face video restoration that effectively addresses various distortions by integrating temporal and identity features, aided by audio and facial landmarks.

Details

Motivation: Most face video restoration methods neglect the correlations between visual and audio features, particularly in the mouth region. Existing audio-aided methods focus only on specific issues like compression artifact removal. The motivation is to develop a general solution for various video distortions. Method: The method involves three steps: (1) capturing inter-frame temporal features in low-resolution space for coarse restoration, (2) extracting intra-frame identity features in high-resolution space using audio signals and face landmarks, and (3) integrating temporal and identity features for high-quality video restoration. Result: GAVN outperforms state-of-the-art methods in tasks like compression artifact removal, deblurring, and super-resolution for face videos. Conclusion: GAVN, the proposed General Audio-assisted face Video restoration Network, demonstrates superior performance in restoring face videos by leveraging both temporal and identity features, aided by audio signals and face landmarks. Abstract: Face videos accompanied by audio have become integral to our daily lives, while they often suffer from complex degradations. Most face video restoration methods neglect the intrinsic correlations between the visual and audio features, especially in mouth regions. A few audio-aided face video restoration methods have been proposed, but they only focus on compression artifact removal. In this paper, we propose a General Audio-assisted face Video restoration Network (GAVN) to address various types of streaming video distortions via identity and temporal complementary learning. Specifically, GAVN first captures inter-frame temporal features in the low-resolution space to restore frames coarsely and save computational cost. Then, GAVN extracts intra-frame identity features in the high-resolution space with the assistance of audio signals and face landmarks to restore more facial details. 
Finally, the reconstruction module integrates temporal features and identity features to generate high-quality face videos. Experimental results demonstrate that GAVN outperforms the existing state-of-the-art methods on face video compression artifact removal, deblurring, and super-resolution. Codes will be released upon publication.

[125] ToxicTAGS: Decoding Toxic Memes with Rich Tag Annotations

Subhankar Swain,Naquee Rizwan,Nayandeep Deb,Vishwajeet Singh Solanki,Vishwa Gangadhar S,Animesh Mukherjee

Main category: cs.CV

TL;DR: This paper introduces a new meme dataset and a tag generation module, providing a foundation for improved content moderation on social media.

Details

Motivation: Social media plays a major role in amplifying toxic discourse, and memes, a common mode of online communication, often serve as vehicles for spreading harmful content. However, limited data accessibility and the high cost of dataset curation hinder the development of robust meme moderation systems. Method: Memes are annotated in two stages: (i) classification into toxic and normal, and (ii) fine-grained labelling of toxic memes as hateful, dangerous, or offensive. A tag generation module is also proposed to produce socially grounded tags. Result: Experiments show that the introduced tags substantially improve the performance of state-of-the-art vision-language models on detection tasks. Conclusion: The paper presents a dataset of 6,300 real-world meme-based posts and a tag generation module, offering a new and scalable foundation for improved content moderation in multimodal online environments. Abstract: The 2025 Global Risks Report identifies state-based armed conflict and societal polarisation among the most pressing global threats, with social media playing a central role in amplifying toxic discourse. Memes, as a widely used mode of online communication, often serve as vehicles for spreading harmful content. However, limitations in data accessibility and the high cost of dataset curation hinder the development of robust meme moderation systems. To address this challenge, in this work, we introduce a first-of-its-kind dataset of 6,300 real-world meme-based posts annotated in two stages: (i) binary classification into toxic and normal, and (ii) fine-grained labelling of toxic memes as hateful, dangerous, or offensive. A key feature of this dataset is that it is enriched with auxiliary metadata of socially relevant tags, enhancing the context of each meme. In addition, we propose a tag generation module that produces socially grounded tags, because most in-the-wild memes do not come with tags. Experimental results show that incorporating these tags substantially enhances the performance of state-of-the-art VLMs on detection tasks. Our contributions offer a novel and scalable foundation for improved content moderation in multimodal online environments.

[126] AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization

Jingyi Liao,Yongyi Su,Rong-Cheng Tu,Zhao Jin,Wenhao Sun,Yiting Li,Dacheng Tao,Xun Xu,Xulei Yang

Main category: cs.CV

TL;DR: This paper proposes a framework that enhances the application of Multimodal Large Language Models (MLLMs) in specialized anomaly detection by introducing a structured reasoning process and improved reward mechanism, resulting in better performance and efficient adaptation.

Details

Motivation: The motivation stems from the constraints faced by existing GRPO-based approaches in domain adaptation for MLLMs, particularly in specialized anomaly detection, due to inadequate data utilization and insufficient supervision over reasoning processes. Method: The method involves a multi-stage deliberative reasoning process that guides models from region identification to focused examination, along with a fine-grained reward mechanism that incorporates classification accuracy and localization supervision. Result: The framework demonstrates substantial performance improvements in adapting general vision-language models to specialized anomaly detection, achieving superior accuracy with efficient adaptation across multiple industrial datasets. Conclusion: The proposed framework successfully addresses the limitations of existing GRPO-based approaches in MLLMs for specialized anomaly detection by introducing a multi-stage deliberative reasoning process and a fine-grained reward mechanism. Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities across diverse domains, their application to specialized anomaly detection (AD) remains constrained by domain adaptation challenges. Existing Group Relative Policy Optimization (GRPO) based approaches suffer from two critical limitations: inadequate training data utilization when models produce uniform responses, and insufficient supervision over reasoning processes that encourage immediate binary decisions without deliberative analysis. We propose a comprehensive framework addressing these limitations through two synergistic innovations. First, we introduce a multi-stage deliberative reasoning process that guides models from region identification to focused examination, generating diverse response patterns essential for GRPO optimization while enabling structured supervision over analytical workflows. 
Second, we develop a fine-grained reward mechanism incorporating classification accuracy and localization supervision, transforming binary feedback into continuous signals that distinguish genuine analytical insight from spurious correctness. Comprehensive evaluation across multiple industrial datasets demonstrates substantial performance improvements in adapting general vision-language models to specialized anomaly detection. Our method achieves superior accuracy with efficient adaptation of existing annotations, effectively bridging the gap between general-purpose MLLM capabilities and the fine-grained visual discrimination required for detecting subtle manufacturing defects and structural irregularities.
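
The first limitation named in this abstract can be made concrete with the standard group-relative advantage used in GRPO, which normalizes each reward by the group mean and standard deviation. The sketch below shows that a group of uniform binary rewards yields zero advantages (no learning signal), while a continuous reward that also scores localization separates the same responses. The reward form `fine_grained_reward` (class correctness plus a weighted box IoU) and the `loc_weight` parameter are assumptions for illustration; the paper's actual reward is not specified in this digest.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fine_grained_reward(pred_label: str, true_label: str,
                        pred_box: Box, true_box: Box,
                        loc_weight: float = 0.5) -> float:
    """Assumed reward: binary class correctness plus weighted localization IoU."""
    cls = 1.0 if pred_label == true_label else 0.0
    return cls + loc_weight * box_iou(pred_box, true_box)


if __name__ == "__main__":
    # Four sampled responses that all answer "anomaly" correctly.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # all zeros
    truth = (1.0, 1.0, 3.0, 3.0)
    boxes = [(1.0, 1.0, 3.0, 3.0), (0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 6.0, 6.0)]
    rewards = [fine_grained_reward("anomaly", "anomaly", b, truth) for b in boxes]
    print(group_relative_advantages(rewards))  # non-zero, ordered by IoU
```

With binary feedback alone, the three responses in the second example would be indistinguishable; the localization term is what lets the group normalization rank them.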

[127] Uncertainty-Aware Spatial Color Correlation for Low-Light Image Enhancement

Jin Kuang,Dong Liu,Yukuang Zhang,Shengsheng Wang

Main category: cs.CV

TL;DR: U2CLLIE is a novel low-light image enhancement framework that addresses feature uncertainty, noise dominance, and gradient vanishing through an uncertainty-aware dual-domain denoise module and hierarchical causal correlation modeling, achieving superior performance.

Details

Motivation: Most existing low-light image enhancement methods focus on architectural innovations while neglecting intrinsic uncertainty in feature representations under extremely dark conditions, which leads to gradient degradation, noise dominance, and reduced model reliability. Method: U2CLLIE integrates an Uncertainty-Aware Dual-domain Denoise (UaD) Module and a hierarchical causality-aware framework with Neighborhood Correlation State Space (NeCo) and Adaptive Spatial-Color Calibration (AsC) modules to suppress noise, enhance features, and model causal correlations. Result: U2CLLIE achieves state-of-the-art performance on multiple benchmark datasets, showing robustness and strong generalization across various low-light scenarios. Conclusion: U2CLLIE provides a novel framework that effectively addresses the issues of uncertainty in feature representations, gradient vanishing, and noise dominance in low-light image enhancement, achieving state-of-the-art performance with robustness and generalization. Abstract: Most existing low-light image enhancement approaches primarily focus on architectural innovations, while often overlooking the intrinsic uncertainty within feature representations particularly under extremely dark conditions where degraded gradient and noise dominance severely impair model reliability and causal reasoning. To address these issues, we propose U2CLLIE, a novel framework that integrates uncertainty-aware enhancement and spatial-color causal correlation modeling. From the perspective of entropy-based uncertainty, our framework introduces two key components: (1) An Uncertainty-Aware Dual-domain Denoise (UaD) Module, which leverages Gaussian-Guided Adaptive Frequency Domain Feature Enhancement (G2AF) to suppress frequency-domain noise and optimize entropy-driven representations. 
This module enhances spatial texture extraction and frequency-domain noise suppression and structure refinement, effectively mitigating gradient vanishing and noise dominance. (2) A hierarchical causality-aware framework, where a Luminance Enhancement Network (LEN) first performs coarse brightness enhancement on dark regions. Then, during the encoder-decoder phase, two asymmetric causal correlation modeling modules, Neighborhood Correlation State Space (NeCo) and Adaptive Spatial-Color Calibration (AsC), collaboratively construct hierarchical causal constraints. These modules reconstruct and reinforce neighborhood structure and color consistency in the feature space. Extensive experiments demonstrate that U2CLLIE achieves state-of-the-art performance across multiple benchmark datasets, exhibiting robust performance and strong generalization across various scenes.

[128] Deeper Inside Deep ViT

Sungrae Hong

Main category: cs.CV

TL;DR: This study analyzes the practical utility of the ViT-22B vision model, stabilizes its training process through modifications, and proposes an image generation architecture based on ViT.

Details

Motivation: Understanding the practical utility of large-scale structures in vision models similar to LLMs, such as ViT-22B. Method: Training and analyzing ViT-22B in a local environment and proposing an image generation architecture based on ViT. Result: Training ViT-22B is unstable, but after model modifications, ViT-22B outperforms ViT. An image generation architecture based on ViT is proposed. Conclusion: ViT-22B performs better than ViT at the same parameter size and has potential in image generation tasks. Abstract: There have been attempts to create large-scale structures in vision models similar to LLMs, such as ViT-22B. While this research has provided numerous analyses and insights, our understanding of its practical utility remains incomplete. Therefore, we examine how this model structure behaves and trains in a local environment. We also highlight the instability in training and make some model modifications to stabilize it. The ViT-22B model, trained from scratch, overall outperformed ViT under the same parameter size. Additionally, we venture into the task of image generation, which has not been attempted with ViT-22B. We propose an image generation architecture using ViT and investigate which of ViT and ViT-22B is the more suitable structure for image generation.

[129] RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation

Fengyi Wu,Yimian Dai,Tianfang Zhang,Yixuan Ding,Jian Yang,Ming-Ming Cheng,Zhenming Peng

Main category: cs.CV

TL;DR: RPCANet++ is a deep learning-based framework that overcomes the limitations of traditional RPCA models, achieving superior performance in sparse object segmentation with enhanced interpretability.

Details

Motivation: Traditional RPCA models face challenges such as computational burdens, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. The authors aim to overcome these limitations by fusing the interpretability of RPCA with the efficiency of deep learning. Method: RPCANet++ unfolds a relaxed RPCA model into a structured network with three modules: Background Approximation Module (BAM), Object Extraction Module (OEM), and Image Restoration Module (IRM). A Memory-Augmented Module (MAM) and a Deep Contrast Prior Module (DCPM) are introduced to enhance background feature preservation and expedite object extraction. Result: Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios. The approach also improves interpretability through visual and numerical low-rankness and sparsity measurements. Conclusion: RPCANet++ successfully addresses the limitations of traditional RPCA models by integrating deep learning techniques, achieving state-of-the-art performance in sparse object segmentation while enhancing interpretability. Abstract: Robust principal component analysis (RPCA) decomposes an observation matrix into low-rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To solve these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). 
To mitigate inter-stage transmission loss in the BAM, we introduce a Memory-Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios. We further improve interpretability via visual and numerical low-rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Codes are available at our Project Webpage https://fengyiwu98.github.io/rpcanetx.
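
For readers unfamiliar with the decomposition this entry builds on, the sketch below shows classical RPCA (principal component pursuit) solved with a simple ADMM loop: an observation matrix is split into a low-rank part (background) and a sparse part (objects). This is the traditional optimization model that the abstract contrasts against, not the unfolded RPCANet++ network; the function names are illustrative, and the defaults for `lam` and `mu` follow common choices from the RPCA literature rather than anything stated in this digest.

```python
import numpy as np


def soft_threshold(x: np.ndarray, tau: float) -> np.ndarray:
    """Elementwise shrinkage: the proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def singular_value_threshold(m: np.ndarray, tau: float) -> np.ndarray:
    """Shrink singular values: the proximal operator of the nuclear norm."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vt


def rpca_admm(d: np.ndarray, n_iter: int = 200):
    """Split d into low_rank + sparse by alternating the two shrinkage steps."""
    d = np.asarray(d, dtype=float)
    lam = 1.0 / np.sqrt(max(d.shape))        # sparsity weight (common default)
    mu = d.size / (4.0 * np.abs(d).sum())    # penalty parameter (common default)
    low_rank = np.zeros_like(d)
    sparse = np.zeros_like(d)
    dual = np.zeros_like(d)
    for _ in range(n_iter):
        low_rank = singular_value_threshold(d - sparse + dual / mu, 1.0 / mu)
        sparse = soft_threshold(d - low_rank + dual / mu, lam / mu)
        dual = dual + mu * (d - low_rank - sparse)
    return low_rank, sparse


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    background = np.outer(rng.normal(size=30), rng.normal(size=30))  # rank 1
    objects = np.zeros((30, 30))
    objects[5, 7] = 10.0
    objects[20, 3] = -8.0
    low_rank, sparse = rpca_admm(background + objects)
    print(np.abs(background + objects - low_rank - sparse).max())
```

Each iteration needs a full SVD and depends on hand-set `lam` and `mu`, which illustrates the computational burden and hyperparameter sensitivity that the abstract attributes to traditional RPCA models.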

[130] From Learning to Unlearning: Biomedical Security Protection in Multimodal Large Language Models

Dunyuan Xu,Xikai Yang,Yaoqian Li,Jinpeng Li,Pheng-Ann Heng

Main category: cs.CV

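TL;DR: The paper introduces a new benchmark for evaluating machine unlearning in biomedical multimodal large language models and finds that current methods leave considerable room for improvement.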

Details

Motivation: The security of biomedical multimodal large language models (MLLMs) is drawing increasing attention, but training samples may contain hard-to-detect private information and incorrect knowledge, which can lead to privacy leakage or erroneous outputs after deployment. Method: The authors propose a benchmark, MLLMU-Med, built on a new data generation pipeline that integrates synthetic private data and factual errors into the training set, together with an Unlearning Efficiency Score that evaluates unlearning performance across different subsets. Result: Five unlearning methods evaluated on MLLMU-Med show limited effectiveness in removing harmful knowledge, indicating that further research is needed. Conclusion: The paper provides a new benchmark for evaluating unlearning in biomedical MLLMs and shows that existing methods remove harmful knowledge only to a limited degree, leaving significant room for improvement. Abstract: The security of biomedical Multimodal Large Language Models (MLLMs) has attracted increasing attention. However, training samples easily contain private information and incorrect knowledge that are difficult to detect, potentially leading to privacy leakage or erroneous outputs after deployment. An intuitive idea is to reprocess the training set to remove unwanted content and retrain the model from scratch. Yet, this is impractical due to significant computational costs, especially for large language models. Machine unlearning has emerged as a solution to this problem, which avoids complete retraining by selectively removing undesired knowledge derived from harmful samples while preserving required capabilities on normal cases. However, there exist no available datasets to evaluate the unlearning quality for security protection in biomedical MLLMs. To bridge this gap, we propose the first benchmark Multimodal Large Language Model Unlearning for BioMedicine (MLLMU-Med) built upon our novel data generation pipeline that effectively integrates synthetic private data and factual errors into the training set. Our benchmark targets two key scenarios: 1) Privacy protection, where patient private information is mistakenly included in the training set, causing models to unintentionally respond with private data during inference; and 2) Incorrectness removal, where wrong knowledge derived from unreliable sources is embedded into the dataset, leading to unsafe model responses. Moreover, we propose a novel Unlearning Efficiency Score that directly reflects the overall unlearning performance across different subsets. 
We evaluate five unlearning approaches on MLLMU-Med and find that these methods show limited effectiveness in removing harmful knowledge from biomedical MLLMs, indicating significant room for improvement. This work establishes a new pathway for further research in this promising field.

[131] Gather and Trace: Rethinking Video TextVQA from an Instance-oriented Perspective

Yan Zhang,Gangyan Zeng,Daiqing Wu,Huawen Shen,Binbin Li,Yu Zhou,Can Ma,Xiaojun Bi

Main category: cs.CV

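TL;DR: This paper proposes GAT, a model that takes an instance-oriented approach to improve the accuracy and efficiency of video text-based visual question answering.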

Details

Motivation: Existing frame-level frameworks suffer from redundant text entities and implicit relation modeling, which limits both accuracy and efficiency. Method: A context-aggregated instance gathering module integrates visual, layout, and textual information, and an instance-focused trajectory tracing module establishes spatio-temporal relationships between instances. Result: GAT surpasses previous state-of-the-art Video TextVQA methods by 3.86% in accuracy and runs inference ten times faster than video large language models. Conclusion: The paper proposes GAT, a new instance-oriented model for Video TextVQA, and validates its accuracy and inference speed on several public datasets. Abstract: Video text-based visual question answering (Video TextVQA) aims to answer questions by explicitly reading and reasoning about the text involved in a video. Most works in this field follow a frame-level framework which suffers from redundant text entities and implicit relation modeling, resulting in limitations in both accuracy and efficiency. In this paper, we rethink the Video TextVQA task from an instance-oriented perspective and propose a novel model termed GAT (Gather and Trace). First, to obtain an accurate reading result for each video text instance, a context-aggregated instance gathering module is designed to integrate the visual appearance, layout characteristics, and textual contents of the related entities into a unified textual representation. Then, to capture the dynamic evolution of text in the video flow, an instance-focused trajectory tracing module is utilized to establish spatio-temporal relationships between instances and infer the final answer. Extensive experiments on several public Video TextVQA datasets validate the effectiveness and generalization of our framework. GAT outperforms existing Video TextVQA methods, video-language pretraining methods, and video large language models in both accuracy and inference speed. Notably, GAT surpasses the previous state-of-the-art Video TextVQA methods by 3.86% in accuracy and achieves ten times faster inference than video large language models. The source code is available at https://github.com/zhangyan-ucas/GAT.

[132] Bootstrap Deep Spectral Clustering with Optimal Transport

Wengang Guo,Wei Ye,Chunchun Chen,Xin Sun,Christian Böhm,Claudia Plant,Susanto Rahardja

Main category: cs.CV

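TL;DR: This paper proposes BootSC, a deep spectral clustering model that jointly learns all stages of spectral clustering end to end with a single network. It uses optimal-transport-derived supervision and a semantically consistent orthogonal re-parameterization technique to address the disjoint optimization process and limited representation capacity of spectral clustering.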

Details

Motivation: Spectral clustering is a leading clustering method, but its two major shortcomings are its disjoint optimization process and its limited representation capacity. Method: The authors propose a deep spectral clustering model, BootSC, that uses a single network to jointly learn all stages of spectral clustering end to end: affinity matrix construction, spectral embedding, and k-means clustering. BootSC uses optimal-transport-derived supervision to bootstrap the affinity matrix and the cluster assignment matrix, and introduces a semantically consistent orthogonal re-parameterization technique to orthogonalize the spectral embeddings. Result: Experiments show that BootSC achieves state-of-the-art clustering performance. For example, on the challenging ImageNet-Dogs dataset it improves NMI by a notable 16% over the runner-up method. Conclusion: BootSC achieves state-of-the-art clustering performance, for example a 16% NMI improvement over the runner-up method on the challenging ImageNet-Dogs dataset. Code is available at https://github.com/spdj2271/BootSC. Abstract: Spectral clustering is a leading clustering method. Two of its major shortcomings are the disjoint optimization process and the limited representation capacity. To address these issues, we propose a deep spectral clustering model (named BootSC), which jointly learns all stages of spectral clustering (affinity matrix construction, spectral embedding, and k-means clustering) using a single network in an end-to-end manner. BootSC leverages effective and efficient optimal-transport-derived supervision to bootstrap the affinity matrix and the cluster assignment matrix. Moreover, a semantically-consistent orthogonal re-parameterization technique is introduced to orthogonalize spectral embeddings, significantly enhancing the discrimination capability. Experimental results indicate that BootSC achieves state-of-the-art clustering performance. For example, it accomplishes a notable 16% NMI improvement over the runner-up method on the challenging ImageNet-Dogs dataset. Our code is available at https://github.com/spdj2271/BootSC.
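
The three stages named in this entry (affinity matrix construction, spectral embedding, k-means) are run one after another in classical spectral clustering, which is the "disjoint optimization" the authors aim to replace. The sketch below shows that classical pipeline with NumPy, not BootSC itself; the Gaussian affinity, the `sigma` parameter, and the farthest-point initialization of k-means are illustrative assumptions.

```python
import numpy as np


def _kmeans(points: np.ndarray, k: int, n_iter: int = 50) -> np.ndarray:
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dist))])
    centers = np.stack(centers)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def spectral_clustering(x: np.ndarray, k: int, sigma: float = 1.0) -> np.ndarray:
    """Classical three-stage spectral clustering (each stage solved separately)."""
    x = np.asarray(x, dtype=float)
    # Stage 1: affinity matrix from a Gaussian kernel on pairwise distances.
    sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    w = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    # Stage 2: spectral embedding from the symmetric normalized Laplacian.
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
    laplacian = np.eye(len(x)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)        # eigenvalues in ascending order
    embedding = eigvecs[:, :k]
    embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    # Stage 3: k-means on the rows of the embedding.
    return _kmeans(embedding, k)


if __name__ == "__main__":
    blob = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
    data = np.vstack([blob, blob + 10.0])
    print(spectral_clustering(data, k=2))
```

Because the affinity matrix is fixed before the eigendecomposition and k-means only sees the finished embedding, an error in an early stage cannot be corrected later; an end-to-end model lets all three stages inform one another.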

[133] ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs

Ben Zhang,LuLu Yu,Lei Gao,Jing Liu,QuanJiang Guo,Hui Gao

Main category: cs.CV

TL;DR: ViFP enhances visual-language model reasoning reliability by detecting false positives and improving reasoning paths, achieving better accuracy and fewer logical errors across multiple datasets.

Details

Motivation: False positive reasoning in VLMs leads to correct answers with incorrect logic. Existing methods are costly and lack generalization, prompting the need for a more reliable and efficient solution. Method: ViFP constructs sub-question templates based on visual reasoning dimensions, uses multi-turn QA to build reasoning paths, dynamically identifies false positives, and applies adaptive CoT guidance. Result: ViFP improves accuracy by up to 5.4% on A-OKVQA, reduces false positives, and outperforms prior methods while introducing a new reliability metric, VoC. Conclusion: ViFP improves the reliability of visual-language models' reasoning by detecting false positives and enhancing reasoning paths, leading to better accuracy and fewer logical errors. Abstract: In visual-language model (VLM) reasoning, false positive (FP) reasoning occurs when a model generates a correct answer but follows an incorrect reasoning path. Existing methods rely on specific multi-step reasoning datasets and reinforcement learning strategies, leading to high training costs and limited generalization. In this work, we propose ViFP, a general framework for enhancing visual reasoning reliability. It improves both answer accuracy and reasoning soundness by detecting FPs. ViFP tackles the limitations of dataset dependency and poor generalization by constructing sub-question templates grounded in the core dimensions of visual reasoning, such as object localization, characteristic description, and object discovery. ViFP then builds effective reasoning paths via multi-turn QA to improve reasoning accuracy. Meanwhile, ViFP dynamically analyzes the consistency of the reasoning path to identify potential FPs, and introduces a targeted chain-of-thought (CoT) mechanism that adaptively guides both FP and non-FP samples, thereby reducing logical errors in the reasoning path while preserving accuracy. 
Finally, we introduce a reliability evaluation metric, VoC, which integrates answer accuracy and the FP rate, providing a quantitative tool to assess whether a VLM not only answers correctly, but also reasons reliably. Our experiments on closed-source VLMs show that ViFP consistently improves performance across three datasets: A-OKVQA, OKVQA, and FVQA. On A-OKVQA, ViFP improves accuracy by up to 5.4%, surpassing the previous state-of-the-art by 4.3%, and significantly reduces the number of FPs, validating its benefits in enhancing reasoning reliability.

[134] Small Lesions-aware Bidirectional Multimodal Multiscale Fusion Network for Lung Disease Classification

Jianxun Yu,Ruiquan Ge,Zhipeng Wang,Cheng Yang,Chenyu Lin,Xianjun Fu,Jikui Liu,Ahmed Elazab,Changmiao Wang

Main category: cs.CV

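TL;DR: This paper proposes a new Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net) for medical disease diagnosis that significantly improves diagnostic accuracy.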

Details

Motivation: Medical disease diagnosis faces the challenge of misdiagnosed small lesions, and multimodal deep learning methods struggle with the dimensionality gap between medical imaging and electronic health record data. Method: The authors propose MMCAF-Net, which combines a feature pyramid structure with a 3D multi-scale convolutional attention module and adds a multi-scale cross-attention module. Result: On the Lung-PET-CT-Dx dataset, diagnostic accuracy is significantly better than that of current state-of-the-art methods. Conclusion: MMCAF-Net achieves more effective multimodal data integration and feature fusion, improving the accuracy of medical disease diagnosis. Abstract: The diagnosis of medical diseases faces challenges such as the misdiagnosis of small lesions. Deep learning, particularly multimodal approaches, has shown great potential in the field of medical disease diagnosis. However, the differences in dimensionality between medical imaging and electronic health record data present challenges for effective alignment and fusion. To address these issues, we propose the Multimodal Multiscale Cross-Attention Fusion Network (MMCAF-Net). This model employs a feature pyramid structure combined with an efficient 3D multi-scale convolutional attention module to extract lesion-specific features from 3D medical images. To further enhance multimodal data integration, MMCAF-Net incorporates a multi-scale cross-attention module, which resolves dimensional inconsistencies, enabling more effective feature fusion. We evaluated MMCAF-Net on the Lung-PET-CT-Dx dataset, and the results showed a significant improvement in diagnostic accuracy, surpassing current state-of-the-art methods. The code is available at https://github.com/yjx1234/MMCAF-Net

[135] What Holds Back Open-Vocabulary Segmentation?

Josip Šarić,Ivan Martinović,Matej Kristan,Siniša Šegvić

Main category: cs.CV

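TL;DR: This paper designs "oracle" components to analyze the bottlenecks of open-vocabulary segmentation models and points to promising directions for improving their performance.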

Details

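Motivation: Standard segmentation setups cannot recognize concepts outside the training taxonomy, and open-vocabulary approaches have not delivered on their promise because of performance bottlenecks, so these bottlenecks need to be analyzed in depth and addressed. Method: The authors design new "oracle" components that use groundtruth information and analyze the performance bottlenecks of open-vocabulary models through validation experiments. Result: The experiments reveal the key bottlenecks of open-vocabulary models and suggest approaches for improving their performance, indicating directions for future research. Conclusion: The paper proposes new "oracle" components that identify and decouple the bottlenecks in open-vocabulary segmentation models, yielding important empirical findings and promising directions for future research. Abstract: Standard segmentation setups are unable to deliver models that can recognize concepts outside the training taxonomy. Open-vocabulary approaches promise to close this gap through language-image pretraining on billions of image-caption pairs. Unfortunately, we observe that the promise is not delivered due to several bottlenecks that have caused the performance to plateau for almost two years. This paper proposes novel oracle components that identify and decouple these bottlenecks by taking advantage of the groundtruth information. The presented validation experiments deliver important empirical findings that provide a deeper insight into the failures of open-vocabulary models and suggest prominent approaches to unlock future research.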

[136] SplitGaussian: Reconstructing Dynamic Scenes via Visual Geometry Decomposition

Jiahui Li,Shengeng Tang,Jingxuan He,Gang Huang,Zhangye Wang,Yantao Pan,Lechao Cheng

Main category: cs.CV

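TL;DR: SplitGaussian decomposes the scene representation into static and dynamic components, addressing motion leakage, geometric distortion, and temporal flickering in dynamic scene reconstruction and improving reconstruction quality and motion separation.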

Details

Motivation: Existing dynamic scene reconstruction methods based on Gaussian Splatting entangle static and dynamic elements in a shared representation, leading to motion leakage, geometric distortion, and temporal flickering. This motivates a more effective approach. Method: The authors propose SplitGaussian, a framework that explicitly decomposes the scene representation into static and dynamic components. By decoupling motion modeling from background geometry and allowing only the dynamic branch to deform over time, it prevents motion artifacts in static regions while supporting view- and time-dependent appearance refinement. Result: Experiments show that SplitGaussian outperforms prior state-of-the-art methods in rendering quality, geometric stability, and motion separation. Conclusion: By decoupling static and dynamic components, SplitGaussian effectively addresses motion leakage and geometric distortion in dynamic scene reconstruction, improving reconstruction quality and accelerating convergence. Abstract: Reconstructing dynamic 3D scenes from monocular video remains fundamentally challenging due to the need to jointly infer motion, structure, and appearance from limited observations. Existing dynamic scene reconstruction methods based on Gaussian Splatting often entangle static and dynamic elements in a shared representation, leading to motion leakage, geometric distortions, and temporal flickering. We identify that the root cause lies in the coupled modeling of geometry and appearance across time, which hampers both stability and interpretability. To address this, we propose SplitGaussian, a novel framework that explicitly decomposes scene representations into static and dynamic components. By decoupling motion modeling from background geometry and allowing only the dynamic branch to deform over time, our method prevents motion artifacts in static regions while supporting view- and time-dependent appearance refinement. This disentangled design not only enhances temporal consistency and reconstruction fidelity but also accelerates convergence. Extensive experiments demonstrate that SplitGaussian outperforms prior state-of-the-art methods in rendering quality, geometric stability, and motion separation.

[137] Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

Yuyang Liu,Qiuhe Hong,Linlan Huang,Alexandra Gomez-Villa,Dipam Goswami,Xialei Liu,Joost van de Weijer,Yonghong Tian

Main category: cs.CV

TL;DR: This survey reviews continual learning methods for vision-language models, identifies failure modes, proposes solutions, and emphasizes the need for better benchmarks and future research directions.

Details

Motivation: The motivation for this research is the challenge of enabling vision-language models to learn continually from non-stationary data, as their cross-modal alignment and generalization capabilities are vulnerable to catastrophic forgetting. Method: The authors conducted a systematic review of continual learning methods for vision-language models. They identified failure modes and proposed solutions, including multi-modal replay strategies, cross-modal regularization, and parameter-efficient adaptation. Result: The survey identifies three core failure modes in VLM-CL and proposes a challenge-driven taxonomy mapping solutions to target problems. It also highlights the need for better benchmarks and outlines open problems and future directions. Conclusion: The survey concludes that continual learning for vision-language models (VLM-CL) is a significant challenge with the need for better benchmarks, continual pre-training, and compositional zero-shot learning. The authors propose a challenge-driven taxonomy and provide resources for researchers. Abstract: Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. However, enabling them to learn continually from non-stationary data remains a major challenge, as their cross-modal alignment and generalization capabilities are particularly vulnerable to catastrophic forgetting. Unlike traditional unimodal continual learning (CL), VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey offers the first focused and systematic review of continual learning for VLMs (VLM-CL). We begin by identifying the three core failure modes that degrade performance in VLM-CL. 
Based on these, we propose a challenge-driven taxonomy that maps solutions to their target problems: (1) Multi-Modal Replay Strategies address cross-modal drift through explicit or implicit memory mechanisms; (2) Cross-Modal Regularization preserves modality alignment during updates; and (3) Parameter-Efficient Adaptation mitigates parameter interference with modular or low-rank updates. We further analyze current evaluation protocols, datasets, and metrics, highlighting the need for better benchmarks that capture VLM-specific forgetting and compositional generalization. Finally, we outline open problems and future directions, including continual pre-training and compositional zero-shot learning. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems. All resources are available at: https://github.com/YuyangSunshine/Awesome-Continual-learning-of-Vision-Language-Models.

[138] LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation

Kangrui Cen,Baixuan Zhao,Yi Xin,Siqi Luo,Guangtao Zhai,Xiaohong Liu

Main category: cs.CV

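TL;DR: LayerT2V is a new video generation method that composites background and foreground objects layer by layer, effectively resolving trajectory conflicts in multi-object motion scenes and improving generation quality.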

Details

Motivation: Controlling object motion trajectories in text-to-video generation, especially in scenes with multiple moving objects, is a challenging and relatively under-explored area. Current models and datasets mainly target single-object motion, and existing methods either lack support for multi-object motion scenes or degrade severely when object trajectories intersect. Method: The authors introduce LayerT2V, a layered video generation approach that places video elements on distinct "layers" to resolve conflicts between multi-object motion trajectories. Result: Experiments show that LayerT2V outperforms existing methods in generating complex multi-object scenes, with 1.4x and 4.5x improvements in mIoU and AP50, respectively. Conclusion: LayerT2V generates video by compositing background and foreground objects layer by layer, enabling flexible integration of, and better control over, multi-object motion scenes. Abstract: Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct "layer" and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at https://kr-panghu.github.io/LayerT2V/ .

[139] Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction

Yu Liu,Zhijie Liu,Xiao Ren,You-Fu Li,He Kong

Main category: cs.CV

TL;DR: This paper proposes a diffusion-based trajectory prediction model that incorporates pedestrian motion intentions, enhancing interpretability and accuracy for autonomous vehicle applications.

Details

Motivation: Accurately forecasting pedestrian trajectories is crucial for autonomous vehicles, but existing diffusion-based models lack explicit incorporation of motion intentions, limiting interpretability and precision. Method: The method introduces a pedestrian intention recognition module that decomposes motion intentions into lateral and longitudinal components, integrated within a diffusion-based prediction framework with an efficient guidance mechanism. Result: The model was evaluated on the ETH and UCY benchmarks and showed competitive performance compared to state-of-the-art methods in generating interpretable and precise trajectory predictions. Conclusion: The proposed diffusion-based multimodal trajectory prediction model effectively incorporates pedestrian motion intentions, achieving competitive performance on trajectory prediction benchmarks. Abstract: Predicting pedestrian motion trajectories is critical for path planning and motion control of autonomous vehicles. However, accurately forecasting crowd trajectories remains a challenging task due to the inherently multimodal and uncertain nature of human motion. Recent diffusion-based models have shown promising results in capturing the stochasticity of pedestrian behavior for trajectory prediction. However, few diffusion-based approaches explicitly incorporate the underlying motion intentions of pedestrians, which can limit the interpretability and precision of prediction models. In this work, we propose a diffusion-based multimodal trajectory prediction model that incorporates pedestrians' motion intentions into the prediction framework. The motion intentions are decomposed into lateral and longitudinal components, and a pedestrian intention recognition module is introduced to enable the model to effectively capture these intentions. Furthermore, we adopt an efficient guidance mechanism that facilitates the generation of interpretable trajectories. 
The proposed framework is evaluated on two widely used human trajectory prediction benchmarks, ETH and UCY, on which it is compared against state-of-the-art methods. The experimental results demonstrate that our method achieves competitive performance.

[140] FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding

Emmanuelle Bourigault,Pauline Bourigault

Main category: cs.CV

TL;DR: FrEVL explores the use of frozen embeddings for efficient vision-language understanding, achieving strong performance with significantly lower computational costs, making it suitable for resource-constrained deployments.

Details

Motivation: Vision-language models typically require substantial computational resources, limiting their deployment. This work investigates whether frozen embeddings can provide an efficient alternative while maintaining performance. Method: The study analyzes the effectiveness of frozen pretrained embeddings in vision-language tasks by evaluating their performance on standard benchmarks with limited trainable parameters. It also compares computational efficiency in terms of speed and energy consumption. Result: Frozen embeddings achieved 85% to 95% of state-of-the-art performance with only 68.4M trainable parameters. FrEVL demonstrated a 2.3× speedup and 52% lower energy consumption, showing its potential for efficient deployment. Conclusion: FrEVL is a framework that uses frozen pretrained embeddings to achieve efficient vision-language understanding, offering significant computational advantages while maintaining strong performance on standard benchmarks. It is suited to deployment scenarios where computational constraints are critical. Abstract: The deployment of vision-language models remains constrained by substantial computational requirements. We present FrEVL, a framework exploring whether frozen pretrained embeddings can support effective vision-language understanding. Our analysis reveals that frozen embeddings contain rich information for discriminative tasks, achieving 85% to 95% of state-of-the-art performance on standard benchmarks with only 68.4M trainable parameters. This performance dichotomy reveals a critical insight: frozen embedding effectiveness depends on alignment between pretraining objectives and downstream task requirements. When accounting for end-to-end computation including embedding extraction, FrEVL provides a 2.3× speedup with 52% lower energy consumption, making it suitable for scenarios with pre-computable inputs or when deployment constraints outweigh marginal performance gains.
Our evaluation provides practitioners with guidance on when frozen embedding approaches represent viable alternatives to full model deployment. We will release our complete implementation and evaluation framework to facilitate further research into efficient multi-modal understanding.

[141] DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification

Saifullah Saifullah,Stefan Agne,Andreas Dengel,Sheraz Ahmed

Main category: cs.CV

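TL;DR: This paper proposes DocVCE, a new method for generating document counterfactual explanations that offer meaningful insight into the decision-making of document image classification models.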

Details

Motivation: As black-box AI-driven decision-making systems become widespread in modern document processing workflows, improving their transparency and reliability has become critical, especially in high-stakes applications. Method: DocVCE combines latent diffusion models with classifier guidance to generate visual counterfactual explanations, then applies hierarchical patch-wise refinement to search for the counterfactual closest to the target factual image. Result: Qualitative and quantitative evaluations show that DocVCE is effective at generating explanations for document image classification models. Conclusion: DocVCE is a new method that provides meaningful insight into model decisions by generating document counterfactuals; its effectiveness is demonstrated on three document classification datasets and three models. Abstract: As black-box AI-driven decision-making systems become increasingly widespread in modern document processing workflows, improving their transparency and reliability has become critical, especially in high-stakes applications where biases or spurious correlations in decision-making could lead to serious consequences. One vital component often found in such document processing workflows is document image classification, which, despite its widespread use, remains difficult to explain. While some recent works have attempted to explain the decisions of document image classification models through feature-importance maps, these maps are often difficult to interpret and fail to provide insights into the global features learned by the model. In this paper, we aim to bridge this research gap by introducing generative document counterfactuals that provide meaningful insights into the model's decision-making through actionable explanations. In particular, we propose DocVCE, a novel approach that leverages latent diffusion models in combination with classifier guidance to first generate plausible in-distribution visual counterfactual explanations, and then performs hierarchical patch-wise refinement to search for a refined counterfactual that is closest to the target factual image. We demonstrate the effectiveness of our approach through a rigorous qualitative and quantitative assessment on 3 different document classification datasets -- RVL-CDIP, Tobacco3482, and DocLayNet -- and 3 different models -- ResNet, ConvNeXt, and DiT -- using well-established evaluation criteria such as validity, closeness, and realism.
To the best of the authors' knowledge, this is the first work to explore generative counterfactual explanations in document image analysis.

[142] Analyzing and Mitigating Object Hallucination: A Training Bias Perspective

Yifan Li,Kun Zhou,Wayne Xin Zhao,Lei Fang,Ji-Rong Wen

Main category: cs.CV

TL;DR: Obliviate addresses hallucination in LVLMs by efficiently unlearning training bias through targeted updates to the language modeling head.

Details

Motivation: LVLMs suffer from hallucination issues, especially on counterfactual images, due to training bias, which motivates the need for a method to mitigate object hallucination. Method: Obliviate identifies discrepancies between ground-truth labels and model outputs as a proxy for bias and updates only the language modeling head through a parameter- and data-efficient fine-tuning strategy. Result: Obliviate significantly reduces hallucination across both discriminative and generative tasks by updating only approximately 2% of the parameters, showing scalability and generalization beyond object-level hallucinations. Conclusion: Obliviate is an efficient and lightweight unlearning method that effectively reduces hallucination in LVLMs by targeting training bias in the language modeling head. Abstract: Although scaling up training data has significantly improved the general multimodal capabilities of Large Vision-Language Models (LVLMs), they still suffer from the hallucination issue, generating text that is inconsistent with the visual input. This phenomenon motivates us to systematically investigate the role of training data in hallucination. We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked. Through comprehensive evaluation on POPEv2, we find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training. Specifically, they perform poorly on counterfactual images, often incorrectly answering "Yes" to questions about masked objects. To understand this issue, we conduct probing experiments on the models' internal components, revealing that this training bias is primarily located in the language modeling (LM) head. Based on these findings, we propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning.
Obliviate identifies the discrepancy between ground-truth labels and model outputs on the training data as a proxy for bias and adopts a parameter- and data-efficient fine-tuning strategy that only updates the LM head. Extensive experiments demonstrate the effectiveness of our approach. While only reusing the training data and updating approximately 2% of the parameters, Obliviate significantly reduces hallucination across both discriminative and generative tasks. Furthermore, it demonstrates strong scalability with respect to both model size (2B to 72B) and training data volume, and exhibits promising generalization to hallucination types beyond object-level hallucination. Our code and data will be publicly released.

[143] A machine learning approach for image classification in synthetic aperture RADAR

Romina Gaburro,Patrick Healy,Shraddha Naidu,Clifford Nolan

Main category: cs.CV

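TL;DR: This paper studies the use of convolutional neural networks (CNNs) to classify objects and identify ice types from synthetic aperture radar (SAR) data. Experiments show high classification accuracy (≥75%), and the paper examines how SAR data acquisition parameters affect classification performance.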

Details

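Motivation: This study explores the potential of combining synthetic aperture radar (SAR) data with convolutional neural networks (CNNs) to identify and classify objects on the ground, and evaluates how different radar data acquisition conditions affect classification. Method: A single scattering approximation is adopted to classify object shape, comparing experiments on simulated SAR data against experiments on images reconstructed from that data; real SAR imagery from the Sentinel-1 satellite is used to identify ice types. Result: Both experiments reach high classification accuracy (≥75%), indicating that CNNs perform well on SAR data classification tasks. Conclusion: The results demonstrate the effectiveness of CNNs in using SAR data for geometric and environmental classification tasks, and the study also explores the effect of different antenna heights on classification performance. Abstract: We consider the problem in Synthetic Aperture RADAR (SAR) of identifying and classifying objects located on the ground by means of Convolutional Neural Networks (CNNs). Specifically, we adopt a single scattering approximation to classify the shape of the object using both simulated SAR data and reconstructed images from this data, and we compare the success of these approaches. We then identify ice types in real SAR imagery from the satellite Sentinel-1. In both experiments we achieve a promisingly high classification accuracy (≥75%). Our results demonstrate the effectiveness of CNNs in using SAR data for both geometric and environmental classification tasks. Our investigation also explores the effect of SAR data acquisition at different antenna heights on our ability to classify objects successfully.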

[144] PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction

Muhua Zhu,Xinhao Jin,Chengbo Wang,Yongcong Zhang,Yifei Xue,Tie Ji,Yizhen Lao

Main category: cs.CV

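TL;DR: This paper proposes PIS3R, an image stitching solution based on deep 3D reconstruction that effectively handles the stitching difficulties caused by large parallax.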

Details

Motivation: Depth variations in a 3D scene combined with a significant camera baseline produce noticeable parallax, which most existing stitching methods struggle to handle effectively. Method: First, a visual geometry grounded transformer is applied to two input images with large parallax to obtain intrinsic and extrinsic parameters along with a dense 3D scene reconstruction. Next, the reconstructed dense point cloud is reprojected onto a designated reference view using the recovered camera parameters, achieving pixel-wise alignment and producing an initial stitched image. Finally, a point-conditioned image diffusion module refines holes or noise that may remain in the initial stitching. Result: Experiments show that the algorithm provides accurate stitching results for images with very large parallax and outperforms existing methods in both qualitative and quantitative evaluations. Conclusion: PIS3R is an image stitching solution that is highly robust to large parallax while fully preserving the geometric integrity of all pixels in the 3D photogrammetric context, making it directly applicable to downstream 3D vision tasks such as SfM. Abstract: Image stitching aims to align two images taken from different viewpoints into one seamless, wider image. However, when the 3D scene contains depth variations and the camera baseline is significant, noticeable parallax occurs, meaning the relative positions of scene elements differ substantially between views. Most existing stitching methods struggle to handle such images with large parallax effectively. To address this challenge, in this paper, we propose an image stitching solution called PIS3R that is robust to very large parallax based on the novel concept of deep 3D reconstruction. First, we apply a visual geometry grounded transformer to two input images with very large parallax to obtain both intrinsic and extrinsic parameters, as well as the dense 3D scene reconstruction. Subsequently, we reproject the reconstructed dense point cloud onto a designated reference view using the recovered camera parameters, achieving pixel-wise alignment and generating an initial stitched image. Finally, to further address potential artifacts such as holes or noise in the initial stitching, we propose a point-conditioned image diffusion module to obtain the refined result. Compared with existing methods, our solution tolerates very large parallax and also provides results that fully preserve the geometric integrity of all pixels in the 3D photogrammetric context, enabling direct applicability to downstream 3D vision tasks such as SfM.
Experimental results demonstrate that the proposed algorithm provides accurate stitching results for images with very large parallax, and outperforms the existing methods qualitatively and quantitatively.

Giuseppe Chindemi,Camilla Bellone,Benoit Girard

Main category: cs.CV

TL;DR: This paper discusses how AI and computational methods are transforming rodent social behavior research, offering deeper insights while also introducing new challenges.

Details

Motivation: To provide a comprehensive overview of modern approaches to studying rodent social behavior, highlighting how AI and machine learning overcome the limitations of traditional observational methods. Method: The paper reviews and evaluates current methods and tools used in rodent social behavior research, particularly focusing on the integration of AI and computational techniques. Result: The study identifies the advantages and limitations of modern AI-based tools in analyzing rodent social behavior and proposes practical solutions to address challenges in their implementation. Conclusion: The paper concludes that while integrating AI into rodent social behavior research has limitations and challenges, it also offers significant benefits and deeper insights, making it a promising area for future exploration and discussion. Abstract: The study of rodent social behavior has shifted in recent years from relying on direct human observation to more nuanced approaches integrating computational methods in artificial intelligence (AI) and machine learning. While conventional approaches introduce bias and can fail to capture the complexity of rodent social interactions, modern approaches bridging computer vision, ethology and neuroscience provide more multifaceted insights into behavior which are particularly relevant to social neuroscience. Despite these benefits, the integration of AI into social behavior research also poses several challenges. Here we discuss the main steps involved and the tools available for analyzing rodent social behavior, examining their advantages and limitations. Additionally, we suggest practical solutions to address common hurdles, aiming to guide young researchers in adopting these methods and to stimulate further discussion among experts regarding the evolving requirements of these tools in scientific applications.

[146] Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark

Xiao Wang,Ziwen Wang,Wentao Wu,Anjie Wang,Jiashu Wu,Yantao Pan,Chenglong Li

Main category: cs.CV

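TL;DR: This paper proposes SAV, a new framework for vehicle part segmentation, and builds VehicleSeg10K, a large-scale benchmark dataset, laying a foundation for future research and comparison.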

Details

Motivation: Existing large segmentation models such as SAM cannot be applied directly to fine-grained vehicle part segmentation, because SAM's text-prompted segmentation functionality is not publicly accessible and the mask regions it generates lack semantic labels. Method: The SAV framework comprises three core components: a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context sample retrieval encoding module. Result: The authors run comprehensive experiments on multiple datasets to establish solid baselines, and release VehicleSeg10K, a large-scale dataset containing 11,665 high-quality pixel-level annotations. Conclusion: This paper proposes SAV, a new framework for vehicle part segmentation, and builds VehicleSeg10K, a large-scale benchmark dataset, laying a foundation for future research and comparison. Abstract: With the rapid advancement of autonomous driving, vehicle perception, particularly detection and segmentation, has placed increasingly higher demands on algorithmic performance. Pre-trained large segmentation models, especially Segment Anything Model (SAM), have sparked significant interest and inspired new research directions in artificial intelligence. However, SAM cannot be directly applied to the fine-grained task of vehicle part segmentation, as its text-prompted segmentation functionality is not publicly accessible, and the mask regions generated by its default mode lack semantic labels, limiting its utility in structured, category-specific segmentation tasks. To address these limitations, we propose SAV, a novel framework comprising three core components: a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context sample retrieval encoding module. The knowledge graph explicitly models the spatial and geometric relationships among vehicle parts through a structured ontology, effectively encoding prior structural knowledge. Meanwhile, the context retrieval module enhances segmentation by identifying and leveraging visually similar vehicle instances from training data, providing rich contextual priors for improved generalization. Furthermore, we introduce a new large-scale benchmark dataset for vehicle part segmentation, named VehicleSeg10K, which contains 11,665 high-quality pixel-level annotations across diverse scenes and viewpoints. We conduct comprehensive experiments on this dataset and two other datasets, benchmarking multiple representative baselines to establish a solid foundation for future research and comparison.
Both the dataset and source code of this paper will be released on https://github.com/Event-AHU/SAV

[147] Revisiting Continual Semantic Segmentation with Pre-trained Vision Models

Duzhen Zhang,Yong Ren,Wei Cong,Junhao Zheng,Qiaoyi Su,Shuncheng Jia,Zhong-Zhi Li,Xuanle Zhao,Ye Bai,Feilong Chen,Qi Tian,Tielin Zhang

Main category: cs.CV

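TL;DR: This paper studies continual semantic segmentation (CSS) and challenges the conventional assumption that direct fine-tuning (DFT) readily forgets old knowledge. It proposes an improved DFT method (DFT*) that, through strategies such as freezing the pre-trained vision model (PVM) and existing classifiers, markedly improves performance while reducing trainable parameters and training time.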

Details

Motivation: Conventional methods hold that DFT performs poorly in CSS because it readily forgets old knowledge. The authors argue that this assumption is wrong, re-evaluate forgetting in DFT with PVMs, and propose a more efficient training strategy. Method: The authors systematically evaluate DFT under different CSS settings on two standard datasets (Pascal VOC 2012 and ADE20K), analyze the forgetting behavior of PVMs, and propose DFT*, which freezes the PVM, retains old classifiers, and pre-allocates new classifiers. Result: Experiments show that PVMs are strongly resistant to forgetting under DFT, and that forgetting stems mainly from classifier drift rather than degradation of the feature representations; DFT* achieves competitive or superior performance compared with 16 SOTA methods while using fewer parameters and training faster. Conclusion: This paper challenges the conventional view that DFT performs poorly in CSS and proposes DFT*, which achieves strong performance while keeping the model simple, suggesting that complex forgetting-mitigation methods may be unnecessary. Abstract: Continual Semantic Segmentation (CSS) seeks to incrementally learn to segment novel classes while preserving knowledge of previously encountered ones. Recent advancements in CSS have been largely driven by the adoption of Pre-trained Vision Models (PVMs) as backbones. Among existing strategies, Direct Fine-Tuning (DFT), which sequentially fine-tunes the model across classes, remains the most straightforward approach. Prior work often regards DFT as a performance lower bound due to its presumed vulnerability to severe catastrophic forgetting, leading to the development of numerous complex mitigation techniques. However, we contend that this prevailing assumption is flawed. In this paper, we systematically revisit forgetting in DFT across two standard benchmarks, Pascal VOC 2012 and ADE20K, under eight CSS settings using two representative PVM backbones: ResNet101 and Swin-B. Through a detailed probing analysis, our findings reveal that existing methods significantly underestimate the inherent anti-forgetting capabilities of PVMs. Even under DFT, PVMs retain previously learned knowledge with minimal forgetting. Further investigation of the feature space indicates that the observed forgetting primarily arises from the classifier's drift away from the PVM, rather than from degradation of the backbone representations. Based on this insight, we propose DFT*, a simple yet effective enhancement to DFT that incorporates strategies such as freezing the PVM backbone and previously learned classifiers, as well as pre-allocating future classifiers.
Extensive experiments show that DFT* consistently achieves competitive or superior performance compared to sixteen state-of-the-art CSS methods, while requiring substantially fewer trainable parameters and less training time.
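
The freezing strategy named in the abstract can be pictured as a per-step table of which parameter groups receive gradient updates. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `ParamGroup` and `build_step_plan` are hypothetical, and the choice to train only the current step's classifier (leaving pre-allocated future classifiers untouched) is an assumption that the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParamGroup:
    name: str
    trainable: bool


def build_step_plan(num_steps: int, step: int) -> List[ParamGroup]:
    """Return trainable flags for continual step `step` (0-indexed).

    Assumptions: the backbone is always frozen, classifiers for every step
    are allocated up front, and only the current step's classifier trains.
    """
    plan = [ParamGroup("backbone", trainable=False)]  # frozen PVM backbone
    for k in range(num_steps):  # classifiers pre-allocated for all steps
        plan.append(ParamGroup(f"classifier_{k}", trainable=(k == step)))
    return plan


if __name__ == "__main__":
    for group in build_step_plan(num_steps=3, step=1):
        print(group.name, "trainable" if group.trainable else "frozen")
```

In a deep learning framework, the same table would translate into disabling gradients for every group marked frozen before each continual step.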

[148] PKSS-Align: Robust Point Cloud Registration on Pre-Kendall Shape Space

Chenlei Lv,Hui Huang

Main category: cs.CV

TL;DR: This paper proposes PKSS-Align, a robust and efficient method for point cloud registration that handles various challenges like similarity transformations, noise, and defective parts by leveraging shape feature-based similarity in the Pre-Kendall shape space.

Details

Motivation: Point cloud registration is sensitive to similarity transformations, noise, and incomplete or defective geometric structures, which increase the risk of converging to local optima. A robust method is needed to handle these challenges effectively. Method: PKSS-Align measures shape feature-based similarity between point clouds using the Pre-Kendall shape space (PKSS), which functions as a manifold metric robust to various Euclidean representations. This enables direct generation of the transformation matrix. The method also incorporates a simple parallel acceleration for improved efficiency. Result: Experiments show that PKSS-Align outperforms state-of-the-art methods in terms of robustness and efficiency, particularly in handling non-uniform scales, noisy points, and defective parts without requiring point-to-point or point-to-plane metrics. Conclusion: The proposed PKSS-Align method is a robust and efficient solution for point cloud registration, capable of handling various challenges such as similarity transformations, non-uniform densities, noise, and defective parts without requiring training or complex feature encoding. Abstract: Point cloud registration is a classical topic in the field of 3D Vision and Computer Graphics. Registration is typically sensitive to similarity transformations (translation, scaling, and rotation), noisy points, and incomplete geometric structures. In particular, non-uniform scales and defective parts of point clouds increase the probability of getting stuck in local optima during registration. In this paper, we propose a robust point cloud registration method, PKSS-Align, that can handle various influences, including similarity transformations, non-uniform densities, random noisy points, and defective parts.
The proposed method measures shape feature-based similarity between point clouds on the Pre-Kendall shape space (PKSS), which is a shape measurement-based scheme and does not require a point-to-point or point-to-plane metric. The employed measurement can be regarded as a manifold metric that is robust to various representations in the Euclidean coordinate system. Thanks to this measurement, the transformation matrix can be generated directly for point clouds affected by all of the mentioned influences at once. The proposed method does not require data training or complex feature encoding. Based on a simple parallel acceleration, it can achieve a significant improvement in efficiency and feasibility in practice. Experiments demonstrate that our method outperforms the relevant state-of-the-art methods.

[149] MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction

Yaopeng Lou,Liao Shen,Tianqi Liu,Jiaqi Li,Zihao Huang,Huiqiang Sun,Zhiguo Cao

Main category: cs.CV

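TL;DR: MuRF is a novel feed-forward view synthesis method that combines MVS and MDE features and, through deep depth fusion and a reference-view loss, achieves efficient, high-quality rendering across multiple baseline settings.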

Details

Motivation: A general feed-forward method is needed to handle novel view synthesis effectively across different baseline settings, especially with sparse input views. Method: The method proposes a projection-and-sampling mechanism for deep depth fusion, built on 3D Gaussian representations, and introduces a reference-view loss to improve geometry and optimization efficiency. Result: MuRF achieves state-of-the-art results on datasets ranging from DTU to RealEstate10K and shows promising zero-shot performance on LLFF and Mip-NeRF 360. Conclusion: By combining MVS and MDE features, a deep depth fusion mechanism, and a reference-view loss, MuRF achieves state-of-the-art performance across multiple baseline settings and diverse scenes, while improving training and inference efficiency and rendering quality. Abstract: We present Multi-Baseline Gaussian Splatting (MuRF), a generalized feed-forward approach for novel view synthesis that effectively handles diverse baseline settings, including sparse input views with both small and large baselines. Specifically, we integrate features from Multi-View Stereo (MVS) and Monocular Depth Estimation (MDE) to enhance feature representations for generalizable reconstruction. Next, we propose a projection-and-sampling mechanism for deep depth fusion, which constructs a fine probability volume to guide the regression of the feature map. Furthermore, we introduce a reference-view loss to improve geometry and optimization efficiency. We leverage 3D Gaussian representations to accelerate training and inference while enhancing rendering quality. MuRF achieves state-of-the-art performance across multiple baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K). We also demonstrate promising zero-shot performance on the LLFF and Mip-NeRF 360 datasets.

[150] Length Matters: Length-Aware Transformer for Temporal Sentence Grounding

Yifan Wang,Ziyi Liu,Xiaolong Sun,Jiawei Wang,Hongmin Liu

Main category: cs.CV

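TL;DR: This paper proposes the Length-Aware Transformer (LATR) for temporal sentence grounding (TSG), which uses length priors from video-description pairs so that each query handles predictions of a different temporal length, improving task performance.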

Details

Motivation: Existing DETR-based models for TSG lack explicit supervision, so queries overlap in role and produce redundant predictions. The authors propose a method to address this. Method: All queries are divided into three groups that handle predictions for short, middle, and long segments respectively, and an additional length classification task is introduced during training to suppress mismatched predictions. Result: Experiments show that LATR achieves state-of-the-art performance on three public benchmarks, and ablation studies confirm the effectiveness of each component. Conclusion: By leveraging length priors, LATR resolves the overlap in query roles and effectively improves TSG performance. Abstract: Temporal sentence grounding (TSG) is a highly challenging task aiming to localize the temporal segment within an untrimmed video corresponding to a given natural language description. Benefiting from the design of learnable queries, the DETR-based models have achieved substantial advancements in the TSG task. However, the absence of explicit supervision often causes the learned queries to overlap in roles, leading to redundant predictions. Therefore, we propose to improve TSG by making each query fulfill its designated role, leveraging the length priors of the video-description pairs. In this paper, we introduce the Length-Aware Transformer (LATR) for TSG, which assigns different queries to handle predictions based on varying temporal lengths. Specifically, we divide all queries into three groups, responsible for segments with short, middle, and long temporal durations, respectively. During training, an additional length classification task is introduced. Predictions from queries with mismatched lengths are suppressed, guiding each query to specialize in its designated function. Extensive experiments demonstrate the effectiveness of our LATR, achieving state-of-the-art performance on three public benchmarks. Furthermore, the ablation studies validate the contribution of each component of our method and the critical role of incorporating length priors into the TSG task.
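
The three-group query design described above can be illustrated with a small sketch. It is not the authors' code: the function names, the length thresholds (0.2 and 0.5 of the video duration), the equal-sized contiguous grouping, and the choice to suppress by zeroing scores are all assumptions made for illustration.

```python
from typing import List

GROUPS = ("short", "middle", "long")


def length_group(normalized_length: float,
                 short_max: float = 0.2,
                 middle_max: float = 0.5) -> str:
    """Map a segment length (fraction of video duration) to a length group.

    The two thresholds are assumed values, not taken from the paper.
    """
    if normalized_length <= short_max:
        return "short"
    if normalized_length <= middle_max:
        return "middle"
    return "long"


def assign_query_groups(num_queries: int) -> List[str]:
    """Split queries into three contiguous, near-equal groups (assumed layout)."""
    return [GROUPS[i * 3 // num_queries] for i in range(num_queries)]


def suppress_mismatched(scores: List[float],
                        query_groups: List[str],
                        target_group: str) -> List[float]:
    """Zero the scores of queries whose group differs from the target group."""
    return [s if g == target_group else 0.0
            for s, g in zip(scores, query_groups)]


if __name__ == "__main__":
    groups = assign_query_groups(6)
    scores = [0.9, 0.4, 0.8, 0.3, 0.7, 0.2]
    print(suppress_mismatched(scores, groups, length_group(0.1)))
```

With six queries and a target segment covering 10% of the video, only the two queries in the "short" group keep their scores.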

[151] A Foundation Model for DAS Signal Recognition and Visual Prompt Tuning of the Pre-trained Model for Downstream Tasks

Kun Gui,Hongliang Ren,Shang Shi,Jin Lu,Changqiu Yu,Quanjun Cao,Guomin Gu,Qi Xuan

Main category: cs.CV

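TL;DR: To address data distribution disparities and the shortage of labeled data in DAS signal recognition, this study proposes MAEPD, a model based on a masked autoencoder, and uses visual prompt tuning for downstream recognition tasks.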

Details

Motivation: Data distribution disparities caused by heterogeneous sensing environments make cross-domain generalization difficult for data-driven AI models, which also face a shortage of labeled training data. Method: The study proposes MAEPD, a model based on a masked autoencoder, and applies Visual Prompt Tuning (VPT) to downstream recognition tasks. Result: Experiments validate MAEPD on indoor gait recognition as a downstream task. The VPT-Deep approach reaches 96.94% classification accuracy while fine-tuning only 0.322% of the parameters, and cuts training time by 45%. Conclusion: MAEPD shows potential as a foundation model for DAS signal recognition, offering a new paradigm for addressing the limited generalization of signal recognition models in this field. Abstract: Distributed Acoustic Sensing (DAS) technology finds growing applications across various domains. However, data distribution disparities due to heterogeneous sensing environments pose challenges for data-driven artificial intelligence (AI) models, limiting cross-domain generalization and facing a shortage of labeled training data. To address these issues, this study proposes a foundational model for DAS signal recognition based on a Masked Autoencoder, named MAEPD. The MAEPD model is pretrained on a dataset of 635,860 samples, encompassing DAS gait spatiotemporal signals, 2D GASF images for perimeter security, 2D time-frequency images for pipeline leakage, and open-dataset signals including whale vocalizations and seismic activities, using a self-supervised mask reconstruction task to capture deep semantic features of DAS signals. Visual Prompt Tuning (VPT) is employed for downstream recognition tasks. This method freezes the pretrained backbone parameters and fine-tunes only a small set of learnable visual prompt vectors inserted into the Transformer encoder layers. Experiments on the NVIDIA GeForce RTX 4080 Super platform validate MAEPD using indoor gait recognition as a downstream task. The VPT-Deep approach achieves a classification accuracy of 96.94% with just 0.322% of parameters fine-tuned, surpassing the traditional Full Fine Tuning (FFT) method by 0.61% and reducing training time by 45%. The model also exhibits robust performance in pipeline leakage detection, confirming the generality, efficiency, and scalability of MAEPD as a foundational model.
This approach offers a novel paradigm for addressing the limited generalization of signal recognition models in the DAS domain.
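
To see why prompt tuning touches so few parameters, it helps to count them. The sketch below is illustrative only: the function names are hypothetical, the configuration (12 encoder layers, 10 prompts per layer, embedding width 768, an 86M-parameter backbone) is an assumed example rather than the paper's setup, and any task head is ignored.

```python
def prompt_param_count(num_layers: int, prompts_per_layer: int, embed_dim: int) -> int:
    """Learnable prompt parameters when prompts are inserted at every encoder layer."""
    return num_layers * prompts_per_layer * embed_dim


def trainable_fraction(backbone_params: int, prompt_params: int) -> float:
    """Share of all parameters that receive updates when the backbone is frozen."""
    return prompt_params / (backbone_params + prompt_params)


if __name__ == "__main__":
    # Hypothetical configuration, not taken from the paper.
    prompts = prompt_param_count(num_layers=12, prompts_per_layer=10, embed_dim=768)
    fraction = trainable_fraction(backbone_params=86_000_000, prompt_params=prompts)
    print(f"{prompts} prompt parameters, {fraction:.3%} of the total")
```

Because the prompt count grows only with depth, prompt length, and embedding width, the trainable share stays well below one percent for typical Transformer sizes.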

[152] TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

Xiaoxuan He,Siming Fu,Yuke Zhao,Wanli Li,Jian Yang,Dacheng Yin,Fengyun Rao,Bo Zhang

Main category: cs.CV

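TL;DR: This paper proposes TempFlow-GRPO, a reinforcement learning framework for flow models based on temporally-aware optimization, which addresses imprecise reward credit assignment in text-to-image generation and performs strongly on multiple benchmarks.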

Details

Motivation: Existing flow matching models achieve high quality in text-to-image generation, but when combined with reinforcement learning they rely on a temporal uniformity assumption, which leads to poor credit assignment and hampers optimization. Method: The TempFlow-GRPO framework is proposed, comprising a trajectory branching mechanism and a noise-aware weighting scheme, to address the temporal uniformity assumption in existing methods. Result: TempFlow-GRPO achieves state-of-the-art performance on human preference alignment and text-to-image generation benchmarks. Conclusion: By introducing temporally-aware optimization, TempFlow-GRPO significantly improves flow-based text-to-image generation on human preference alignment and standard benchmarks. Abstract: Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce TempFlow-GRPO (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces two key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; and (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and standard text-to-image benchmarks.

[153] RiemanLine: Riemannian Manifold Representation of 3D Lines for Factor Graph Optimization

Yanyan Li,Ze Yang,Keisuke Tateno,Federico Tombari Liang Zhao,Gim Hee Lee

Main category: cs.CV

TL;DR: RiemanLine is a novel minimal parametrization for 3D lines that improves structural mapping and camera localization by efficiently encoding parallelism and structural regularities.

Details

Motivation: Existing 3D line representations in robotics and computer vision typically handle independent lines, neglecting structural regularities like parallelism in man-made environments. Method: RiemanLine decouples each line into global and local components, using a shared vanishing direction and scaled normal vectors on orthogonal subspaces, and integrates this into a factor graph framework for optimization. Result: RiemanLine reduces the parameter space for parallel lines, embeds parallelism naturally without constraints, and achieves more accurate pose estimation and line reconstruction on ICL-NUIM, TartanAir, and synthetic benchmarks. Conclusion: RiemanLine provides a more accurate and efficient way to represent 3D lines, improving convergence stability and reducing parameter dimensionality. Abstract: Minimal parametrization of 3D lines plays a critical role in camera localization and structural mapping. Existing representations in robotics and computer vision predominantly handle independent lines, overlooking structural regularities such as sets of parallel lines that are pervasive in man-made environments. This paper introduces RiemanLine, a unified minimal representation for 3D lines formulated on Riemannian manifolds that jointly accommodates both individual lines and parallel-line groups. Our key idea is to decouple each line landmark into global and local components: a shared vanishing direction optimized on the unit sphere $\mathcal{S}^2$, and scaled normal vectors constrained on orthogonal subspaces, enabling compact encoding of structural regularities. For $n$ parallel lines, the proposed representation reduces the parameter space from $4n$ (orthonormal form) to $2n+2$, naturally embedding parallelism without explicit constraints. We further integrate this parameterization into a factor graph framework, allowing global direction alignment and local reprojection optimization within a unified manifold-based bundle adjustment.
Extensive experiments on ICL-NUIM, TartanAir, and synthetic benchmarks demonstrate that our method achieves significantly more accurate pose estimation and line reconstruction, while reducing parameter dimensionality and improving convergence stability.
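
The parameter counts quoted in the abstract ($4n$ versus $2n+2$) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the split of $2n+2$ into 2 parameters for the shared direction on $\mathcal{S}^2$ plus 2 per line is one reading of the abstract, not a detail confirmed by it.

```python
def orthonormal_params(n: int) -> int:
    """Parameters for n lines stored independently, 4 per line."""
    return 4 * n


def grouped_parallel_params(n: int) -> int:
    """Parameters for n parallel lines sharing one direction.

    Assumed breakdown: 2 for the shared direction on the unit sphere,
    plus 2 per line for its local component.
    """
    return 2 * n + 2


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        saved = orthonormal_params(n) - grouped_parallel_params(n)
        print(f"n={n}: {orthonormal_params(n)} vs {grouped_parallel_params(n)} (saves {saved})")
```

For a single line the two counts coincide, and the saving grows as $2n-2$ once lines share a direction.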

[154] RotatedMVPS: Multi-view Photometric Stereo with Rotated Natural Light

Songyun Yang,Yufei Han,Jilong Zhang,Kongming Liang,Peng Yu,Zhaowei Qu,Heng Guo

Main category: cs.CV

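TL;DR: This paper proposes RotatedMVPS, a method that recovers high-fidelity surface shape and reflectance under natural illumination using a practical rotation stage.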

Details

Motivation: Existing multi-view photometric stereo (MVPS) methods usually require a controlled darkroom or ignore reflectance recovery, which limits their use in natural illumination scenarios. This paper aims to address that limitation. Method: By ensuring light consistency across different camera and object poses, the method reduces the unknowns associated with complex environment light, and it integrates data priors from learning-based single-view photometric stereo methods to improve the accuracy of shape and reflectance recovery. Result: Experiments on synthetic and real-world datasets show that the method is effective at recovering shape and reflectance. Conclusion: RotatedMVPS recovers high-fidelity surface shape and reflectance under natural illumination and has high practical value. Abstract: Multiview photometric stereo (MVPS) seeks to recover high-fidelity surface shapes and reflectances from images captured under varying views and illuminations. However, existing MVPS methods often require controlled darkroom settings for varying illuminations or overlook the recovery of reflectance and illumination properties, limiting their applicability in natural illumination scenarios and downstream inverse rendering tasks. In this paper, we propose RotatedMVPS to solve shape and reflectance recovery under rotated natural light, achievable with a practical rotation stage. By ensuring light consistency across different camera and object poses, our method reduces the unknowns associated with complex environment light. Furthermore, we integrate data priors from off-the-shelf learning-based single-view photometric stereo methods into our MVPS framework, significantly enhancing the accuracy of shape and reflectance recovery. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of our approach.

[155] TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Canhui Tang,Zifan Han,Hongbo Sun,Sanping Zhou,Xuchong Zhang,Xin Wei,Ye Yuan,Jinglin Xu,Hao Sun

Main category: cs.CV

TL;DR: Proposes TSPO, a method that uses reinforcement learning to improve multimodal large language models' understanding of long-form video content.

Details

Motivation: Multimodal large language models face context limits and training costs when processing long video inputs, and existing methods may miss critical events or be constrained by the pre-trained models' event understanding capabilities. Method: Proposes a trainable event-aware temporal agent and the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process and optimizes the temporal sampling policy with rule-based rewards. Result: Comprehensive experiments show that TSPO achieves state-of-the-art performance on multiple long video understanding benchmarks and transfers across different video multimodal large language models. Conclusion: TSPO uses reinforcement learning to significantly improve multimodal large language models' understanding of long videos and is broadly applicable. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. Existing video MLLMs adopt training-free uniform sampling or keyframe search, which may miss critical events or be constrained by the pre-trained models' event understanding capabilities. Meanwhile, building a training-based method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization with efficient rule-based rewards. Furthermore, for TSPO training, we propose a long video training data construction pipeline with comprehensive temporal data and video Needle-in-a-Haystack data. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. 
Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs.
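
The abstract mentions "group relative optimization" with rule-based rewards but does not give a formula. As a rough illustration only, the sketch below shows the group-relative advantage normalization commonly used in GRPO-style training: rewards for several sampled rollouts of the same query are centered and scaled within the group. Whether TSPO uses exactly this form is an assumption, and the function name and `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of sampled rollouts.

    rewards: rule-based rewards for rollouts of the same query,
             e.g. 1.0 for a correct answer and 0.0 otherwise.
    eps:     small constant that avoids division by zero when all
             rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one question: two correct, two incorrect.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Rollouts that beat the group average receive a positive advantage and the rest a negative one, so no separate learned value function is needed for the baseline.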

Lefei Shen,Mouxiang Chen,Xu Liu,Han Fu,Xiaoxue Ren,Jianling Sun,Zhuo Li,Chenghao Liu

Main category: cs.CV

TL;DR: VisionTS++ addresses the challenges of using vision models for time series forecasting by bridging data modality, multivariate, and probabilistic gaps, achieving superior performance in both deterministic and probabilistic forecasting tasks.

Details

Motivation: Vision models pre-trained on images show potential for time series forecasting but face challenges due to discrepancies in data modality, multivariate forecasting, and probabilistic forecasting. Method: VisionTS++ uses a vision-model-based filtering mechanism to identify high-quality time series data, a colorized multivariate conversion method to transform time series into RGB images, and a multi-quantile forecasting approach with parallel reconstruction heads for uncertainty-aware predictions. Result: VisionTS++ achieves state-of-the-art (SOTA) results, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in 9 out of 12 probabilistic forecasting settings on both in-distribution and out-of-distribution benchmarks. Conclusion: VisionTS++ bridges the gaps between vision models and time series forecasting, establishing a new paradigm for cross-modal knowledge transfer and advancing the development of universal time series foundation models (TSFMs). Abstract: Recent studies have revealed that vision models pre-trained on images can perform well in time series forecasting by reformulating forecasting as an image reconstruction task, suggesting their potential as universal time series foundation models. However, effective cross-modal transfer from vision to time series remains challenging due to three key discrepancies: (1) data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) multivariate-forecasting gap between standard RGB three-channel-based vision models and the need to model time series with arbitrary numbers of variates; and (3) probabilistic-forecasting gap between the deterministic output formats of most vision models and the requirement for uncertainty-aware probabilistic predictions. 
To bridge these gaps, we propose VisionTS++, a vision-model-based TSFM that performs continual pre-training on large-scale time series datasets, with three innovations: (1) a vision-model-based filtering mechanism to identify high-quality time series data, thereby mitigating the modality gap and improving pre-training stability; (2) a colorized multivariate conversion method that transforms multivariate time series into multi-subfigure RGB images, capturing complex inter-variate dependencies; and (3) a multi-quantile forecasting approach using parallel reconstruction heads to generate forecasts of different quantile levels, thus more flexibly approximating arbitrary output distributions without restrictive prior distributional assumptions. Evaluated on both in-distribution and out-of-distribution TSF benchmarks, VisionTS++ achieves SOTA results, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in 9 out of 12 probabilistic forecasting settings. Our work establishes a new paradigm for cross-modal knowledge transfer, advancing the development of universal TSFMs.
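
Multi-quantile forecasting is usually trained with the pinball (quantile) loss, one term per quantile level. The abstract does not state which loss VisionTS++ uses, so the sketch below is a generic illustration under that assumption; the function names and the example quantile levels are hypothetical.

```python
def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball loss for one prediction at quantile level q in (0, 1).

    Under-prediction is weighted by q and over-prediction by (1 - q),
    so the minimizer is the q-th quantile of the target distribution.
    """
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


def multi_quantile_loss(y_true: float, preds_by_quantile: dict) -> float:
    """Average pinball loss over several quantile heads.

    preds_by_quantile maps a quantile level to that head's prediction.
    """
    losses = [pinball_loss(y_true, p, q) for q, p in preds_by_quantile.items()]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Three hypothetical heads predicting the 10%, 50% and 90% quantiles.
    print(multi_quantile_loss(10.0, {0.1: 7.0, 0.5: 9.5, 0.9: 12.0}))
```

Because each head has its own quantile level, the set of outputs can approximate an arbitrary predictive distribution without assuming a parametric form.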

[157] ProtoN: Prototype Node Graph Neural Network for Unconstrained Multi-Impression Ear Recognition

Santhoshkumar Peddi,Sadhvik Bathini,Arun Balasubramanian,Monalisa Sarma,Debasis Samanta

Main category: cs.CV

TL;DR: ProtoN improves few-shot ear recognition by using a graph-based framework to better capture identity features despite limited data and high variability.

Details

Motivation: Ear biometrics are stable and contactless but limited by scarce annotated data and intra-class variability; existing methods struggle to capture consistent identity features. Method: ProtoN utilizes a Prototype Graph Neural Network (PGNN) with a dual-path message-passing mechanism and cross-graph prototype alignment to refine representations of ear biometric data. Result: ProtoN achieved state-of-the-art performance with up to 99.60% Rank-1 identification accuracy and an Equal Error Rate (EER) of 0.025 on five benchmark datasets. Conclusion: ProtoN, a few-shot learning framework using graph-based processing, significantly improves the effectiveness of ear biometric identification under limited data conditions. Abstract: Ear biometrics offer a stable and contactless modality for identity recognition, yet their effectiveness remains limited by the scarcity of annotated data and significant intra-class variability. Existing methods typically extract identity features from individual impressions in isolation, restricting their ability to capture consistent and discriminative representations. To overcome these limitations, a few-shot learning framework, ProtoN, is proposed to jointly process multiple impressions of an identity using a graph-based approach. Each impression is represented as a node in a class-specific graph, alongside a learnable prototype node that encodes identity-level information. This graph is processed by a Prototype Graph Neural Network (PGNN) layer, specifically designed to refine both impression and prototype representations through a dual-path message-passing mechanism. To further enhance discriminative power, the PGNN incorporates a cross-graph prototype alignment strategy that improves class separability by enforcing intra-class compactness while maintaining inter-class distinction. 
Additionally, a hybrid loss function is employed to balance episodic and global classification objectives, thereby improving the overall structure of the embedding space. Extensive experiments on five benchmark ear datasets demonstrate that ProtoN achieves state-of-the-art performance, with Rank-1 identification accuracy of up to 99.60% and an Equal Error Rate (EER) as low as 0.025, showing the effectiveness for few-shot ear recognition under limited data conditions.

[158] Deep Learning-based Scalable Image-to-3D Facade Parser for Generating Thermal 3D Building Models

Yinan Yu,Alex Gonzalez-Caceres,Samuel Scheidegger,Sanjay Somanath,Alexander Hollberg

Main category: cs.CV

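TL;DR: This paper proposes SI3FP, a new method that generates LoD3 thermal models from images; it is suited to large-scale building renovation planning and has broad application prospects in urban development.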

Details

Motivation: Renovating existing buildings is essential for reducing climate impact. Early-phase renovation planning requires simulations based on Level of Detail (LoD) 3 thermal 3D models that include features such as windows, but a scalable and accurate method for identifying such features is still lacking. Method: Proposes the Scalable Image-to-3D Facade Parser (SI3FP), a pipeline that directly models geometric primitives in the orthographic image plane and combines computer vision and deep learning to extract geometry from images and generate LoD3 thermal models. Result: Tested on typical Swedish residential buildings, SI3FP achieved approximately 5% error in window-to-wall ratio estimates, indicating sufficient accuracy for early-stage renovation analysis. Conclusion: SI3FP is a scalable pipeline with a unified interface that reduces perspective distortion, supports both sparse and dense data sources, suits large-scale energy renovation planning, and has broader applications in urban development and planning. Abstract: Renovating existing buildings is essential for reducing climate impact. Early-phase renovation planning requires simulations based on thermal 3D models at Level of Detail (LoD) 3, which include features like windows. However, scalable and accurate identification of such features remains a challenge. This paper presents the Scalable Image-to-3D Facade Parser (SI3FP), a pipeline that generates LoD3 thermal models by extracting geometries from images using both computer vision and deep learning. Unlike existing methods relying on segmentation and projection, SI3FP directly models geometric primitives in the orthographic image plane, providing a unified interface while reducing perspective distortions. SI3FP supports both sparse (e.g., Google Street View) and dense (e.g., hand-held camera) data sources. Tested on typical Swedish residential buildings, SI3FP achieved approximately 5% error in window-to-wall ratio estimates, demonstrating sufficient accuracy for early-stage renovation analysis. The pipeline facilitates large-scale energy renovation planning and has broader applications in urban development and planning.
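
The window-to-wall ratio used as the evaluation metric can be computed directly once windows are available as rectangles in the orthographic facade plane. The sketch below assumes axis-aligned, non-overlapping window rectangles given as `(x0, y0, x1, y1)` in metres and takes the ratio against the gross facade area; these conventions and the function names are illustrative assumptions, not details taken from the paper.

```python
def rect_area(rect) -> float:
    """Area of an axis-aligned rectangle (x0, y0, x1, y1); degenerate boxes give 0."""
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def window_to_wall_ratio(facade, windows) -> float:
    """Total window area divided by gross facade area.

    Assumes the window rectangles do not overlap each other.
    """
    wall = rect_area(facade)
    if wall == 0.0:
        raise ValueError("facade has zero area")
    return sum(rect_area(w) for w in windows) / wall


if __name__ == "__main__":
    facade = (0.0, 0.0, 10.0, 5.0)              # 10 m wide, 5 m high
    windows = [(1.0, 1.0, 3.0, 3.0), (5.0, 1.0, 7.0, 3.0)]
    print(window_to_wall_ratio(facade, windows))
```

Working in the orthographic plane means areas can be summed without correcting for perspective, which is one reason a rectified representation is convenient for this metric.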

[159] Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Haoji Zhang,Xin Gu,Jiawen Li,Chixiang Ma,Sule Bai,Chubin Zhang,Bowen Zhang,Zhichao Zhou,Dongliang He,Yansong Tang

Main category: cs.CV

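TL;DR: Proposes VITAL, a novel end-to-end agentic video reasoning framework that addresses the limitations of multimodal large language models in video reasoning tasks.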

Details

Motivation: To address the increased hallucination and limited cross-modal interaction of existing text-based chain-of-thought reasoning methods, especially in long video reasoning. Method: Builds the VITAL framework, which uses a visual toolbox to densely sample new video frames on demand and generate multimodal chain-of-thought reasoning, and proposes the DGRPO algorithm to mitigate difficulty imbalance in multi-task reinforcement learning. Result: On 11 challenging video understanding benchmarks, VITAL shows advanced reasoning ability and outperforms existing methods in video question answering and temporal grounding, especially in long video scenarios. Conclusion: VITAL effectively improves the performance of multimodal large language models on video reasoning tasks, particularly for long videos. Abstract: The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, outperforming existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. All code, data and model weights will be made publicly available.

[160] Efficient Inter-Task Attention for Multitask Transformer Models

Christian Bohn,Thomas Kurbiel,Klaus Friedrichs,Hasan Tercan,Tobias Meisen

Main category: cs.CV

TL;DR: The paper proposes a novel and efficient attention mechanism for multitask models that significantly reduces computational costs while improving performance on vision datasets.

Details

Motivation: The motivation is to overcome the computational limitations of the Transformer's Multi-Head-Attention in Multitask Learning, where the attention matrix scales quadratically with the number of tasks. Method: The authors introduce a new Deformable Inter-Task Self-Attention mechanism designed to efficiently aggregate information across feature maps from different tasks. Result: Experiments on NYUD-v2 and PASCAL-Context datasets showed an order-of-magnitude reduction in FLOPs count and inference latency, along with up to a 7.4% improvement in prediction quality metrics for individual tasks. Conclusion: The proposed Deformable Inter-Task Self-Attention for Multitask models significantly improves computational efficiency and prediction quality in comparison to the conventional Transformer architecture. Abstract: In both Computer Vision and the wider Deep Learning field, the Transformer architecture is well-established as state-of-the-art for many applications. For Multitask Learning, however, where there may be many more queries necessary compared to single-task models, its Multi-Head-Attention often approaches the limits of what is computationally feasible considering practical hardware limitations. This is due to the fact that the size of the attention matrix scales quadratically with the number of tasks (assuming roughly equal numbers of queries for all tasks). As a solution, we propose our novel Deformable Inter-Task Self-Attention for Multitask models that enables the much more efficient aggregation of information across the feature maps from different tasks. In our experiments on the NYUD-v2 and PASCAL-Context datasets, we demonstrate an order-of-magnitude reduction in both FLOPs count and inference latency. At the same time, we also achieve substantial improvements by up to 7.4% in the individual tasks' prediction quality metrics.
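
The quadratic scaling described in the abstract follows from the size of the full attention matrix when the queries of all tasks attend to each other. The sketch below counts attention-matrix entries under the abstract's simplifying assumption of equal query counts per task; the function name and the example sizes are hypothetical.

```python
def full_attention_entries(num_tasks: int, queries_per_task: int) -> int:
    """Entries in one dense attention matrix over the tokens of all tasks.

    With T tasks and N queries each, the matrix is (T*N) x (T*N),
    so the count grows with T squared.
    """
    total_tokens = num_tasks * queries_per_task
    return total_tokens * total_tokens


if __name__ == "__main__":
    n = 1000  # hypothetical number of queries per task
    for t in (1, 2, 4, 8):
        ratio = full_attention_entries(t, n) // full_attention_entries(1, n)
        print(f"{t} tasks: {full_attention_entries(t, n)} entries ({ratio}x single-task)")
```

Doubling the number of tasks quadruples the matrix, which is the cost that a sparser, deformable sampling scheme is meant to avoid.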

[161] Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Tong Wang,Guanyu Yang,Nian Liu,Zongyan Han,Jinxing Zhou,Salman Khan,Fahad Shahbaz Khan

Main category: cs.CV

TL;DR: The paper introduces Composed Object Retrieval (COR), a new task for retrieving and segmenting target objects at the object level, along with the CORE model and COR127K benchmark. Experiments show that CORE outperforms existing models, establishing a new baseline for fine-grained multi-modal retrieval research.

Details

Motivation: Current Composed Image Retrieval (CIR) methods are limited to image-level matching and cannot localize specific objects. This limitation motivates the need for a new approach that achieves object-level precision in retrieval. Method: The researchers proposed Composed Object Retrieval (COR), a new task for object-level precision retrieval, and presented CORE, an end-to-end model integrating reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning. They also constructed COR127K, a large-scale benchmark for COR. Result: Extensive experiments showed that the CORE model significantly outperforms existing models in both base and novel categories on the COR127K benchmark. Conclusion: The study concludes that the proposed CORE model outperforms existing models in both base and novel categories, establishing a baseline for the new COR task and opening new directions for fine-grained multi-modal retrieval research. Abstract: Retrieving fine-grained visual content based on user intent remains a challenge in multi-modal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a brand-new task that goes beyond image-level retrieval to achieve object-level precision, allowing the retrieval and segmentation of target objects based on composed expressions combining reference objects and retrieval texts. COR presents significant challenges in retrieval flexibility, which requires systems to identify arbitrary objects satisfying composed expressions while avoiding semantically similar but irrelevant negative objects within the same scene. We construct COR127K, the first large-scale COR benchmark that contains 127,166 retrieval triplets with various semantic transformations in 408 categories. 
We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning. Extensive experiments demonstrate that CORE significantly outperforms existing models in both base and novel categories, establishing a simple and effective baseline for this challenging task while opening new directions for fine-grained multi-modal retrieval research.

[162] Benchmarking Foundation Models for Mitotic Figure Classification

Jonas Ammeling,Jonathan Ganz,Emely Rosbach,Ludwig Lausser,Christof A. Bertram,Katharina Breininger,Marc Aubreville

Main category: cs.CV

TL;DR: This paper investigates the use of foundation models with LoRA adaptation for mitotic figure classification, showing improved performance and robustness with limited data while competing with traditional fully fine-tuned models.

Details

Motivation: The limited availability of labeled images in medical domains like pathology motivates the use of foundation models trained on unlabeled data to generalize well to new tasks with minimal training effort, improving model performance and robustness. Method: The study evaluates foundation models for mitotic figure classification by investigating data scaling laws and model robustness to unseen tumor domains. It adapts the models with both linear probing and low-rank adaptation (LoRA) of their attention mechanisms, and compares them against end-to-end-trained baselines (CNNs and Vision Transformers). Result: LoRA-adapted foundation models achieve superior performance compared to linear probing, reaching performance close to that obtained with 100% of the training data while using only 10% of it. They also significantly reduce the performance gap on unseen tumor domains. However, traditional architectures with full fine-tuning remain competitive. Conclusion: LoRA-adapted foundation models outperform standard linear probing and approach performance levels seen with full data availability, while also nearly closing the out-of-domain performance gap, although full fine-tuning of traditional architectures remains competitive. Abstract: The performance of deep learning models is known to scale with data quantity and diversity. In pathology, as in many other medical imaging domains, the availability of labeled images for a specific task is often limited. Self-supervised learning techniques have enabled the use of vast amounts of unlabeled data to train large-scale neural networks, i.e., foundation models, that can address the limited data problem by providing semantically rich feature vectors that can generalize well to new tasks with minimal training effort, increasing model performance and robustness. In this work, we investigate the use of foundation models for mitotic figure classification. 
The mitotic count, which can be derived from this classification task, is an independent prognostic marker for specific tumors and part of certain tumor grading systems. In particular, we investigate the data scaling laws on multiple current foundation models and evaluate their robustness to unseen tumor domains. Next to the commonly used linear probing paradigm, we also adapt the models using low-rank adaptation (LoRA) of their attention mechanisms. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers. Our results demonstrate that LoRA-adapted foundation models provide superior performance to those adapted with standard linear probing, reaching performance levels close to 100% data availability with only 10% of training data. Furthermore, LoRA-adaptation of the most recent foundation models almost closes the out-of-domain performance gap when evaluated on unseen tumor domains. However, full fine-tuning of traditional architectures still yields competitive performance.
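
Low-rank adaptation keeps a pretrained weight matrix frozen and learns only a low-rank correction to it. The sketch below illustrates the general LoRA idea in plain Python on toy matrices; it is not the configuration used in this benchmark, and the function names, the `alpha` scaling parameter, and the example dimensions are assumptions made for illustration.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen pretrained weights, shape d_out x d_in.
    A: trainable down-projection, shape r x d_in.
    B: trainable up-projection, shape d_out x r.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of one LoRA pair, versus d_in * d_out for full tuning."""
    return r * (d_in + d_out)


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    A = [[1.0, 1.0]]            # rank r = 1
    B = [[1.0], [2.0]]
    print(lora_forward(W, A, B, [1.0, 2.0]))
    print(lora_param_count(768, 768, 8), "vs", 768 * 768)
```

With `B` initialized to zeros the adapted layer reproduces the pretrained output exactly, so training starts from the foundation model's behavior.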

[163] Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion

Qingguo Hu,Ante Wang,Jia Song,Delai Qiu,Qingsong Liu,Jinsong Su

Main category: cs.CV

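TL;DR: This paper proposes a self-improvement framework based on Causality-driven Visual object Completion (CVC) to strengthen the visual perception and reasoning abilities of large vision-language models.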

Details

Motivation: Current large vision-language models (LVLMs) still perform poorly on tasks that require deep visual perception, such as identifying subtle differences between images, possibly because popular instruction-tuning corpora contain too little visual knowledge, resulting in inadequate visual perception and reasoning. Method: Introduces a self-improvement framework built on CVC, a novel visual knowledge-intensive task, and obtains rich examples through an automated instance construction pipeline without relying on sophisticated LVLMs or human assistance. Result: The method achieves substantial gains on four specialized tasks and four widely used comprehensive benchmarks; on the specialized tasks, it improves over the corresponding baselines by 5.4% and 4.0% on average with LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. Conclusion: The experimental results confirm that the method delivers substantial gains across the four specialized tasks and four comprehensive benchmarks, with average improvements of 5.4% and 4.0% over the baselines for LLaVA-1.5-7B and LLaVA-1.5-13B. Abstract: Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer the masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial and error learning using these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. Especially on specialized tasks, our method achieves an average improvement of 5.4% and 4.0% compared to the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. The code is available at https://github.com/XMUDeepLIT/CVC.

[164] 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation

Shuzhou Yang,Xiaodong Cun,Xiaoyu Li,Yaowei Li,Jian Zhang

Main category: cs.CV

TL;DR: 4DVD introduces a cascaded diffusion model to decouple 4D content generation, enhancing cross-view and temporal consistency for high-quality results.

Details

Motivation: Directly generating high-dimensional data like 4D is highly complex, so a more decoupled approach is needed to improve efficiency and quality. Method: 4DVD uses a cascaded video diffusion model that decouples 4D generation into coarse multi-view layout generation and structure-aware conditional generation. Result: 4DVD achieves state-of-the-art performance on novel view synthesis and 4D generation, supported by the newly created D-Objaverse dataset. Conclusion: 4DVD effectively decouples the generation of 4D content into two subtasks, achieving high-quality dense-view videos and accurate 4D representation optimization. Abstract: Given the high complexity of directly generating high-dimensional data such as 4D, we present 4DVD, a cascaded video diffusion model that generates 4D content in a decoupled manner. Unlike previous multi-view video methods that directly model 3D space and temporal features simultaneously with stacked cross view/temporal attention modules, 4DVD decouples this into two subtasks: coarse multi-view layout generation and structure-aware conditional generation, and effectively unifies them. Specifically, given a monocular video, 4DVD first predicts the dense view content of its layout with superior cross-view and temporal consistency. Based on the produced layout priors, a structure-aware spatio-temporal generation branch is developed, combining these coarse structural priors with the exquisite appearance content of the input monocular video to generate final high-quality dense-view videos. As a result, explicit 4D representations (such as 4D Gaussians) can be optimized accurately, enabling wider practical application. To train 4DVD, we collect a dynamic 3D object dataset, called D-Objaverse, from the Objaverse benchmark and render 16 videos with 21 frames for each object. Extensive experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation. Our project page is https://4dvd.github.io/

[165] Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model

Hongxu Chen,Zhen Wang,Taoran Mei,Lin Li,Bowei Zhu,Runshi Li,Long Chen

Main category: cs.CV

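TL;DR: This paper proposes ErasePro, a new concept erasure method that uses a zero-residual constraint and a progressive layer-wise update strategy to address the incomplete erasure and degraded generation quality of existing methods.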

Details

Motivation: Existing methods suffer from incomplete erasure and degraded generation quality. Method: Formulates concept erasure as an optimization problem and introduces a zero-residual constraint and a layer-wise update strategy. Result: Experiments show that ErasePro performs well across a range of concept erasure tasks, erasing target concepts more completely while preserving generation quality. Conclusion: By introducing a strict zero-residual constraint and a progressive layer-wise update strategy, ErasePro improves both the completeness of concept erasure and generation quality. Abstract: Concept Erasure, which aims to prevent pretrained text-to-image models from generating content associated with semantic-harmful concepts (i.e., target concepts), is receiving increasing attention. State-of-the-art methods formulate this task as an optimization problem: they align all target concepts with semantic-harmless anchor concepts, and apply closed-form solutions to update the model accordingly. While these closed-form methods are efficient, we argue that existing methods have two overlooked limitations: 1) They often result in incomplete erasure due to "non-zero alignment residual", especially when text prompts are relatively complex. 2) They may suffer from generation quality degradation as they always concentrate parameter updates in a few deep layers. To address these issues, we propose a novel closed-form method ErasePro: it is designed for more complete concept erasure and better preserving overall generative quality. Specifically, ErasePro first introduces a strict zero-residual constraint into the optimization objective, ensuring perfect alignment between target and anchor concept features and enabling more complete erasure. Second, it employs a progressive, layer-wise update strategy that gradually transfers target concept features to those of the anchor concept from shallow to deep layers. As the depth increases, the required parameter changes diminish, thereby reducing deviations in sensitive deep layers and preserving generative quality. Empirical results across different concept erasure tasks (including instance, art style, and nudity erasure) have demonstrated the effectiveness of our ErasePro.

[166] QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution

Bowen Chai,Zheng Chen,Libo Zhu,Wenbo Li,Yong Guo,Yulun Zhang

Main category: cs.CV

TL;DR: QuantVSR introduces a novel quantization approach for diffusion-based video super-resolution models, combining a spatio-temporal complexity aware mechanism and learnable bias alignment to achieve high performance with reduced computational demands.

Details

Motivation: Diffusion models excel in video super-resolution (VSR) but are limited by slow processing speeds and high resource consumption. Quantization is a promising compression technique, but VSR models are difficult to quantize due to their temporal nature and high fidelity requirements. This work aims to overcome these challenges. Method: The authors propose QuantVSR, which incorporates a spatio-temporal complexity aware (STCA) mechanism and a learnable bias alignment (LBA) module. STCA measures spatial and temporal complexity per layer to allocate layer-specific ranks for a low-rank full-precision branch, while LBA helps reduce quantization errors. The full-precision and low-bit branches are then jointly optimized. Result: Experiments on synthetic and real-world datasets show that QuantVSR achieves performance comparable to the full-precision model and significantly outperforms state-of-the-art low-bit quantization methods. Conclusion: QuantVSR is able to achieve comparable performance to the full-precision model and significantly outperforms recent low-bit quantization methods, making it a promising solution for deploying diffusion-based VSR models in real-world applications with limited resources. Abstract: Diffusion models have shown superior performance in real-world video super-resolution (VSR). However, the slow processing speeds and heavy resource consumption of diffusion models hinder their practical application and deployment. Quantization offers a potential solution for compressing the VSR model. Nevertheless, quantizing VSR models is challenging due to their temporal characteristics and high fidelity requirements. To address these issues, we propose QuantVSR, a low-bit quantization model for real-world VSR. We propose a spatio-temporal complexity aware (STCA) mechanism, where we first utilize the calibration dataset to measure both spatial and temporal complexities for each layer. 
Based on these statistics, we allocate layer-specific ranks to the low-rank full-precision (FP) auxiliary branch. Subsequently, we jointly refine the FP and low-bit branches to achieve simultaneous optimization. In addition, we propose a learnable bias alignment (LBA) module to reduce the biased quantization errors. Extensive experiments on synthetic and real-world datasets demonstrate that our method obtains comparable performance with the FP model and significantly outperforms recent leading low-bit quantization methods. Code is available at: https://github.com/bowenchai/QuantVSR.
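
For readers unfamiliar with low-bit quantization, the sketch below shows a generic symmetric uniform quantize-dequantize round trip and the rounding error it introduces. It is not QuantVSR's STCA or LBA mechanism, which the abstract describes only at a high level; the function name and the per-tensor scaling choice are assumptions made for illustration.

```python
def fake_quantize(values, bits=4):
    """Quantize to signed integers with `bits` bits, then map back to floats.

    Uses one scale for the whole list (per-tensor, symmetric), chosen so
    the largest magnitude maps to the largest representable integer.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.0, 0.3, -1.0, 1.0]
    restored = fake_quantize(weights, bits=4)
    errors = [abs(w - r) for w, r in zip(weights, restored)]
    print(restored)
    print(max(errors))
```

The residual between the original and restored values is the quantization error that an auxiliary full-precision branch or a bias correction would need to absorb.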

[167] Learning Robust Intervention Representations with Delta Embeddings

Panagiotis Alimisis,Christos Diou

Main category: cs.CV

TL;DR: This paper proposes Causal Delta Embeddings to represent interventions in latent space for improved out-of-distribution robustness in causal representation learning.

Details

Motivation: Most research in causal representation learning focuses on identifying and representing scene variables under a causal model, with less emphasis on representing the interventions themselves. This work aims to address this gap by focusing on intervention representations for improved OOD robustness. Method: The authors propose a framework that learns causal representations from image pairs without additional supervision, based on Causal Delta Embeddings that are invariant to the visual scene and sparse in terms of the causal variables they affect. Result: Experiments on the Causal Triplet challenge showed that Causal Delta Embeddings are highly effective in OOD settings, significantly outperforming baselines on both synthetic and real-world benchmarks. Conclusion: Focusing on the representation of interventions in the latent space, specifically using Causal Delta Embeddings, is an effective strategy for improving OOD robustness in causal representation learning. Abstract: Causal representation learning has attracted significant research interest during the past few years, as a means for improving model generalization and robustness. Causal representations of interventional image pairs have the property that only variables corresponding to scene elements affected by the intervention / action are changed between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of interventions in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. 
Leveraging this insight, we propose a framework that is capable of learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.

[168] MonoCloth: Reconstruction and Animation of Cloth-Decoupled Human Avatars from Monocular Videos

Daisheng Jin,Ying He

Main category: cs.CV

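TL;DR: MonoCloth is a new method for reconstructing and animating clothed human avatars from monocular videos; its part-based decomposition strategy improves reconstruction quality and supports additional tasks such as clothing transfer.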

Details

Motivation: Reconstructing realistic 3D human avatars from monocular videos is challenging because monocular video offers limited geometric information and involves complex non-rigid motion. Method: MonoCloth uses a part-based decomposition strategy that separates the avatar into body, face, hands, and clothing, recovering detailed geometry for the face and hands and using a dedicated cloth simulation module to capture garment deformation. Result: Experiments show that MonoCloth improves both visual reconstruction quality and animation realism compared to existing methods. Conclusion: MonoCloth is a new method for reconstructing and animating clothed human avatars from monocular videos; its part-based design also supports additional tasks such as clothing transfer, which demonstrates its versatility and practical utility. Abstract: Reconstructing realistic 3D human avatars from monocular videos is a challenging task due to the limited geometric information and complex non-rigid motion involved. We present MonoCloth, a new method for reconstructing and animating clothed human avatars from monocular videos. To overcome the limitations of monocular input, we introduce a part-based decomposition strategy that separates the avatar into body, face, hands, and clothing. This design reflects the varying levels of reconstruction difficulty and deformation complexity across these components. Specifically, we focus on detailed geometry recovery for the face and hands. For clothing, we propose a dedicated cloth simulation module that captures garment deformation using temporal motion cues and geometric constraints. Experimental results demonstrate that MonoCloth improves both visual reconstruction quality and animation realism compared to existing methods. Furthermore, thanks to its part-based design, MonoCloth also supports additional tasks such as clothing transfer, underscoring its versatility and practical utility.

[169] Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation

Uzay Gökay,Federico Spurio,Dominik R. Bach,Juergen Gall

Main category: cs.CV

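TL;DR: This paper proposes a new unsupervised method for skeleton-based temporal action segmentation. It uses a sequence-to-sequence temporal autoencoder and quantizes latent skeleton sequences into distinctive skeleton motion words, which drives the discovery of semantically meaningful action clusters, and it outperforms prior methods on three datasets.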

Details

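Motivation: Current state-of-the-art methods for skeleton-based temporal action segmentation are mostly supervised and require expensive annotated data. Existing unsupervised temporal action segmentation methods, by contrast, focus mainly on video data, and skeleton sequences remain underexplored despite their relevance to real-world applications, robustness, and privacy-preserving nature. Method: The paper uses a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. Latent skeleton sequences are then divided into non-overlapping patches and quantized to obtain distinctive skeleton motion words, which drives the discovery of semantically meaningful action clusters. Result: The method is thoroughly evaluated on three widely used skeleton-based datasets (HuGaDB, LARa, and BABEL), and the results show that it outperforms current state-of-the-art unsupervised temporal action segmentation methods. Conclusion: The paper proposes a new unsupervised method for skeleton-based temporal action segmentation and shows on three widely used skeleton-based datasets that it outperforms current state-of-the-art unsupervised temporal action segmentation methods. Abstract: Current state-of-the-art methods for skeleton-based temporal action segmentation are predominantly supervised and require annotated data, which is expensive to collect. In contrast, existing unsupervised temporal action segmentation methods have focused primarily on video data, while skeleton sequences remain underexplored, despite their relevance to real-world applications, robustness, and privacy-preserving nature. In this paper, we propose a novel approach for unsupervised skeleton-based temporal action segmentation. Our method utilizes a sequence-to-sequence temporal autoencoder that keeps the information of the different joints disentangled in the embedding space. Latent skeleton sequences are then divided into non-overlapping patches and quantized to obtain distinctive skeleton motion words, driving the discovery of semantically meaningful action clusters. We thoroughly evaluate the proposed approach on three widely used skeleton-based datasets, namely HuGaDB, LARa, and BABEL. The results demonstrate that our model outperforms the current state-of-the-art unsupervised temporal action segmentation methods. Code is available at https://github.com/bachlab/SMQ.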
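
As a rough illustration of the patch-and-quantize step described in this entry, the sketch below splits a latent sequence into non-overlapping patches and maps each patch to its nearest codebook entry. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy latent frames, and the fixed two-entry codebook are all hypothetical, and how the paper learns its latent space or codebook is not covered here.

```python
def split_into_patches(sequence, patch_len):
    """Split a list of latent frames into non-overlapping patches.

    Any incomplete tail shorter than patch_len is dropped (an assumption
    made for this sketch).
    """
    n_patches = len(sequence) // patch_len
    return [sequence[i * patch_len:(i + 1) * patch_len] for i in range(n_patches)]


def flatten(patch):
    """Concatenate the frames of one patch into a single vector."""
    return [value for frame in patch for value in frame]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(patches, codebook):
    """Assign each patch the index of its nearest codebook vector."""
    words = []
    for patch in patches:
        vector = flatten(patch)
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(vector, codebook[k]))
        words.append(nearest)
    return words


# Hypothetical toy data: 7 latent frames of dimension 2, patch length 2.
latent = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0],
          [0.0, 0.0], [0.1, 0.1], [5.0, 5.0]]
codebook = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]

patches = split_into_patches(latent, patch_len=2)
print(quantize(patches, codebook))  # [0, 1, 0]
```

Each integer in the output plays the role of one "motion word"; consecutive patches that share a word would fall into the same candidate segment.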

[170] RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection

Tianxiao Li,Zhenglin Huang,Haiquan Wen,Yiwei He,Shuchang Lyu,Baoyuan Wu,Guangliang Cheng

Main category: cs.CV

TL;DR: RAIDX introduces a deepfake detection framework based on RAG and GRPO that improves detection accuracy and explainability.

Details

Motivation: Existing deepfake detection methods lack transparency and cannot explain their decisions, while existing LLM-based approaches offer only coarse-grained analyses and depend on manual annotations. Method: RAIDX combines RAG and GRPO, using external knowledge to improve detection accuracy and automatically generating fine-grained textual explanations and saliency maps. Result: Experiments show that RAIDX achieves state-of-the-art detection performance on multiple benchmarks and provides interpretable rationales. Conclusion: RAIDX is the first unified framework to combine RAG and GRPO, addressing both accuracy and explainability in deepfake detection. Abstract: The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face-specific detectors or general AI-generated detectors, lack transparency by framing detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX's effectiveness in identifying real or fake, and providing interpretable rationales in both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available.

[171] No Masks Needed: Explainable AI for Deriving Segmentation from Classification

Mosong Ma,Tania Stathaki,Michalis Lazarou

Main category: cs.CV

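TL;DR: This study proposes a new medical image segmentation method that fine-tunes pre-trained models and combines them with Explainable AI, achieving strong results on several medical datasets.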

Details

Motivation: Although recent advances in computer vision have explored unsupervised segmentation with pre-trained models, these methods have not transferred well to medical imaging; the authors propose a new method to address this. Method: The method fine-tunes pre-trained models specifically for medical images and integrates Explainable AI to generate relevance scores that improve segmentation accuracy. Result: The method achieves accurate segmentation of medical images, with extensive processing and Explainable AI techniques enhancing the segmentation process. Conclusion: The method achieves improved results on datasets such as CBIS-DDSM, NuInsSeg, and Kvasir-SEG, demonstrating its effectiveness for medical image segmentation. Abstract: Medical image segmentation is vital for modern healthcare and is a key element of computer-aided diagnosis. While recent advancements in computer vision have explored unsupervised segmentation using pre-trained models, these methods have not been translated well to the medical imaging domain. In this work, we introduce a novel approach that fine-tunes pre-trained models specifically for medical images, achieving accurate segmentation with extensive processing. Our method integrates Explainable AI to generate relevance scores, enhancing the segmentation process. Unlike traditional methods that excel in standard benchmarks but falter in medical applications, our approach achieves improved results on datasets like CBIS-DDSM, NuInsSeg and Kvasir-SEG.

[172] TopKD: Top-scaled Knowledge Distillation

Qi Wang,Jinjia Zhou

Main category: cs.CV

TL;DR: This paper introduces TopKD, a new logit-based knowledge distillation framework that improves distillation performance across various architectures without requiring additional modules.

Details

Motivation: The paper is motivated by the observation that critical information in teacher logit distributions is often overlooked in knowledge distillation methods. Method: The proposed TopKD method includes a Top-K Scaling Module (TSM) to amplify informative logits and a Top-K Decoupled Loss (TDL) for targeted supervision, without requiring architectural changes. Result: Experiments show that TopKD outperforms state-of-the-art distillation methods on multiple datasets, including CIFAR-100, ImageNet, STL-10, and Tiny-ImageNet, and it is particularly effective for distilling Vision Transformers. Conclusion: The paper concludes that TopKD, a new logit-based distillation method, significantly enhances knowledge distillation across various network architectures, and its effectiveness is demonstrated through extensive experiments. Abstract: Recent advances in knowledge distillation (KD) predominantly emphasize feature-level knowledge transfer, frequently overlooking critical information embedded within the teacher's logit distributions. In this paper, we revisit logit-based distillation and reveal an underexplored yet critical element: Top-K knowledge. Motivated by this insight, we propose Top-scaled Knowledge Distillation (TopKD), a simple, efficient, and architecture-agnostic framework that significantly enhances logit-based distillation. TopKD consists of two main components: (1) a Top-K Scaling Module (TSM), which adaptively amplifies the most informative logits, and (2) a Top-K Decoupled Loss (TDL), which offers targeted and effective supervision. Notably, TopKD integrates seamlessly into existing KD methods without introducing extra modules or requiring architectural changes. Extensive experiments on CIFAR-100, ImageNet, STL-10, and Tiny-ImageNet demonstrate that TopKD consistently surpasses state-of-the-art distillation methods. 
Moreover, our method demonstrates substantial effectiveness when distilling Vision Transformers, underscoring its versatility across diverse network architectures. These findings highlight the significant potential of logits to advance knowledge distillation.

[173] InceptoFormer: A Multi-Signal Neural Framework for Parkinson's Disease Severity Evaluation from Gait

Safwen Naimi,Arij Said,Wassim Bouachir,Guillaume-Alexandre Bilodeau

Main category: cs.CV

TL;DR: This paper proposes InceptoFormer, which combines Inception1D with a Transformer to evaluate Parkinson's disease severity, reaching 96.6% accuracy.

Details

Motivation: To better evaluate Parkinson's disease severity by analyzing gait dynamics, and to address class imbalance in severity staging. Method: The authors design a framework combining Inception1D and a Transformer, using 1D convolutional filters for multi-scale temporal feature extraction and an oversampling strategy to handle imbalanced data. Result: In experiments, InceptoFormer reaches an accuracy of 96.6%, outperforming existing state-of-the-art methods. Conclusion: InceptoFormer is a multi-signal neural framework for Parkinson's disease severity evaluation; by combining Inception1D with a Transformer-based framework, it significantly improves classification performance. Abstract: We present InceptoFormer, a multi-signal neural framework designed for Parkinson's Disease (PD) severity evaluation via gait dynamics analysis. Our architecture introduces a 1D adaptation of the Inception model, which we refer to as Inception1D, along with a Transformer-based framework to stage PD severity according to the Hoehn and Yahr (H&Y) scale. The Inception1D component captures multi-scale temporal features by employing parallel 1D convolutional filters with varying kernel sizes, thereby extracting features across multiple temporal scales. The transformer component efficiently models long-range dependencies within gait sequences, providing a comprehensive understanding of both local and global patterns. To address the issue of class imbalance in PD severity staging, we propose a data structuring and preprocessing strategy based on oversampling to enhance the representation of underrepresented severity levels. The overall design enables the model to capture fine-grained temporal variations and global dynamics in gait signals, significantly improving classification performance for PD severity evaluation. Through extensive experimentation, InceptoFormer achieves an accuracy of 96.6%, outperforming existing state-of-the-art methods in PD severity assessment. The source code for our implementation is publicly available at https://github.com/SafwenNaimi/InceptoFormer
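
The abstract mentions oversampling to counter class imbalance but does not spell out the procedure. The sketch below shows plain random oversampling, which is one common way to do this and is an assumption here rather than the authors' exact strategy. The function name `random_oversample`, the integer sample IDs, and the toy severity labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: sample IDs with imbalanced severity labels.
samples = list(range(8))
labels = [0, 0, 0, 0, 0, 1, 1, 2]

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Oversampling of this kind should be applied to the training split only; duplicating samples before a train/test split would leak copies of test items into training.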

[174] Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding

Minghang Zheng,Yuxin Peng,Benyuan Sun,Yi Yang,Yang Liu

Main category: cs.CV

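TL;DR: This paper proposes a hierarchical event memory for online video temporal grounding (OnVTG). With an event-proposal-based framework and a future prediction branch, it predicts target events efficiently in real time and achieves state-of-the-art performance on several datasets.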

Details

Motivation: Online video temporal grounding (OnVTG) requires a model to locate events related to a given text query in a video stream without observing future frames. Existing OnVTG models lack effective event modeling and cannot retain long-term historical information, which leads to low performance. Method: The authors propose an event-proposal-based OnVTG framework and introduce a hierarchical event memory to retain historical event information. They also design a future prediction branch that predicts whether the target event will occur shortly and regresses the event's start time. Result: The method achieves state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets. Conclusion: The proposed hierarchical event memory and future prediction branch improve OnVTG performance, giving the model access to both recent and long-term information and enabling efficient real-time prediction. Abstract: In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. The existing OnVTG models employ memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose a hierarchical event memory for OnVTG. We propose an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations. To preserve historically valuable event information, we introduce a hierarchical event memory that retains historical events, allowing the model to access both recent and long-term information. To enable real-time prediction, we further propose a future prediction branch that predicts whether the target event will occur shortly and further regresses the start time of the event. We achieve state-of-the-art performance on the TACoS, ActivityNet Captions, and MAD datasets. Code is available at https://github.com/minghangz/OnVTG.

[175] MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning

Quang-Trung Truong,Yuk-Kwan Wong,Vo Hoang Kim Tuyen Dang,Rinaldi Gotama,Duc Thanh Nguyen,Sai-Kit Yeung

Main category: cs.CV

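TL;DR: The paper proposes a marine object-oriented video captioning approach that combines triplets of video, text, and segmentation masks with video splitting, improving marine video understanding and generation.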

Details

Motivation: Existing video captioning datasets typically focus on generic or human-centric domains and generalize poorly to the complexities of the marine environment, yielding little insight into marine life; a method that can handle the challenges of marine scenes is needed. Method: The paper uses triplets of video, text, and segmentation masks for visual grounding and captioning, and highlights the effectiveness of video splitting for detecting salient object transitions across scene changes. Result: By introducing a new benchmark and method, the paper significantly enriches the semantics of caption content and improves marine video understanding, analysis, and generation. Conclusion: The paper proposes a two-stage marine object-oriented video captioning pipeline and introduces a comprehensive video understanding benchmark that facilitates visual grounding and captioning, improving marine video understanding, analysis, and generation. Abstract: Marine videos present significant challenges for video understanding due to the dynamics of marine objects and the surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment and to yield insights about marine life. To address these limitations, we propose a two-stage marine object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages the triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis, and marine video generation. Additionally, we highlight the effectiveness of video splitting in order to detect salient object transitions in scene changes, which significantly enrich the semantics of captioning content. Our dataset and code have been released at https://msc.hkustvgd.com.

[176] Two-Way Garment Transfer: Unified Diffusion Framework for Dressing and Undressing Synthesis

Angang Zhang,Fang Deng,Hao Chen,Zhongjian Chen,Junyan Li

Main category: cs.CV

TL;DR: This paper introduces TWGTM, a unified framework for virtual try-on and try-off tasks that leverages bidirectional feature disentanglement and dual-conditioned guidance.

Details

Motivation: The inverse task of VTON, known as VTOFF, has been underexplored despite its importance in reconstructing canonical garment templates from dressed humans. Method: TWGTM uses bidirectional feature disentanglement and dual-conditioned guidance from latent and pixel spaces for joint image synthesis, along with a phased training paradigm to address modality gaps. Result: Extensive experiments on the DressCode and VITON-HD datasets demonstrate the effectiveness and competitive advantage of the TWGTM framework. Conclusion: The proposed Two-Way Garment Transfer Model (TWGTM) successfully bridges the gap between virtual try-on (VTON) and virtual try-off (VTOFF), offering a unified framework for clothing-centric image synthesis. Abstract: While recent advances in virtual try-on (VTON) have achieved realistic garment transfer to human subjects, its inverse task, virtual try-off (VTOFF), which aims to reconstruct canonical garment templates from dressed humans, remains critically underexplored and lacks systematic investigation. Existing works predominantly treat them as isolated tasks: VTON focuses on garment dressing while VTOFF addresses garment extraction, thereby neglecting their complementary symmetry. To bridge this fundamental gap, we propose the Two-Way Garment Transfer Model (TWGTM), to the best of our knowledge, the first unified framework for joint clothing-centric image synthesis that simultaneously resolves both mask-guided VTON and mask-free VTOFF through bidirectional feature disentanglement. Specifically, our framework employs dual-conditioned guidance from both latent and pixel spaces of reference images to seamlessly bridge the dual tasks. On the other hand, to resolve the inherent mask dependency asymmetry between mask-guided VTON and mask-free VTOFF, we devise a phased training paradigm that progressively bridges this modality gap. 
Extensive qualitative and quantitative experiments conducted across the DressCode and VITON-HD datasets validate the efficacy and competitive edge of our proposed approach.

[177] Augmentation-based Domain Generalization and Joint Training from Multiple Source Domains for Whole Heart Segmentation

Franz Thaler,Darko Stern,Gernot Plank,Martin Urschler

Main category: cs.CV

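TL;DR: The paper proposes a whole heart segmentation method that addresses domain shift in medical images and can efficiently produce accurate semantic segmentations.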

Details

Motivation: Cardiovascular diseases are the leading cause of death worldwide, which calls for more sophisticated methods to analyze the heart and its substructures for personalized therapy planning. Method: The paper uses a 5-fold ensemble and addresses domain shift through balanced joint training and strong intensity and spatial augmentation. Result: The method performs best on MR data, and its performance on CT data is similar to that of a model trained solely on CT, with 93.33% DSC for CT and 89.30% DSC for MR. Conclusion: The paper proposes an accurate whole heart segmentation method that shows potential to efficiently obtain semantic segmentations from which patient-specific cardiac models can be generated. Abstract: As the leading cause of death worldwide, cardiovascular diseases motivate the development of more sophisticated methods to analyze the heart and its substructures from medical images like Computed Tomography (CT) and Magnetic Resonance (MR). Semantic segmentations of important cardiac structures that represent the whole heart are useful to assess patient-specific cardiac morphology and pathology. Furthermore, accurate semantic segmentations can be used to generate cardiac digital twin models, which allow, e.g., electrophysiological simulation and personalized therapy planning. Even though deep learning-based methods for medical image segmentation achieved great advancements over the last decade, retaining good performance under domain shift (i.e., when training and test data are sampled from different data distributions) remains challenging. In order to perform well on domains known at training-time, we employ a (1) balanced joint training approach that utilizes CT and MR data in equal amounts from different source domains. Further, aiming to alleviate domain shift towards domains only encountered at test-time, we rely on (2) strong intensity and spatial augmentation techniques to greatly diversify the available training data. Our proposed whole heart segmentation method, a 5-fold ensemble with our contributions, achieves the best performance for MR data overall and a performance similar to the best performance for CT data when compared to a model trained solely on CT. With 93.33% DSC and 0.8388 mm ASSD for CT and 89.30% DSC and 1.2411 mm ASSD for MR data, our method demonstrates great potential to efficiently obtain accurate semantic segmentations from which patient-specific cardiac twin models can be generated.
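
The results in this entry are reported as DSC (Dice similarity coefficient), defined for a predicted mask P and a reference mask T as 2|P ∩ T| / (|P| + |T|). The sketch below computes it for flat binary masks. It is only meant to make the metric concrete: the function name and the toy masks are hypothetical, and returning 1.0 for two empty masks is a convention chosen for this sketch.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two flat binary masks.

    Both masks are sequences of 0/1 values of equal length.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical toy masks: 3 foreground voxels each, 2 of them shared.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(f"DSC = {dice_coefficient(pred, target):.4f}")  # DSC = 0.6667
```

For multi-structure tasks such as whole heart segmentation, a score like this is typically computed per label and then averaged; the averaging scheme used in the paper is not stated in this entry.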

[178] One Model For All: Partial Diffusion for Unified Try-On and Try-Off in Any Pose

Jinxi Liu,Zijian He,Guangrun Wang,Guanbin Li,Liang Lin

Main category: cs.CV

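TL;DR: OMFA proposes a new diffusion framework that requires neither exhibition garments nor segmentation masks and supports virtual try-on and try-off in arbitrary poses, improving applicability to real-world scenarios.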

Details

Motivation: Existing methods are constrained by their reliance on exhibition garments and segmentation masks and by their limited ability to handle flexible pose variations, which makes them hard to apply in real-world scenarios. Method: OMFA builds on a partial diffusion strategy that selectively applies noise and denoising to individual components of the input, and uses SMPL-X-based pose conditioning to enable multi-view and arbitrary-pose try-on. Result: OMFA achieves state-of-the-art results on both virtual try-on and try-off tasks, providing a practical and generalizable solution for virtual garment synthesis. Conclusion: OMFA is a unified diffusion framework for virtual try-on and try-off that needs no exhibition garments and supports arbitrary poses, improving practicality in real-world scenarios. Abstract: Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios; for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce OMFA (One Model For All), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. For example, OMFA enables removing garments from a source person (try-off) and transferring them onto a target person (try-on), while also allowing the generated target to appear in novel poses, even without access to multi-pose images of that person. OMFA is built upon a novel partial diffusion strategy that selectively applies noise and denoising to individual components of the joint input (such as the garment, the person image, or the face), enabling dynamic subtask control and efficient bidirectional garment-person transformation. The framework is entirely mask-free and requires only a single portrait and a target pose as input, making it well-suited for real-world applications. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. 
Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical and generalizable solution for virtual garment synthesis. The project page is here: https://onemodelforall.github.io/.

[179] Drone Detection with Event Cameras

Gabriele Magrini,Lorenzo Berlincioni,Luca Cultrera,Federico Becattini,Pietro Pala

Main category: cs.CV

TL;DR: This paper highlights the potential of event-based vision as a robust solution for detecting fast-moving and small drones, offering advantages over traditional surveillance systems and providing a foundation for next-generation counter-UAV technology.

Details

Motivation: Traditional surveillance systems, such as conventional frame-based cameras, face significant challenges in reliably detecting small, highly agile drones due to motion blur and poor performance in difficult lighting conditions. Method: This paper surveys the state-of-the-art in event-based vision for drone detection, covering data representation methods, advanced processing pipelines like spiking neural networks, and methodologies for tasks such as tracking, trajectory forecasting, and propeller signature analysis. Result: Event cameras eliminate motion blur, enable consistent detection in extreme lighting conditions, and suppress static backgrounds through sparse, asynchronous outputs, allowing for low-latency focus on motion cues. Conclusion: Event-based vision offers a promising foundation for reliable, efficient, and low-latency counter-UAV systems, surpassing traditional frame-based cameras in detecting agile and small drones. Abstract: The diffusion of drones presents significant security and safety challenges. Traditional surveillance systems, particularly conventional frame-based cameras, struggle to reliably detect these targets due to their small size, high agility, and the resulting motion blur and poor performance in challenging lighting conditions. This paper surveys the emerging field of event-based vision as a robust solution to these problems. Event cameras virtually eliminate motion blur and enable consistent detection in extreme lighting. Their sparse, asynchronous output suppresses static backgrounds, enabling low-latency focus on motion cues. We review the state-of-the-art in event-based drone detection, from data representation methods to advanced processing pipelines using spiking neural networks. The discussion extends beyond simple detection to cover more sophisticated tasks such as real-time tracking, trajectory forecasting, and unique identification through propeller signature analysis. 
By examining current methodologies, available datasets, and the distinct advantages of the technology, this work demonstrates that event-based vision provides a powerful foundation for the next generation of reliable, low-latency, and efficient counter-UAV systems.
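
To make the idea of an event data representation concrete, the sketch below accumulates a stream of events into a signed per-pixel count image, one of the simplest frame-like representations used with event cameras. It is a generic illustration under assumptions, not a method taken from the survey: the `(x, y, t, polarity)` tuple layout, the function name, and the toy events are hypothetical.

```python
def events_to_frame(events, width, height):
    """Accumulate (x, y, t, polarity) events into a signed count image.

    Positive-polarity events add 1 to their pixel, negative-polarity
    events subtract 1. Events outside the sensor area are ignored.
    Timestamps are carried along but unused in this simple representation.
    """
    frame = [[0] * width for _ in range(height)]
    for x, y, _timestamp, polarity in events:
        if 0 <= x < width and 0 <= y < height:
            frame[y][x] += 1 if polarity > 0 else -1
    return frame


# Hypothetical toy stream on a 3x2 sensor; the last event is out of bounds.
events = [
    (0, 0, 0.001, 1),
    (0, 0, 0.002, 1),
    (2, 1, 0.003, -1),
    (5, 5, 0.004, 1),
]
print(events_to_frame(events, width=3, height=2))  # [[2, 0, 0], [0, 0, -1]]
```

Pixels that see no brightness change stay at zero, which is how a static background drops out of the representation while a moving target leaves a trace.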

[180] TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning

Yunbi Liu,Enqi Tang,Shiyu Li,Lei Ma,Juncheng Li,Shu Lou,Yongchu Pan,Qingshan Liu

Main category: cs.CV

TL;DR: TAlignDiff introduces a new method for automatic tooth alignment in orthodontic treatment by combining point cloud-based regression and diffusion-based transformation modeling.

Details

Motivation: Current deep learning approaches fail to capture the latent distribution of transformation matrices related to the anatomical structure of the oral cavity. Method: TAlignDiff combines a point cloud-based regression network (PRN) and a diffusion-based transformation matrix denoising module (DTMD) for automatic tooth alignment. Result: Extensive ablation and comparative experiments validate the effectiveness and potential of TAlignDiff in tooth alignment. Conclusion: TAlignDiff demonstrates effectiveness and superiority in orthodontic treatment by integrating geometric constraints and diffusion refinement. Abstract: Orthodontic treatment hinges on tooth alignment, which significantly affects occlusal function, facial aesthetics, and patients' quality of life. Current deep learning approaches predominantly concentrate on predicting transformation matrices through imposing point-to-point geometric constraints for tooth alignment. Nevertheless, these matrices are likely associated with the anatomical structure of the human oral cavity and possess particular distribution characteristics that the deterministic point-to-point geometric constraints in prior work fail to capture. To address this, we introduce a new automatic tooth alignment method named TAlignDiff, which is supported by diffusion-based transformation learning. TAlignDiff comprises two main components: a primary point cloud-based regression network (PRN) and a diffusion-based transformation matrix denoising module (DTMD). Geometry-constrained losses supervise PRN learning for point cloud-level alignment. DTMD, as an auxiliary module, learns the latent distribution of transformation matrices from clinical data. We integrate point cloud-based transformation regression and diffusion-based transformation modeling into a unified framework, allowing bidirectional feedback between geometric constraints and diffusion refinement. 
Extensive ablation and comparative experiments demonstrate the effectiveness and superiority of our method, highlighting its potential in orthodontic treatment.

Jinxing Zhou,Ziheng Zhou,Yanghao Zhou,Yuxin Mao,Zhangling Duan,Dan Guo

Main category: cs.CV

TL;DR: This paper introduces a new approach to tackle the weakly-supervised Dense Audio-Visual Event Localization task, achieving top performance on two datasets.

Details

Motivation: The motivation is to explore the Dense Audio-Visual Event Localization task under a more challenging weakly-supervised setting where only video-level event labels are provided. Method: The paper proposes a method involving a Mutual Event Agreement Evaluation module, a Cross-modal Salient Anchor Identification module, and an Anchor-based Temporal Propagation module to address the W-DAVEL task. Result: The result is the establishment of benchmarks for W-DAVEL on the UnAV-100 and ActivityNet1.3 datasets, showing that the proposed method performs at a state-of-the-art level. Conclusion: The paper concludes that their proposed method achieves state-of-the-art performance on the W-DAVEL task, as demonstrated through extensive experiments on two datasets. Abstract: The Dense Audio-Visual Event Localization (DAVEL) task aims to temporally localize events in untrimmed videos that occur simultaneously in both the audio and visual modalities. This paper explores DAVEL under a new and more challenging weakly-supervised setting (W-DAVEL task), where only video-level event labels are provided and the temporal boundaries of each event are unknown. We address W-DAVEL by exploiting cross-modal salient anchors, which are defined as reliable timestamps that are well predicted under weak supervision and exhibit highly consistent event semantics across audio and visual modalities. Specifically, we propose a Mutual Event Agreement Evaluation module, which generates an agreement score by measuring the discrepancy between the predicted audio and visual event classes. Then, the agreement score is utilized in a Cross-modal Salient Anchor Identification module, which identifies the audio and visual anchor features through global-video and local temporal window identification mechanisms. 
The anchor features after multimodal integration are fed into an Anchor-based Temporal Propagation module to enhance event semantic encoding in the original temporal audio and visual features, facilitating better temporal localization under weak supervision. We establish benchmarks for W-DAVEL on both the UnAV-100 and ActivityNet1.3 datasets. Extensive experiments demonstrate that our method achieves state-of-the-art performance.

[182] DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling

Yijie Li,Wei Zhang,Xi Zhu,Ye Wu,Yogesh Rathi,Lauren J. O'Donnell,Fan Zhang

Main category: cs.CV

TL;DR: DDTracking is a new deep generative model for diffusion MRI tractography that outperforms existing methods and generalizes well across datasets.

Details

Motivation: To improve on current state-of-the-art tractography methods and provide an anatomically plausible, robust solution. Method: A dual-pathway encoding network jointly models local spatial encoding and global temporal dependencies, and an end-to-end trainable conditional diffusion model module predicts streamline propagation orientations. Result: Experiments on two benchmarks with ground truth (ISMRM Challenge and TractoInferno) show that DDTracking largely outperforms current state-of-the-art tractography methods and generalizes strongly across datasets. Conclusion: DDTracking is a new deep generative framework for diffusion MRI tractography that offers a scalable, adaptable, and end-to-end learnable solution for broad dMRI applications. Abstract: This paper presents DDTracking, a novel deep generative framework for diffusion MRI tractography that formulates streamline propagation as a conditional denoising diffusion process. In DDTracking, we introduce a dual-pathway encoding network that jointly models local spatial encoding (capturing fine-scale structural details at each streamline point) and global temporal dependencies (ensuring long-range consistency across the entire streamline). Furthermore, we design a conditional diffusion model module, which leverages the learned local and global embeddings to predict streamline propagation orientations for tractography in an end-to-end trainable manner. We conduct a comprehensive evaluation across diverse, independently acquired dMRI datasets, including both synthetic and clinical data. Experiments on two well-established benchmarks with ground truth (ISMRM Challenge and TractoInferno) demonstrate that DDTracking largely outperforms current state-of-the-art tractography methods. Furthermore, our results highlight DDTracking's strong generalizability across heterogeneous datasets, spanning varying health conditions, age groups, imaging protocols, and scanner types. Collectively, DDTracking offers anatomically plausible and robust tractography, presenting a scalable, adaptable, and end-to-end learnable solution for broad dMRI applications. Code is available at: https://github.com/yishengpoxiao/DDtracking.git

[183] Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding

Jun Li,Che Liu,Wenjia Bai,Mingxuan Liu,Rossella Arcucci,Cosmin I. Bercea,Julia A. Schnabel

Main category: cs.CV

TL;DR: K2Sight is a framework that improves the localization of clinical findings in medical images by integrating structured semantic supervision derived from domain ontologies, leading to efficient training of compact models that perform as well as larger models with significantly less data.

Details

Motivation: Generalist Vision-Language Models (VLMs) struggle in the medical domain due to rare, compositional, and domain-specific terms, while specialized medical VLMs require substantial annotation and computational resources. Method: K2Sight introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location, distilled from domain ontologies and encoded into instruction-style prompts. Result: Compact models trained with K2Sight achieve performance on par with or better than 7B+ medical VLMs, with up to 9.82% improvement in $mAP_{50}$, using only 1.5% of the data required by state-of-the-art medical VLMs. Conclusion: The proposed K2Sight framework effectively bridges domain knowledge and spatial structure for data-efficient training of compact models in grounding abnormalities in medical images. Abstract: In this work, we address the problem of grounding abnormalities in medical images, where the goal is to localize clinical findings based on textual descriptions. While generalist Vision-Language Models (VLMs) excel in natural grounding tasks, they often struggle in the medical domain due to rare, compositional, and domain-specific terms that are poorly aligned with visual patterns. Specialized medical VLMs address this challenge via large-scale domain pretraining, but at the cost of substantial annotation and computational resources. To overcome these limitations, we propose \textbf{Knowledge to Sight (K2Sight)}, a framework that introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location. These attributes are distilled from domain ontologies and encoded into concise instruction-style prompts, which guide region-text alignment during training. 
Unlike conventional report-level supervision, our approach explicitly bridges domain knowledge and spatial structure, enabling data-efficient training of compact models. We train compact models with 0.23B and 2B parameters using only 1.5% of the data required by state-of-the-art medical VLMs. Despite their small size and limited training data, these models achieve performance on par with or better than 7B+ medical VLMs, with up to 9.82% improvement in $mAP_{50}$. Code and models: https://lijunrio.github.io/K2Sight/.

[184] Visual Bias and Interpretability in Deep Learning for Dermatological Image Analysis

Enam Ahmed Taufik,Abdullah Khondoker,Antara Firoz Parsa,Seraj Al Mahmud Mostafa

Main category: cs.CV

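TL;DR: This study proposes a deep learning framework for multi-class skin disease classification and finds that DinoV2 with RGB pre-processing performs best in both accuracy and interpretability.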

Details

Motivation: Accurate skin disease classification is critical yet challenging because of high inter-class similarity, intra-class variability, and complex lesion textures, which motivates improved computer-aided diagnosis methods. Method: A deep learning framework for multi-class skin disease classification that systematically evaluates three image pre-processing techniques (standard RGB, CMY color space transformation, and CLAHE) and compares several pre-trained CNN and transformer models. Result: DinoV2 with RGB pre-processing achieves the highest accuracy (up to 93%) and F1-scores, and Grad-CAM visualizations improve the interpretability of lesion localization. Conclusion: Choosing appropriate pre-processing and model architecture is essential for building robust and explainable CAD systems for dermatology, with DinoV2 and RGB pre-processing performing best. Abstract: Accurate skin disease classification is a critical yet challenging task due to high inter-class similarity, intra-class variability, and complex lesion textures. While deep learning-based computer-aided diagnosis (CAD) systems have shown promise in automating dermatological assessments, their performance is highly dependent on image pre-processing and model architecture. This study proposes a deep learning framework for multi-class skin disease classification, systematically evaluating three image pre-processing techniques: standard RGB, CMY color space transformation, and Contrast Limited Adaptive Histogram Equalization (CLAHE). We benchmark the performance of pre-trained convolutional neural networks (DenseNet201, Efficient-NetB5) and transformer-based models (ViT, Swin Transformer, DinoV2 Large) using accuracy and F1-score as evaluation metrics. Results show that DinoV2 with RGB pre-processing achieves the highest accuracy (up to 93%) and F1-scores across all variants. Grad-CAM visualizations applied to RGB inputs further reveal precise lesion localization, enhancing interpretability. These findings underscore the importance of effective pre-processing and model choice in building robust and explainable CAD systems for dermatology.
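
Of the three pre-processing variants named above, the CMY transformation is the simplest to show concretely. The sketch below uses the standard subtractive complement for 8-bit channels (C = 255 - R, and likewise for M and Y). The abstract does not say how the authors implemented it, so the helper names `rgb_to_cmy` and `image_to_cmy` and the nested-list image layout are assumptions.

```python
def rgb_to_cmy(pixel):
    """Convert one 8-bit RGB pixel to CMY via the subtractive complement."""
    r, g, b = pixel
    for channel in (r, g, b):
        if not 0 <= channel <= 255:
            raise ValueError("channels must be in the range 0..255")
    return (255 - r, 255 - g, 255 - b)


def image_to_cmy(image):
    """Apply rgb_to_cmy to every pixel of an image stored as rows of RGB tuples."""
    return [[rgb_to_cmy(p) for p in row] for row in image]


tiny_image = [[(255, 0, 0), (0, 0, 0)],
              [(10, 20, 30), (255, 255, 255)]]
cmy_image = image_to_cmy(tiny_image)
```

Pure red maps to (0, 255, 255) and white maps to (0, 0, 0), which is why lesion colors and contrast look inverted relative to the RGB input.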

[185] Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan

Marta Moscati,Ahmed Abdullah,Muhammad Saad Saeed,Shah Nawaz,Rohan Kumar Das,Muhammad Zaigham Zaheer,Junaid Mir,Muhammad Haroon Yousaf,Khalid Malik,Markus Schedl

Main category: cs.CV

TL;DR: The FAME 2026 Challenge explores face-voice association in multilingual environments using the MAV-Celeb dataset, addressing the unique dynamics of bilingual communication.

Details

Motivation: The motivation stems from the increasing bilingual global population and the need to understand face-voice correlations in multilingual communication scenarios. Method: The study introduces the FAME Challenge, which uses the MAV-Celeb dataset to analyze face-voice associations in multilingual contexts. Result: The report establishes the FAME Challenge and describes its dataset, baseline models, and tasks for analyzing audio-visual associations in multilingual settings. Conclusion: The FAME 2026 Challenge provides a framework for exploring face-voice association in multilingual environments using the MAV-Celeb dataset. Abstract: Advances in technology have led to the use of multimodal systems in various real-world applications, and audio-visual systems are among the most widely used. In recent years, associating the face and voice of a person has gained attention due to the unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) 2026 Challenge focuses on exploring face-voice association under the unique condition of a multilingual scenario. This condition is inspired by the fact that half of the world's population is bilingual and people most often communicate in multilingual scenarios. The challenge uses a dataset named Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report describes the challenge, dataset, baseline models, and tasks of the FAME Challenge.

[186] Pseudo Depth Meets Gaussian: A Feed-forward RGB SLAM Baseline

Linqing Zhao,Xiuwei Xu,Yirui Wang,Hao Wang,Wenzhao Zheng,Yansong Tang,Haibin Yan,Jiwen Lu

Main category: cs.CV

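TL;DR: This paper proposes an efficient online 3D reconstruction method that combines 3D Gaussian SLAM with a feed-forward prediction module, greatly speeding up tracking while maintaining performance.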

Details

Motivation: Existing 3D reconstruction methods either struggle with long sequences or depend on slow test-time optimization and depth sensors, so a faster and more efficient approach is needed. Method: A depth estimator is integrated into an RGB-D SLAM system, 3D Gaussian mapping is used to handle inaccurate geometric details in the predicted depth, and a feed-forward recurrent prediction module and a local graph rendering technique are introduced. Result: Experiments on the Replica and TUM-RGBD datasets show performance on par with SplaTAM while reducing tracking time by more than 90%. Conclusion: The paper proposes an online 3D reconstruction method based on 3D Gaussian SLAM, combined with a feed-forward recurrent prediction module and local graph rendering, that significantly improves tracking speed while keeping performance comparable to the state of the art. Abstract: Incrementally recovering real-sized 3D geometry from a pose-free RGB stream is a challenging task in 3D reconstruction, requiring minimal assumptions on input data. Existing methods can be broadly categorized into end-to-end and visual SLAM-based approaches, both of which either struggle with long sequences or depend on slow test-time optimization and depth sensors. To address this, we first integrate a depth estimator into an RGB-D SLAM system, but this approach is hindered by inaccurate geometric details in predicted depth. Through further investigation, we find that 3D Gaussian mapping can effectively solve this problem. Building on this, we propose an online 3D reconstruction method using 3D Gaussian-based SLAM, combined with a feed-forward recurrent prediction module to directly infer camera pose from optical flow. This approach replaces slow test-time optimization with fast network inference, significantly improving tracking speed. Additionally, we introduce a local graph rendering technique to enhance robustness in feed-forward pose prediction. Experimental results on the Replica and TUM-RGBD datasets, along with a real-world deployment demonstration, show that our method achieves performance on par with the state-of-the-art SplaTAM, while reducing tracking time by more than 90%.

[187] OmniDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment

Tongfan Guan,Jiaxin Guo,Chen Wang,Yun-Hui Liu

Main category: cs.CV

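TL;DR: OmniDepth is a unified framework that combines the strengths of monocular and stereo depth estimation, improving depth accuracy through a bidirectional alignment mechanism.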

Details

Motivation: Monocular methods lack geometric precision, while stereo methods struggle with reflective or textureless surfaces, so the strengths of both need to be combined to improve depth estimation. Method: The OmniDepth framework bridges monocular and stereo methods through iterative bidirectional alignment of their latent representations, using a new cross-attentive alignment mechanism that synchronizes monocular contextual cues with stereo hypothesis representations. Result: OmniDepth reduces zero-shot generalization error by more than 40% on Middlebury and ETH3D and performs well on transparent and reflective surfaces. Conclusion: OmniDepth addresses the limitations of monocular and stereo depth estimation within a unified framework, enabling more robust 3D perception. Abstract: Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: OmniDepth reduces zero-shot generalization error by >40% on Middlebury and ETH3D, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, OmniDepth enables robust 3D perception that transcends modality-specific limitations. Codes available at https://github.com/aeolusguan/OmniDepth.

[188] How Does Bilateral Ear Symmetry Affect Deep Ear Features?

Kagan Ozturk,Deeksha Arun,Kevin W. Bowyer,Patrick Flynn

Main category: cs.CV

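TL;DR: This paper studies how bilateral ear symmetry affects CNN-based ear recognition, proposes an ear side classifier, and shows that treating left and right ears separately during training and testing improves recognition performance.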

Details

Motivation: Although convolutional neural networks are widely used to learn features directly from raw ear images, the effect of bilateral ear symmetry on feature learning has received little attention. Method: The authors develop an ear side classifier, incorporate side information during training and testing, and validate the approach through cross-dataset evaluation and ablation studies. Result: Treating left and right ears separately during training and testing leads to notable performance improvements, and the ablation studies yield practical training recommendations. Conclusion: Accounting for left-right ear symmetry can notably improve the performance of CNN-based ear recognition systems. Abstract: Ear recognition has gained attention as a reliable biometric technique due to the distinctive characteristics of human ears. With the increasing availability of large-scale datasets, convolutional neural networks (CNNs) have been widely adopted to learn features directly from raw ear images, outperforming traditional hand-crafted methods. However, the effect of bilateral ear symmetry on the features learned by CNNs has received little attention in recent studies. In this paper, we investigate how bilateral ear symmetry influences the effectiveness of CNN-based ear recognition. To this end, we first develop an ear side classifier to automatically categorize ear images as either left or right. We then explore the impact of incorporating this side information during both training and testing. Cross-dataset evaluations are conducted on five datasets. Our results suggest that treating left and right ears separately during training and testing can lead to notable performance improvements. Furthermore, our ablation studies on alignment strategies, input sizes, and various hyperparameter settings provide practical insights into training CNN-based ear recognition systems on large-scale datasets to achieve higher verification rates.

[189] FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging

Zichen Tang,Haihong E,Jiacheng Liu,Zhongjun Yang,Rongjin Li,Zihua Rong,Haoyang He,Zhuodi Hao,Xinyang Hu,Kun Ji,Ziyan Ma,Mengyuan Ji,Jun Zhang,Chenghao Ma,Qianhe Zheng,Yang Liu,Yiling Huang,Xinyi Hu,Qing Huang,Zijian Xie,Shiyao Peng

Main category: cs.CV

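TL;DR: FinMMR is a new financial multimodal benchmark for evaluating the reasoning capabilities of multimodal large language models on financial numerical reasoning tasks.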

Details

Motivation: Existing benchmarks fall short in multimodality, comprehensiveness, and difficulty. Method: The authors build a financial multimodal benchmark with 4.3K questions and 8.7K images covering 14 financial subdomains. Result: The best-performing MLLM achieves only 53.0% accuracy on Hard problems. Conclusion: FinMMR is expected to drive advances in the reasoning capabilities of MLLMs in real-world scenarios. Abstract: We present FinMMR, a novel bilingual multimodal benchmark tailored to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in financial numerical reasoning tasks. Compared to existing benchmarks, our work introduces three significant advancements. (1) Multimodality: We meticulously transform existing financial reasoning benchmarks, and construct novel questions from the latest Chinese financial research reports. FinMMR comprises 4.3K questions and 8.7K images spanning 14 categories, including tables, bar charts, and ownership structure charts. (2) Comprehensiveness: FinMMR encompasses 14 financial subdomains, including corporate finance, banking, and industry analysis, significantly exceeding existing benchmarks in financial domain knowledge breadth. (3) Challenge: Models are required to perform multi-step precise numerical reasoning by integrating financial knowledge with the understanding of complex financial images and text. The best-performing MLLM achieves only 53.0% accuracy on Hard problems. We believe that FinMMR will drive advancements in enhancing the reasoning capabilities of MLLMs in real-world scenarios.

[190] EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts

Kushin Mukherjee,Donghao Ren,Dominik Moritz,Yannick Assogba

Main category: cs.CV

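TL;DR: The EncQA benchmark reveals significant variation in the visual reasoning abilities of multimodal vision-language models on chart understanding, indicating a need for targeted improvements rather than simply scaling up models.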

Details

Motivation: Although multimodal vision-language models (VLMs) keep improving on chart understanding benchmarks, this progress does not fully reflect the visual reasoning capabilities needed to interpret charts. Method: The authors introduce EncQA, a new benchmark informed by the visualization literature, with 2,076 synthetic question-answer pairs covering six visual encoding channels and eight analytic tasks. Result: An evaluation of 9 state-of-the-art VLMs shows that performance varies significantly across encodings within the same task and across tasks, and that for many task-encoding pairs larger models do not perform better. Conclusion: EncQA indicates that scaling up model or dataset size alone is not enough to advance chart understanding, and that specific visual reasoning gaps need targeted strategies. Abstract: Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks. Yet, we find that this progress does not fully capture the breadth of visual reasoning capabilities essential for interpreting charts. We introduce EncQA, a novel benchmark informed by the visualization literature, designed to provide systematic coverage of visual encodings and analytic tasks that are crucial for chart understanding. EncQA provides 2,076 synthetic question-answer pairs, enabling balanced coverage of six visual encoding channels (position, length, area, color quantitative, color nominal, and shape) and eight tasks (find extrema, retrieve value, find anomaly, filter values, compute derived value exact, compute derived value relative, correlate values, and correlate values relative). Our evaluation of 9 state-of-the-art VLMs reveals that performance varies significantly across encodings within the same task, as well as across tasks. Contrary to expectations, we observe that performance does not improve with model size for many task-encoding pairs. Our results suggest that advancing chart understanding requires targeted strategies addressing specific visual reasoning gaps, rather than solely scaling up model or dataset size.

[191] X-SAM: From Segment Anything to Any Segmentation

Hao Wang,Limeng Qiao,Zequn Jie,Zhijian Huang,Chengjian Feng,Qingfang Zheng,Lin Ma,Xiangyuan Lan,Xiaodan Liang

Main category: cs.CV

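TL;DR: X-SAM is an improved multimodal large language model framework that strengthens pixel-level visual understanding by introducing a new segmentation task and a unified training strategy.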

Details

Motivation: Existing large language models are deficient in pixel-level perceptual understanding, and the Segment Anything Model (SAM) has notable limitations in multi-mask prediction and category-specific segmentation and cannot integrate all segmentation tasks within a unified model architecture. Method: X-SAM introduces a new segmentation task, Visual GrounDed (VGD) segmentation, and a unified training strategy to enable more advanced pixel-level perceptual comprehension. Result: Experiments show that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its effectiveness for multimodal, pixel-level visual understanding. Conclusion: X-SAM is an effective multimodal large language model framework that improves on existing image segmentation paradigms and offers stronger pixel-level perceptual understanding. Abstract: Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from "segment anything" to "any segmentation". Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.

[192] YOLOv8-Based Deep Learning Model for Automated Poultry Disease Detection and Health Monitoring paper

Akhil Saketh Reddy Sabbella,Ch. Lakshmi Prachothan,Eswar Kumar Panta

Main category: cs.CV

TL;DR: This study uses a YOLO v8-based AI system to automatically detect chicken diseases in real time, improving farm management.

Details

Motivation: Conventional manual observation is labor-intensive and error-prone, so an automated method that detects chicken diseases more accurately and in real time is needed. Method: A YOLO v8 deep learning model analyzes high-resolution chicken photos and detects abnormalities in behavior and appearance to identify disease. Result: The system identifies infected chickens accurately in real time and can alert farm operators immediately, offering an effective and scalable solution. Conclusion: The AI-based system automates chicken disease detection, improving biosecurity on large-scale farms, removing the need for human inspection, and facilitating early infection identification. Abstract: In the poultry industry, detecting chicken illnesses is essential to avoid financial losses. Conventional techniques depend on manual observation, which is laborious and prone to mistakes. This study proposes an AI-based approach using YOLO v8, a deep learning model for real-time object recognition. By analyzing high-resolution chicken photos, the system detects signs of illness, such as abnormalities in behavior and appearance. A sizable, annotated dataset has been used to train the algorithm, which provides accurate real-time identification of infected chickens and timely warnings to farm operators for prompt action. By facilitating early infection identification, eliminating the need for human inspection, and enhancing biosecurity in large-scale farms, this AI technology improves chicken health management. The real-time features of YOLO v8 provide a scalable and effective method for improving farm management techniques.

[193] PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment

Gustav Hanning,Kalle Åström,Viktor Larsson

Main category: cs.CV

TL;DR: PixCuboid is an optimization-based method for room layout estimation that outperforms current methods and can be extended to multi-room environments.

Details

Motivation: Coarse room layout estimation provides important geometric cues for many downstream tasks, but current methods are limited to single views and often assume panoramic images. Method: PixCuboid uses multi-view alignment of dense deep features for room layout estimation, with end-to-end training that learns feature maps yielding large convergence basins and smooth loss landscapes. Result: The authors validate their approach in thorough experiments and report significant improvements over existing methods. They also introduce two new benchmarks based on ScanNet++ and 2D-3D-Semantics with manually verified ground truth 3D cuboids. Conclusion: PixCuboid, an optimization-based approach for cuboid-shaped room layout estimation, outperforms current state-of-the-art methods and can be extended to multi-room estimation. Abstract: Coarse room layout estimation provides important geometric cues for many downstream tasks. Current state-of-the-art methods are predominantly based on single views and often assume panoramic images. We introduce PixCuboid, an optimization-based approach for cuboid-shaped room layout estimation, which is based on multi-view alignment of dense deep features. By training with the optimization end-to-end, we learn feature maps that yield large convergence basins and smooth loss landscapes in the alignment. This allows us to initialize the room layout using simple heuristics. For the evaluation we propose two new benchmarks based on ScanNet++ and 2D-3D-Semantics, with manually verified ground truth 3D cuboids. In thorough experiments we validate our approach and significantly outperform the competition. Finally, while our network is trained with single cuboids, the flexibility of the optimization-based approach allows us to easily extend to multi-room estimation, e.g. larger apartments or offices. Code and model weights are available at https://github.com/ghanning/PixCuboid.

[194] HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

Young D. Kwon,Rui Li,Sijia Li,Da Li,Sourav Bhattacharya,Stylianos I. Venieris

Main category: cs.CV

TL;DR: This paper proposes HierarchicalPrune, a compression framework for text-to-image diffusion models that enables efficient on-device inference while preserving image generation quality.

Details

Motivation: The large parameter scale of state-of-the-art text-to-image diffusion models poses challenges for resource-constrained devices. Method: HierarchicalPrune combines three techniques: Hierarchical Position Pruning, Positional Weight Preservation, and Sensitivity-Guided Distillation. Result: HierarchicalPrune achieves 77.5-80.4% memory footprint reduction and 27.9-38.0% latency reduction with minimal quality drop. Conclusion: HierarchicalPrune significantly compresses diffusion models for efficient on-device inference while maintaining high output quality. Abstract: State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inferences on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise sensitivity variations. As a result, our framework brings billion-scale diffusion models into a range more suitable for on-device inference, while preserving the quality of the output images. 
Specifically, when combined with INT4 weight quantisation, HierarchicalPrune achieves 77.5-80.4% memory footprint reduction (e.g., from 15.8 GB to 3.2 GB) and 27.9-38.0% latency reduction, measured on server and consumer grade GPUs, with the minimum drop of 2.6% in GenEval score and 7% in HPSv2 score compared to the original model. Last but not least, our comprehensive user study with 85 participants demonstrates that HierarchicalPrune maintains perceptual quality comparable to the original model while significantly outperforming prior works.
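
The memory figures quoted above can be checked with a line of arithmetic. The helper name `reduction_percent` is hypothetical. The sketch only confirms that the 15.8 GB to 3.2 GB example falls inside the reported 77.5-80.4% range.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before * 100.0


# Example from the abstract: 15.8 GB shrinks to 3.2 GB.
memory_reduction = reduction_percent(15.8, 3.2)
print(round(memory_reduction, 1))  # prints 79.7
```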

[195] ANPrompt: Anti-noise Prompt Tuning for Vision-Language Models

Yansheng Gao,Yufei Zheng,Jinghan Qu,Zixi Zhu,Yukuan Zhang,Shengsheng Wang

Main category: cs.CV

TL;DR: ANPrompt is a novel prompt tuning framework that enhances the robustness of vision-language models to semantic perturbations, improving generalization to unseen classes.

Details

Motivation: Existing prompt tuning methods often overlook the vulnerability of vision-language models to weak semantic perturbations, which degrades their generalization to unseen classes. Method: The ANPrompt framework constructs weak noise text features, generates anti-noise prompts, computes the Noise-Resistant Visual Prompt Prototype, and introduces alignment, robustness, and anti-noise objectives. Result: Experiments across 11 benchmarks demonstrate that ANPrompt consistently outperforms existing prompt tuning approaches, achieving superior robustness to semantic noise and improved generalization to novel categories. Conclusion: ANPrompt effectively enhances the robustness of vision-language models to semantic perturbations and improves generalization to novel categories. Abstract: Prompt tuning has emerged as an efficient and effective technique for adapting vision-language models (VLMs) with low computational overhead. However, existing methods often overlook the vulnerability of prompt-tuned VLMs to weak semantic perturbations-such as subtle image or text noise-that degrade their generalization to unseen classes. To address this limitation, we propose ANPrompt, a novel prompt tuning framework designed to enhance robustness under such perturbations. ANPrompt first constructs weak noise text features by fusing original and noise-perturbed text embeddings, which are then clustered to form noise prompts. These noise prompts are integrated with learnable prompt tokens to generate anti-noise prompts, which are injected into the deeper layers of both image and text encoders. To further capture the noise-aware visual semantics, ANPrompt computes the Noise-Resistant Visual Prompt Prototype (NRVPP) by averaging the output prompt tokens from the vision encoder. Finally, ANPrompt introduces alignment, robustness, and anti-noise objectives by computing a Weak semantic noise Alignment Loss (WALoss) alongside the standard cross-entropy and sim loss. 
Experiments across 11 benchmarks demonstrate that ANPrompt consistently outperforms existing prompt tuning approaches, achieving superior robustness to semantic noise and improved generalization to novel categories.

[196] Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions

Liang Xu,Chengqun Yang,Zili Lin,Fei Xu,Yifan Liu,Congsheng Xu,Yiyi Zhang,Jie Qin,Xingdong Sheng,Yunhui Liu,Xin Jin,Yichao Yan,Wenjun Zeng,Xiaokang Yang

Main category: cs.CV

TL;DR: This paper introduces InterVLA, a new large-scale dataset for AI assistants, combining vision, language, and action data from human-object-human interactions.

Details

Motivation: To build general-purpose intelligent assistants, there is a need for datasets that capture both generalist interaction knowledge and egocentric perception. Method: Embedding manual-assisted tasks into a vision-language-action framework using a hybrid RGB-MoCap system to generate multimodal interaction data. Result: The creation of the InterVLA dataset with 11.4 hours and 1.2M frames of multimodal data, including egocentric and exocentric videos, human/object motions, and verbal commands. Conclusion: InterVLA is the first large-scale human-object-human interaction dataset that provides a comprehensive foundation for future AI agent development in the physical world. Abstract: Learning action models from real-world human-centric interaction datasets is important for efficiently building general-purpose intelligent assistants. However, most existing datasets only offer specialist interaction categories and ignore that AI assistants perceive and act based on first-person acquisition. We argue that both generalist interaction knowledge and the egocentric modality are indispensable. In this paper, we embed the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. With our hybrid RGB-MoCap system, pairs of assistants and instructors engage with multiple objects and the scene following GPT-generated scripts. Under this setting, we build InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data, spanning 2 egocentric and 5 exocentric videos, accurate human/object motions and verbal commands. Furthermore, we establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis. We believe that our InterVLA testbed and the benchmarks will foster future works on building AI agents in the physical world.

[197] TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction

Zewei Zhou,Seth Z. Zhao,Tianhui Cai,Zhiyu Huang,Bolei Zhou,Jiaqi Ma

Main category: cs.CV

TL;DR: TurboTrain is an efficient multi-agent training framework that simplifies the training pipeline and improves performance.

Details

Motivation: End-to-end training of multi-agent systems can improve multi-task performance, but it is difficult and requires extensive manual design and monitoring. Method: The TurboTrain training framework combines a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning with a balanced multi-task learning strategy based on gradient conflict suppression. Result: Evaluated on the real-world cooperative driving dataset V2XPnP-Seq, TurboTrain further improves the performance of state-of-the-art multi-agent perception and prediction models. Conclusion: TurboTrain effectively improves multi-agent perception and prediction models while reducing training time and complexity. Abstract: End-to-end training of multi-agent systems offers significant advantages in improving multi-task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. TurboTrain comprises two key components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi-task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi-stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real-world cooperative driving dataset, V2XPnP-Seq, and demonstrate that it further improves the performance of state-of-the-art multi-agent perception and prediction models. Our results highlight that pretraining effectively captures spatiotemporal multi-agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi-task learning strategy enhances detection and prediction.
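
The abstract does not specify how gradient conflicts are suppressed. One well-known technique in this family is projection-based gradient surgery (in the style of PCGrad), sketched below as a stand-in and not as TurboTrain's actual strategy. The function names and the two-task toy gradients are assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def suppress_conflict(g_task, g_other):
    """Remove from g_task the component that opposes g_other.

    If the two task gradients do not conflict (non-negative dot product),
    g_task is returned unchanged. This is a PCGrad-style projection used
    here only for illustration.
    """
    d = dot(g_task, g_other)
    if d >= 0:
        return list(g_task)
    scale = d / dot(g_other, g_other)
    return [a - scale * b for a, b in zip(g_task, g_other)]


# Toy gradients for two tasks (e.g. detection and prediction) that conflict.
g_detection = [1.0, 0.0]
g_prediction = [-1.0, 1.0]
adjusted = suppress_conflict(g_detection, g_prediction)
```

After projection the adjusted detection gradient is orthogonal to the prediction gradient, so a step along it no longer works directly against the other task.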

[198] BEVCon: Advancing Bird's Eye View Perception with Contrastive Learning

Ziyang Leng,Jiawei Yang,Zhicheng Ren,Bolei Zhou

Main category: cs.CV

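TL;DR: BEVCon is a contrastive learning framework for BEV perception that significantly improves 3D object detection by strengthening feature representations.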

Details

Motivation: Prior work has focused mainly on improving BEV encoders and task-specific heads, leaving the potential of representation learning in BEV models underexplored. Method: Designs an instance feature contrast module and a perspective view contrast module, applying dense contrastive learning on top of detection losses. Result: Experiments on the nuScenes dataset show that BEVCon improves mAP by up to 2.4% over state-of-the-art baselines. Conclusion: BEVCon's contrastive learning modules effectively improve feature representations for BEV perception, highlighting the importance of representation learning in this area. Abstract: We present BEVCon, a simple yet effective contrastive learning framework designed to improve Bird's Eye View (BEV) perception in autonomous driving. BEV perception offers a top-down-view representation of the surrounding environment, making it crucial for 3D object detection, segmentation, and trajectory prediction tasks. While prior work has primarily focused on enhancing BEV encoders and task-specific heads, we address the underexplored potential of representation learning in BEV models. BEVCon introduces two contrastive learning modules: an instance feature contrast module for refining BEV features and a perspective view contrast module that enhances the image backbone. The dense contrastive learning designed on top of detection losses leads to improved feature representations across both the BEV encoder and the backbone. Extensive experiments on the nuScenes dataset demonstrate that BEVCon achieves consistent performance gains, achieving up to +2.4% mAP improvement over state-of-the-art baselines. Our results highlight the critical role of representation learning in BEV perception and offer a complementary avenue to conventional task-specific optimizations.
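
The abstract does not state which contrastive objective BEVCon uses. For readers unfamiliar with the idea, the sketch below shows InfoNCE, a commonly used contrastive loss, as an assumed stand-in: an anchor feature is pulled toward a positive feature and pushed away from negatives. The function name `info_nce`, the temperature value, and the toy 2-D features are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity; assumes both vectors are non-zero."""
    num = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return num / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax score of the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[0]


# Toy features: the positive matches the anchor, the negatives do not.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Same features, but the "positive" now points away from the anchor.
loss_misaligned = info_nce([1.0, 0.0], [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when the positive is the most similar candidate and grows when a negative is more similar than the positive, which is the pressure that shapes the learned features.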

[199] Occupancy Learning with Spatiotemporal Memory

Ziyang Leng,Jiawei Yang,Wenlong Yi,Bolei Zhou

Main category: cs.CV

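TL;DR: ST-Occ is a scene-level occupancy representation learning framework for 3D occupancy prediction that introduces a spatiotemporal memory and a memory attention mechanism to improve spatiotemporal representation and prediction accuracy.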

Details

Motivation: 3D occupancy is a promising perception representation for autonomous driving because it models the surrounding environment at a fine-grained scale, but efficiently aggregating 3D occupancy across multiple input frames remains challenging. Method: ST-Occ has two core designs: a spatiotemporal memory that captures historical information and stores it efficiently, and a memory attention mechanism that conditions the current occupancy representation on the spatiotemporal memory with a model of uncertainty and dynamic awareness. Result: Experiments show that ST-Occ significantly improves the spatiotemporal representation for 3D occupancy prediction, outperforming state-of-the-art methods by 3 mIoU and reducing temporal inconsistency by 29%. Conclusion: ST-Occ shows strong spatiotemporal representation capability for 3D occupancy prediction, improving both accuracy and temporal consistency by exploiting the temporal dependency between multi-frame inputs. Abstract: 3D occupancy becomes a promising perception representation for autonomous driving to model the surrounding environment at a fine-grained scale. However, it remains challenging to efficiently aggregate 3D occupancy over time across multiple input frames due to the high processing cost and the uncertainty and dynamics of voxels. To address this issue, we propose ST-Occ, a scene-level occupancy representation learning framework that effectively learns the spatiotemporal feature with temporal consistency. ST-Occ consists of two core designs: a spatiotemporal memory that captures comprehensive historical information and stores it efficiently through a scene-level representation and a memory attention that conditions the current occupancy representation on the spatiotemporal memory with a model of uncertainty and dynamic awareness. Our method significantly enhances the spatiotemporal representation learned for 3D occupancy prediction tasks by exploiting the temporal dependency between multi-frame inputs. Experiments show that our approach outperforms the state-of-the-art methods by a margin of 3 mIoU and reduces the temporal inconsistency by 29%.
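
The reported gain is in mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch over flattened per-voxel class labels. The function name `mean_iou` and the toy labels are hypothetical, and real occupancy benchmarks may differ in details this sketch ignores (for example, which classes are excluded or how counts are accumulated across a dataset).

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        pairs = list(zip(pred, target))
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union > 0:
            # Skip classes absent from both, so they do not skew the mean.
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: four voxels, two classes (0 = free, 1 = occupied).
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. A "3 mIoU" margin conventionally means three points on this score expressed as a percentage.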

Table of Contents

cs.CL [Back]

[1] How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion

[2] FeynTune: Large Language Models for High-Energy Theory

[3] Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering

[4] Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

[5] WINELL: Wikipedia Never-Ending Updating with LLM Agents

[6] GanitBench: A bi-lingual benchmark for evaluating mathematical reasoning in Vision Language Models

[7] AttnTrace: Attention-based Context Traceback for Long-Context LLMs

[8] Majority Bit-Aware Watermarking For Large Language Models

[9] Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models

[10] An Entity Linking Agent for Question Answering

[11] Sotopia-RL: Reward Design for Social Intelligence

[12] CoAct-1: Computer-using Agents with Coding as Actions

[13] CAP-LLM: Context-Augmented Personalized Large Language Models for News Headline Generation

[14] Data and AI governance: Promoting equity, ethics, and fairness in large language models

[15] Confidence-Weighted Token Set Cover for Early Hypothesis Pruning in Self-Consistency

[16] Are Today's LLMs Ready to Explain Well-Being Concepts?

[17] Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models

[18] HarmonyGuard: Toward Safety and Utility in Web Agents via Adaptive Policy Enhancement and Dual-Objective Optimization

[19] Step More: Going Beyond Single Backpropagation in Meta Learning Based Model Editing

[20] ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

[21] Large Reasoning Models Are Autonomous Jailbreak Agents

[22] DTPA: Dynamic Token-level Prefix Augmentation for Controllable Text Generation

[23] PAIRS: Parametric-Verified Adaptive Information Retrieval and Selection for Efficient RAG

[24] Efficient Strategy for Improving Large Language Model (LLM) Capabilities

[25] ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

[26] GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning

[27] Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks

[28] Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

[29] The State Of TTS: A Case Study with Human Fooling Rates

[30] Hacking Hallucinations of MLLMs with Causal Sufficiency and Necessity

[31] Characterizing Deep Research: A Benchmark and Formal Definition

[32] Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

[33] Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts

[34] ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

[35] Hierarchical Text Classification Using Black Box Large Language Models

[36] DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting

[37] TalkDep: Clinically Grounded LLM Personas for Conversation-Centric Depression Screening

[38] KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs

[39] ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

[40] A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models

[41] Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

[42] Modelling and Classifying the Components of a Literature Review

[43] GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

[44] Chain of Questions: Guiding Multimodal Curiosity in Language Models

[45] AIC CTU@FEVER 8: On-premise fact checking through long context RAG

[46] Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky

[47] Why are LLMs' abilities emergent?

[48] What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems

[49] Dialogue Response Prefetching Based on Semantic Similarity and Prediction Confidence of Language Model

[50] Evaluating, Synthesizing, and Enhancing for Customer Support Conversation

[51] StepFun-Formalizer: Unlocking the Autoformalization Potential of LLMs through Knowledge-Reasoning Fusion

[52] Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI

[53] CALE : Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation

[54] StyliTruth : Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering

[55] Unveiling the Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning

[56] Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration

[57] Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

[58] TURA: Tool-Augmented Unified Retrieval Agent for AI Search

[59] Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider

[60] P-Aligner: Enabling Pre-Alignment of Language Models via Principled Instruction Synthesis

[61] IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards

[62] Can NLP Tackle Hate Speech in the Real World? Stakeholder-Informed Feedback and Survey on Counterspeech

[63] Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

[64] Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management

[65] GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples Replay

[66] FaST: Feature-aware Sampling and Tuning for Personalized Preference Alignment with Limited Data

[67] Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis

cs.CV [Back]

[68] Text2VR: Automated instruction Generation in Virtual Reality using Large language Models for Assembly Task

[69] Outlier Detection Algorithm for Circle Fitting

[70] Enhancing Diameter Measurement Accuracy in Machine Vision Applications

[71] Multimodal Video Emotion Recognition with Reliable Reasoning Priors

[72] From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

[73] A Large Language Model Powered Integrated Circuit Footprint Geometry Understanding

[74] TIR-Diffusion: Diffusion-based Thermal Infrared Image Denoising via Latent and Wavelet Domain Optimization

[75] What is Beneath Misogyny: Misogynous Memes Classification and Explanation

[76] StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization

[77] Fusion of Pervasive RF Data with Spatial Images via Vision Transformers for Enhanced Mapping in Smart Cities