cs.CL

[1] A2HCoder: An LLM-Driven Coding Agent for Hierarchical Algorithm-to-HDL Translation

Jie Lei,Ruofan Jia,J. Andrew Zhang,Hao Zhang

Main category: cs.CL

TL;DR: A2HCoder is a hierarchical algorithm-to-HDL coding agent that effectively bridges the gap between algorithm design and hardware implementation, enhancing deployment efficiency and reliability in wireless communication systems through a structured translation process powered by LLMs.

Details Motivation: The increasing demand for efficient algorithm-to-hardware deployment in wireless communication systems has highlighted a significant gap between algorithm design and hardware implementation, primarily due to mismatches between high-level programming languages and hardware description languages. Method: A2HCoder uses a hierarchical framework powered by large language models (LLMs) to translate algorithms into hardware descriptions. It decomposes algorithms horizontally into functional blocks and employs step-by-step translation vertically, leveraging external toolchains for debugging and synthesis. Result: A2HCoder demonstrates practicality and reliability in a real-world 5G deployment case, significantly mitigating hallucinations in LLM-generated code and ensuring hardware-level correctness. Conclusion: A2HCoder successfully bridges the gap between algorithm design and hardware implementation in wireless communication systems, enhancing deployment efficiency and reliability. Abstract: In wireless communication systems, stringent requirements such as ultra-low latency and power consumption have significantly increased the demand for efficient algorithm-to-hardware deployment. However, a persistent and substantial gap remains between algorithm design and hardware implementation. Bridging this gap traditionally requires extensive domain expertise and time-consuming manual development, due to fundamental mismatches between high-level programming languages like MATLAB and hardware description languages (HDLs) such as Verilog, in terms of memory access patterns, data processing manners, and datatype representations. To address this challenge, we propose A2HCoder: a Hierarchical Algorithm-to-HDL Coding Agent, powered by large language models (LLMs), designed to enable agile and reliable algorithm-to-hardware translation. 
A2HCoder introduces a hierarchical framework that enhances both robustness and interpretability while suppressing common hallucination issues in LLM-generated code. In the horizontal dimension, A2HCoder decomposes complex algorithms into modular functional blocks, simplifying code generation and improving consistency. In the vertical dimension, instead of relying on end-to-end generation, A2HCoder performs step-by-step, fine-grained translation, leveraging external toolchains such as MATLAB and Vitis HLS for debugging and circuit-level synthesis. This structured process significantly mitigates hallucinations and ensures hardware-level correctness. We validate A2HCoder through a real-world deployment case in the 5G wireless communication domain, demonstrating its practicality, reliability, and deployment efficiency.

[2] PersonaTwin: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins

Sihan Chen,John P. Lalor,Yi Yang,Ahmed Abbasi

Main category: cs.CL

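TL;DR: This paper proposes PersonaTwin, a framework that applies multi-tier prompt conditioning over several kinds of user data to improve the accuracy and fairness of large language models in personalized user simulation.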

Details Motivation: Large language models (LLMs) show promise for user modeling and for approximating human behavior, but they often fail to capture the multidimensional nuances of individual users. Method: The paper introduces the PersonaTwin framework, which integrates demographic, behavioral, and psychometric data, benchmarks it on a healthcare dataset of more than 8,500 individuals, and evaluates it with text similarity metrics and demographic parity assessments. Result: Experiments show that PersonaTwin reaches simulation fidelity on par with oracle settings, and that downstream models trained on persona-twins approximate models trained on individuals in prediction and fairness metrics. Conclusion: PersonaTwin is a multi-tier prompt conditioning framework that generates realistic user simulations, offering a powerful tool for personalized digital user modeling and behavior analysis. Abstract: While large language models (LLMs) afford new possibilities for user modeling and approximation of human behaviors, they often fail to capture the multidimensional nuances of individual users. In this work, we introduce PersonaTwin, a multi-tier prompt conditioning framework that builds adaptive digital twins by integrating demographic, behavioral, and psychometric data. Using a comprehensive data set in the healthcare context of more than 8,500 individuals, we systematically benchmark PersonaTwin against standard LLM outputs, and our rigorous evaluation unites state-of-the-art text similarity metrics with dedicated demographic parity assessments, ensuring that generated responses remain accurate and unbiased. Experimental results show that our framework produces simulation fidelity on par with oracle settings. Moreover, downstream models trained on persona-twins approximate models trained on individuals in terms of prediction and fairness metrics across both GPT-4o-based and Llama-based models. Together, these findings underscore the potential for LLM digital twin-based approaches in producing realistic and emotionally nuanced user simulations, offering a powerful tool for personalized digital user modeling and behavior analysis.
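
The demographic parity assessment mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how parity is computed, so the function below assumes one common definition (the gap between the highest and lowest positive-prediction rates across groups). The function name, group labels, and predictions are hypothetical illustration data, not taken from the paper.

```python
from collections import defaultdict


def demographic_parity_gap(predictions, groups):
    """Return (gap, rates), where gap is the max minus min positive-prediction rate."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    # Positive-prediction rate per demographic group.
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Hypothetical binary predictions for two made-up groups "a" and "b".
preds = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap, rates = demographic_parity_gap(preds, groups)
print(rates, gap)
```

A gap near zero means the groups receive positive predictions at similar rates. Other fairness definitions (for example, equalized odds) would need the true labels as well.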

[3] gpt-oss-120b & gpt-oss-20b Model Card

OpenAI,Sandhini Agarwal,Lama Ahmad,Jason Ai,Sam Altman,Andy Applebaum,Edwin Arbus,Rahul K. Arora,Yu Bai,Bowen Baker,Haiming Bao,Boaz Barak,Ally Bennett,Tyler Bertao,Nivedita Brett,Eugene Brevdo,Greg Brockman,Sebastien Bubeck,Che Chang,Kai Chen,Mark Chen,Enoch Cheung,Aidan Clark,Dan Cook,Marat Dukhan,Casey Dvorak,Kevin Fives,Vlad Fomenko,Timur Garipov,Kristian Georgiev,Mia Glaese,Tarun Gogineni,Adam Goucher,Lukas Gross,Katia Gil Guzman,John Hallman,Jackie Hehir,Johannes Heidecke,Alec Helyar,Haitang Hu,Romain Huet,Jacob Huh,Saachi Jain,Zach Johnson,Chris Koch,Irina Kofman,Dominik Kundel,Jason Kwon,Volodymyr Kyrylov,Elaine Ya Le,Guillaume Leclerc,James Park Lennon,Scott Lessans,Mario Lezcano-Casado,Yuanzhi Li,Zhuohan Li,Ji Lin,Jordan Liss,Lily,Liu,Jiancheng Liu,Kevin Lu,Chris Lu,Zoran Martinovic,Lindsay McCallum,Josh McGrath,Scott McKinney,Aidan McLaughlin,Song Mei,Steve Mostovoy,Tong Mu,Gideon Myles,Alexander Neitz,Alex Nichol,Jakub Pachocki,Alex Paino,Dana Palmie,Ashley Pantuliano,Giambattista Parascandolo,Jongsoo Park,Leher Pathak,Carolina Paz,Ludovic Peran,Dmitry Pimenov,Michelle Pokrass,Elizabeth Proehl,Huida Qiu,Gaby Raila,Filippo Raso,Hongyu Ren,Kimmy Richardson,David Robinson,Bob Rotsted,Hadi Salman,Suvansh Sanjeev,Max Schwarzer,D. Sculley,Harshit Sikchi,Kendal Simon,Karan Singhal,Yang Song,Dane Stuckey,Zhiqing Sun,Philippe Tillet,Sam Toizer,Foivos Tsimpourlas,Nikhil Vyas,Eric Wallace,Xin Wang,Miles Wang,Olivia Watkins,Kevin Weil,Amy Wendling,Kevin Whinnery,Cedric Whitney,Hannah Wong,Lin Yang,Yu Yang,Michihiro Yasunaga,Kristen Ying,Wojciech Zaremba,Wenting Zhan,Cyril Zhang,Brian Zhang,Eddie Zhang,Shengjia Zhao

Main category: cs.CL

TL;DR: This paper introduces two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, that offer high accuracy and low inference cost through an efficient mixture-of-expert transformer architecture and advanced training techniques, with the goal of enabling broad use and research in the field.

Details Motivation: The motivation behind this paper is to push the frontier of accuracy and inference cost in reasoning models while enhancing agentic capabilities such as deep research browsing, Python tool use, and support for developer-provided functions. Method: The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. Result: The models achieve strong results on benchmarks ranging from mathematics, coding, and safety, while maintaining clear instruction following and role delineation through a rendered chat format. Conclusion: The paper concludes by emphasizing that the developed models achieve strong results on various benchmarks and are released under an Apache 2.0 license to promote broad use and further research. Abstract: We present gpt-oss-120b and gpt-oss-20b, two open-weight reasoning models that push the frontier of accuracy and inference cost. The models use an efficient mixture-of-expert transformer architecture and are trained using large-scale distillation and reinforcement learning. We optimize the models to have strong agentic capabilities (deep research browsing, python tool use, and support for developer-provided functions), all while using a rendered chat format that enables clear instruction following and role delineation. Both models achieve strong results on benchmarks ranging from mathematics, coding, and safety. We release the model weights, inference implementations, tool environments, and tokenizers under an Apache 2.0 license to enable broad use and further research.

[4] Modeling and Detecting Company Risks from News: A Case Study in Bloomberg News

Jiaxin Pei,Soumya Vadlamannati,Liang-Kang Huang,Daniel Preotiuc-Pietro,Xinyu Hua

Main category: cs.CL

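TL;DR: This study builds a computational framework that identifies company risk factors from news articles and finds that fine-tuned pre-trained language models outperform zero-shot and few-shot prompted LLMs on this task.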

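Details Motivation: Identifying risks associated with a company is important to investors and to the health of the overall financial market. Method: The study builds a computational framework for automatically extracting company risk factors from news articles and proposes a new schema comprising seven distinct aspects. The authors sample and annotate 744 news articles and benchmark various machine learning models. Result: Zero-shot and few-shot prompting of state-of-the-art LLMs (e.g., LLaMA-2) achieves only moderate to low performance in identifying risk factors, while fine-tuned pre-trained language models perform better on most risk factors. Applying the model to over 277,000 Bloomberg news articles shows that identifying risk factors from news can provide extensive insight into the operations of companies and industries. Conclusion: Despite the major progress of large language models across NLP tasks, zero-shot and few-shot prompted LLMs reach only moderate to low performance in identifying risk factors, whereas fine-tuned pre-trained language models perform better on most risk factors. Abstract: Identifying risks associated with a company is important to investors and the well-being of the overall financial market. In this study, we build a computational framework to automatically extract company risk factors from news articles. Our newly proposed schema comprises seven distinct aspects, such as supply chain, regulations, and competitions. We sample and annotate 744 news articles and benchmark various machine learning models. While large language models have achieved huge progress in various types of NLP tasks, our experiment shows that zero-shot and few-shot prompting state-of-the-art LLMs (e.g. LLaMA-2) can only achieve moderate to low performances in identifying risk factors. And fine-tuned pre-trained language models are performing better on most of the risk factors. Using this model, we analyze over 277K Bloomberg news articles and demonstrate that identifying risk factors from news could provide extensive insight into the operations of companies and industries.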

[5] Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules

Nasim Shirvani-Mahdavi,Chengkai Li

Main category: cs.CL

TL;DR: This paper introduces Rule2Text, a framework that uses large language models to generate understandable explanations for logical rules in knowledge graphs, significantly improving their usability.

Details Motivation: The motivation is to enhance the interpretability of complex logical rules in KGs, which are often difficult for humans to understand due to their complexity and inconsistent labeling conventions. Method: The authors introduce Rule2Text, a framework using LLMs to generate explanations for logical rules in KGs. They evaluate multiple LLMs and prompting strategies, develop an LLM-as-a-judge framework for scalability, and fine-tune the Zephyr model using high-quality datasets constructed with Gemini 2.0 Flash and human feedback. Result: The proposed framework achieves significant improvements in explanation quality, especially in domain-specific datasets, and includes a type inference module for KGs without explicit type information. Conclusion: This work concludes that leveraging LLMs with the Rule2Text framework significantly improves the accessibility and usability of KGs through high-quality natural language explanations of mined logical rules. Abstract: Knowledge graphs (KGs) can be enhanced through rule mining; however, the resulting logical rules are often difficult for humans to interpret due to their inherent complexity and the idiosyncratic labeling conventions of individual KGs. This work presents Rule2Text, a comprehensive framework that leverages large language models (LLMs) to generate natural language explanations for mined logical rules, thereby improving KG accessibility and usability. We conduct extensive experiments using multiple datasets, including Freebase variants (FB-CVT-REV, FB+CVT-REV, and FB15k-237) as well as the ogbl-biokg dataset, with rules mined using AMIE 3.5.1. We systematically evaluate several LLMs across a comprehensive range of prompting strategies, including zero-shot, few-shot, variable type incorporation, and Chain-of-Thought reasoning. To systematically assess models' performance, we conduct a human evaluation of generated explanations on correctness and clarity. 
To address evaluation scalability, we develop and validate an LLM-as-a-judge framework that demonstrates strong agreement with human evaluators. Leveraging the best-performing model (Gemini 2.0 Flash), LLM judge, and human-in-the-loop feedback, we construct high-quality ground truth datasets, which we use to fine-tune the open-source Zephyr model. Our results demonstrate significant improvements in explanation quality after fine-tuning, with particularly strong gains in the domain-specific dataset. Additionally, we integrate a type inference module to support KGs lacking explicit type information. All code and data are publicly available at https://github.com/idirlab/KGRule2NL.

[6] Improving Text Style Transfer using Masked Diffusion Language Models with Inference-time Scaling

Tejomay Kishor Padole,Suyash P Awate,Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: This paper introduces a verifier-based inference-time scaling technique for masked diffusion language models (MDMs), showing improved generation quality over traditional autoregressive models.

Details Motivation: To enhance the generation quality of masked diffusion language models (MDMs) by leveraging inference-time scaling through a verifier-based approach. Method: The paper proposes a verifier-based inference-time scaling method that improves the candidate generation during the MDM denoising process. Result: Experiments showed that MDMs perform well on text-style transfer tasks and that using a soft-value-based verifier setup significantly improves generation quality. Conclusion: MDMs prove to be a better alternative to autoregressive language models, especially with the proposed verifier-based inference-time scaling method enhancing generation quality. Abstract: Masked diffusion language models (MDMs) have recently gained traction as a viable generative framework for natural language. This can be attributed to its scalability and ease of training compared to other diffusion model paradigms for discrete data, establishing itself as the state-of-the-art non-autoregressive generator for discrete data. Diffusion models, in general, have shown excellent ability to improve the generation quality by leveraging inference-time scaling either by increasing the number of denoising steps or by using external verifiers on top of the outputs of each step to guide the generation. In this work, we propose a verifier-based inference-time scaling method that aids in finding a better candidate generation during the denoising process of the MDM. Our experiments demonstrate the application of MDMs for standard text-style transfer tasks and establish MDMs as a better alternative to autoregressive language models. Additionally, we show that a simple soft-value-based verifier setup for MDMs using off-the-shelf pre-trained embedding models leads to significant gains in generation quality even when used on top of typical classifier-free guidance setups in the existing literature.
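
The idea of using a verifier to pick a better candidate at each denoising step can be sketched with a toy example. This is only an illustration under strong assumptions: `propose` stands in for one denoising step of a real MDM, and `verifier` is a made-up scoring function, whereas the abstract describes a soft-value verifier built on pre-trained embedding models. All names and the tiny vocabulary are hypothetical.

```python
import random

MASK = "<mask>"


def propose(seq, vocab, rng):
    """Toy stand-in for one MDM denoising step: fill one masked slot at random."""
    masked = [i for i, tok in enumerate(seq) if tok == MASK]
    out = list(seq)
    out[rng.choice(masked)] = rng.choice(vocab)
    return out


def verifier(seq, preferred):
    """Toy verifier: fraction of positions holding a preferred (target-style) token."""
    return sum(tok in preferred for tok in seq) / len(seq)


def denoise_with_verifier(length, vocab, preferred, n_candidates=4, seed=0):
    """At every step, sample several candidates and keep the one the verifier scores highest."""
    rng = random.Random(seed)
    seq = [MASK] * length
    while MASK in seq:
        candidates = [propose(seq, vocab, rng) for _ in range(n_candidates)]
        seq = max(candidates, key=lambda c: verifier(c, preferred))
    return seq


vocab = ["thou", "you", "art", "are"]
result = denoise_with_verifier(6, vocab, preferred={"thou", "art"})
print(result)
```

Setting `n_candidates=1` reduces the loop to unguided sampling, so the extra candidates are where the additional inference-time compute is spent.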

[7] SproutBench: A Benchmark for Safe and Ethical Large Language Models for Youth

Wenpeng Xing,Lanyi Wei,Haixiao Hu,Rongchang Li,Mohan Li,Changting Lin,Meng Han

Main category: cs.CL

TL;DR: This paper identifies critical gaps in current AI safety frameworks for protecting minors, introduces a new evaluation tool called SproutBench to assess risks in LLMs, and provides guidelines for designing AI that better safeguards children and adolescents.

Details Motivation: The motivation stems from the increasing use of LLMs in applications targeting children and adolescents, which highlights the inadequacy of current AI safety frameworks that are primarily designed for adults and overlook the developmental vulnerabilities of minors. Method: The researchers developed SproutBench, an evaluation suite containing 1,283 developmentally grounded adversarial prompts designed to assess risks specific to children and adolescents. They conducted a rigorous empirical evaluation across 47 diverse LLMs. Result: The evaluation revealed significant safety vulnerabilities in LLMs. The study also found robust inter-dimensional correlations—such as between Safety and Risk Prevention—and a notable inverse relationship between Interactivity and Age Appropriateness. Conclusion: The study concludes that existing AI safety frameworks are insufficient for protecting minors from the unique risks posed by large language models (LLMs), and it offers practical guidelines for advancing child-centric AI design and deployment. Abstract: The rapid proliferation of large language models (LLMs) in applications targeting children and adolescents necessitates a fundamental reassessment of prevailing AI safety frameworks, which are largely tailored to adult users and neglect the distinct developmental vulnerabilities of minors. This paper highlights key deficiencies in existing LLM safety benchmarks, including their inadequate coverage of age-specific cognitive, emotional, and social risks spanning early childhood (ages 0--6), middle childhood (7--12), and adolescence (13--18). To bridge these gaps, we introduce SproutBench, an innovative evaluation suite comprising 1,283 developmentally grounded adversarial prompts designed to probe risks such as emotional dependency, privacy violations, and imitation of hazardous behaviors. 
Through rigorous empirical evaluation of 47 diverse LLMs, we uncover substantial safety vulnerabilities, corroborated by robust inter-dimensional correlations (e.g., between Safety and Risk Prevention) and a notable inverse relationship between Interactivity and Age Appropriateness. These insights yield practical guidelines for advancing child-centric AI design and deployment.

[8] Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics

Carter Blum,Katja Filipova,Ann Yuan,Asma Ghandeharioun,Julian Zimmert,Fred Zhang,Jessica Hoffmann,Tal Linzen,Martin Wattenberg,Lucas Dixon,Mor Geva

Main category: cs.CL

TL;DR: This work investigates cross-lingual knowledge transfer challenges in LLMs by training small models on synthetic datasets, revealing that unified representations are essential for transfer success.

Details Motivation: LLMs struggle with cross-lingual knowledge transfer and tend to hallucinate when asked about facts expressed in different languages. This work aims to study the causes and dynamics of this phenomenon. Method: The study trains small Transformer models from scratch on synthetic multilingual datasets to analyze the development of separate or unified representations of facts across languages. Result: The study identifies a learning phase where models develop either separate or unified representations of facts across languages, and demonstrates that unification is crucial for effective cross-lingual transfer. Conclusion: This work concludes that unification of representations across languages is essential for cross-lingual transfer in LLMs, and controlled settings can shed light on pre-training dynamics, suggesting new directions for improvement. Abstract: Large language models (LLMs) struggle with cross-lingual knowledge transfer: they hallucinate when asked in one language about facts expressed in a different language during training. This work introduces a controlled setting to study the causes and dynamics of this phenomenon by training small Transformer models from scratch on synthetic multilingual datasets. We identify a learning phase wherein a model develops either separate or unified representations of the same facts across languages, and show that unification is essential for cross-lingual transfer. We also show that the degree of unification depends on mutual information between facts and training data language, and on how easy it is to extract that language. Based on these insights, we develop methods to modulate the level of cross-lingual transfer by manipulating data distribution and tokenization, and we introduce metrics and visualizations to formally characterize their effects on unification. 
Our work shows how controlled settings can shed light on pre-training dynamics and suggests new directions for improving cross-lingual transfer in LLMs.

[9] Hell or High Water: Evaluating Agentic Recovery from External Failures

Andrew Wang,Sophia Hager,Adi Asija,Daniel Khashabi,Nicholas Andrews

Main category: cs.CL

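TL;DR: This paper studies how well language model agents plan in the face of environment feedback and external failures, and finds that they struggle substantially to adapt to feedback and to execute backup plans.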

Details Motivation: As language model agents are applied to increasingly complex real-world problems, they need to formulate plans across large search spaces. When those plans fail for reasons beyond the agent's control, how the agent searches for alternative ways to reach its goal becomes an important question. Method: The authors design a specialized agentic planning benchmark in which each planning problem is solved through combinations of function calls. The agent searches for relevant functions among more than four thousand possibilities and observes environment feedback. Result: Language agents struggle to formulate and execute backup plans in response to environment feedback. Even when the search space is artificially restricted, state-of-the-art models struggle to adapt to feedback and often fail to pursue alternative courses of action. Conclusion: Language agents struggle to formulate and execute backup plans when faced with environment feedback and external failures. Although state-of-the-art models can often identify the correct function to use in the right context, they have difficulty adapting to feedback and pursuing alternative courses of action. Abstract: As language model agents are applied to real world problems of increasing complexity, they will be expected to formulate plans across large search spaces. If those plans fail for reasons beyond their control, how well do language agents search for alternative ways to achieve their goals? We devise a specialized agentic planning benchmark to study this question. Each planning problem is solved via combinations of function calls. The agent searches for relevant functions from a set of over four thousand possibilities, and observes environmental feedback in the form of function outputs or error messages. Our benchmark confronts the agent with external failures in its workflow, such as functions that suddenly become unavailable. At the same time, even with the introduction of these failures, we guarantee that the task remains solvable. Ideally, an agent's performance on the planning task should not be affected by the presence of external failures. Overall, we find that language agents struggle to formulate and execute backup plans in response to environment feedback. While state-of-the-art models are often able to identify the correct function to use in the right context, they struggle to adapt to feedback from the environment and often fail to pursue alternate courses of action, even when the search space is artificially restricted. We provide a systematic analysis of the failures of both open-source and commercial models, examining the effects of search space size, as well as the benefits of scaling model size in our setting. 
Our analysis identifies key challenges for current generative models as well as promising directions for future work.
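
The recovery behavior the benchmark probes, where a function suddenly becomes unavailable but the task stays solvable, can be pictured with a minimal sketch. Everything here is hypothetical: the exception type, the two weather functions, and the fixed fallback order are illustration only and are not part of the benchmark, where the agent must itself search for an alternative.

```python
class FunctionUnavailable(Exception):
    """Raised by a tool that has become unavailable (an external failure)."""


def call_with_fallbacks(candidates, *args):
    """Try each (name, function) pair in order; return the first successful (name, result)."""
    errors = []
    for name, fn in candidates:
        try:
            return name, fn(*args)
        except FunctionUnavailable as exc:
            # Keep the error message: this is the environment feedback to react to.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no alternative succeeded: " + "; ".join(errors))


def get_weather_v1(city):
    # Simulates a function that has suddenly become unavailable.
    raise FunctionUnavailable("endpoint retired")


def get_weather_v2(city):
    # An alternative that still solves the task.
    return f"sunny in {city}"


used, result = call_with_fallbacks(
    [("get_weather_v1", get_weather_v1), ("get_weather_v2", get_weather_v2)],
    "Paris",
)
print(used, result)
```

The hard part reported in the abstract is exactly what this sketch takes for granted: producing the list of alternatives after reading the error message.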

[10] BIPOLAR: Polarization-based granular framework for LLM bias evaluation

Martin Pavlíček,Tomáš Filip,Petr Sosík

Main category: cs.CL

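TL;DR: This study proposes a reusable, granular, and topic-agnostic framework for evaluating polarization-related biases in large language models and validates it on a synthetic dataset focused on the Russia-Ukraine war.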

Details Motivation: Although significant progress has been made in bias detection and mitigation techniques, certain challenges remain underexplored. This study proposes a reusable, granular, and topic-agnostic framework to evaluate polarization-related biases in large language models (LLMs). Method: The approach combines polarization-sensitive sentiment metrics with a synthetically generated, balanced dataset of conflict-related statements built from a predefined set of semantic categories. Result: Beyond an overall trend of more positive sentiment toward Ukraine, the framework enables fine-grained analysis that shows considerable variation between semantic categories and reveals divergent behavioral patterns among models. Conclusion: The framework supports automated dataset generation and fine-grained bias assessment, applies to a variety of polarization-driven scenarios and topics, and is orthogonal to many other bias-evaluation strategies. Abstract: Large language models (LLMs) are known to exhibit biases in downstream tasks, especially when dealing with sensitive topics such as political discourse, gender identity, ethnic relations, or national stereotypes. Although significant progress has been made in bias detection and mitigation techniques, certain challenges remain underexplored. This study proposes a reusable, granular, and topic-agnostic framework to evaluate polarisation-related biases in LLM (both open-source and closed-source). Our approach combines polarisation-sensitive sentiment metrics with a synthetically generated balanced dataset of conflict-related statements, using a predefined set of semantic categories. As a case study, we created a synthetic dataset that focusses on the Russia-Ukraine war, and we evaluated the bias in several LLMs: Llama-3, Mistral, GPT-4, Claude 3.5, and Gemini 1.0. Beyond aggregate bias scores, with a general trend for more positive sentiment toward Ukraine, the framework allowed fine-grained analysis with considerable variation between semantic categories, uncovering divergent behavioural patterns among models. Adaptation to prompt modifications showed further bias towards preconceived language and citizenship modification. Overall, the framework supports automated dataset generation and fine-grained bias assessment, is applicable to a variety of polarisation-driven scenarios and topics, and is orthogonal to many other bias-evaluation strategies.

[11] Approaching the Source of Symbol Grounding with Confluent Reductions of Abstract Meaning Representation Directed Graphs

Nicolas Goulet,Alexandre Blondin Massé,Moussa Abdendi

Main category: cs.CL

TL;DR: This paper explores the embedding of digital dictionaries into Abstract Meaning Representation graphs using large language models, followed by confluent reduction of these graphs and analysis of their properties in relation to the symbol grounding problem.

Details Motivation: The motivation is to explore how digital dictionaries can be integrated into Abstract Meaning Representation (AMR) and to examine the resulting graph properties in the context of the symbol grounding problem. Method: The study uses state-of-the-art pre-trained large language models to embed real digital dictionaries into AMR directed graphs, and then reduces these graphs in a confluent manner, ensuring transformations preserve circuit space. Result: The result is a method for embedding and reducing AMR digraphs while preserving their circuit space, with an analysis of how these reductions relate to the symbol grounding problem. Conclusion: The paper concludes by analyzing and discussing the properties of reduced digraphs in relation to the symbol grounding problem. Abstract: Abstract meaning representation (AMR) is a semantic formalism used to represent the meaning of sentences as directed acyclic graphs. In this paper, we describe how real digital dictionaries can be embedded into AMR directed graphs (digraphs), using state-of-the-art pre-trained large language models. Then, we reduce those graphs in a confluent manner, i.e. with transformations that preserve their circuit space. Finally, the properties of these reduced digraphs are analyzed and discussed in relation to the symbol grounding problem.

[12] Towards Reliable Multi-Agent Systems for Marketing Applications via Reflection, Memory, and Planning

Lorenzo Jaime Yu Flores,Junyi Shen,Xiaoyuan Gu

Main category: cs.CL

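TL;DR: This paper introduces RAMP, a multi-agent framework for audience curation in marketing. Using LLM planning and memory substantially improves accuracy, and iterative verification and reflection further improve performance.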

Details Motivation: Literature on the reliability of large language models in real-world applications remains limited, so the paper introduces a multi-agent framework for a marketing task, audience curation. Method: The paper introduces a multi-agent framework called RAMP that iteratively plans, calls tools, verifies the output, and generates suggestions for improvement. The model is also equipped with a long-term memory store, a knowledge base of client-specific facts and past queries. Result: With RAMP, accuracy increases by 28 percentage points on 88 evaluation queries. On more ambiguous queries, recall improves by roughly 20 percentage points with more verify/reflect iterations, and user satisfaction is higher. Conclusion: Using LLM planning and memory within the RAMP framework improves accuracy on marketing tasks, and the results offer practical insights for deploying reliable LLM-based systems in dynamic, industry-facing environments. Abstract: Recent advances in large language models (LLMs) enabled the development of AI agents that can plan and interact with tools to complete complex tasks. However, literature on their reliability in real-world applications remains limited. In this paper, we introduce a multi-agent framework for a marketing task: audience curation. To solve this, we introduce a framework called RAMP that iteratively plans, calls tools, verifies the output, and generates suggestions to improve the quality of the audience generated. Additionally, we equip the model with a long-term memory store, which is a knowledge base of client-specific facts and past queries. Overall, we demonstrate the use of LLM planning and memory, which increases accuracy by 28 percentage points on a set of 88 evaluation queries. Moreover, we show the impact of iterative verification and reflection on more ambiguous queries, showing progressively better recall (roughly +20 percentage points) with more verify/reflect iterations on a smaller challenge set, and higher user satisfaction. Our results provide practical insights for deploying reliable LLM-based systems in dynamic, industry-facing environments.
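
The plan, tool call, verify, and reflect cycle described in the abstract might look roughly like the sketch below. It is not the RAMP implementation: the control loop, the stub functions, the customer records, and the list used as a memory store are all hypothetical, and real planning and reflection steps would call an LLM rather than hard-coded rules.

```python
def run_with_reflection(query, plan, execute, verify, reflect, memory, max_iters=3):
    """Iterate plan -> execute -> verify; on failure, reflect and retry with the suggestion."""
    suggestion = None
    output = None
    for iteration in range(1, max_iters + 1):
        steps = plan(query, memory, suggestion)
        output = execute(steps)
        ok, issues = verify(query, output)
        if ok:
            memory.append((query, output))  # remember the verified result for later queries
            return output, iteration
        suggestion = reflect(issues)
    return output, max_iters


# Hypothetical customer table standing in for the tools an agent would call.
CUSTOMERS = [
    {"id": 1, "age": 34, "state": "NY"},
    {"id": 2, "age": 16, "state": "NY"},
    {"id": 3, "age": 41, "state": "CA"},
]


def plan(query, memory, suggestion):
    # The first plan forgets the age constraint; the reflection suggestion adds it.
    filters = [lambda c: c["state"] == query["state"]]
    if suggestion == "add_age_filter":
        filters.append(lambda c: c["age"] >= query["min_age"])
    return filters


def execute(filters):
    return [c["id"] for c in CUSTOMERS if all(f(c) for f in filters)]


def verify(query, ids):
    too_young = [c["id"] for c in CUSTOMERS if c["id"] in ids and c["age"] < query["min_age"]]
    return (not too_young, too_young)


def reflect(issues):
    return "add_age_filter"


memory = []
audience, iterations = run_with_reflection(
    {"state": "NY", "min_age": 18}, plan, execute, verify, reflect, memory
)
print(audience, iterations)
```

In this toy run the first plan returns an audience that fails verification, and the second iteration succeeds after the reflection step supplies a suggestion.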

[13] MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents

Tomer Wolfson,Harsh Trivedi,Mor Geva,Yoav Goldberg,Dan Roth,Tushar Khot,Ashish Sabharwal,Reut Tsarfaty

Main category: cs.CL

TL;DR: MoNaCo is a new benchmark with 1,315 complex questions designed to evaluate LLMs on real-world reasoning tasks, revealing significant performance gaps.

Details Motivation: Current LLM benchmarks lack natural and time-consuming questions that reflect real-world complexity, which MoNaCo aims to address. Method: Developed a decomposed annotation pipeline to collect and manually answer a large number of natural and complex questions, creating the MoNaCo benchmark. Result: Frontier LLMs achieved a maximum of 61.2% F1 score on MoNaCo, showing limitations in recall and issues with hallucinations. Conclusion: The MoNaCo benchmark highlights the limitations of current LLMs in handling complex, natural, and information-seeking questions and emphasizes the need for improved reasoning models. Abstract: Large language models (LLMs) are emerging as a go-to tool for querying information. However, current LLM benchmarks rarely feature natural questions that are both information-seeking as well as genuinely time-consuming for humans. To address this gap we introduce MoNaCo, a benchmark of 1,315 natural and complex questions that require dozens, and at times hundreds, of intermediate steps to solve -- far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer natural time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the need for reasoning models that better handle the complexity and sheer breadth of real-world information-seeking questions -- with MoNaCo providing an effective resource for tracking such progress. The MONACO benchmark, codebase, prompts and models predictions are publicly available at: https://tomerwolgithub.github.io/monaco
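
To make the link between low recall, hallucinations, and the F1 score concrete, here is a minimal sketch of a set-based F1 for list-style answers. The abstract does not describe the benchmark's scoring protocol, so treating answers as sets of strings is an assumption, and the example answers are invented.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 between predicted and gold answer items."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example: two correct items, one hallucinated item, two gold items missed.
gold = ["Austria", "Belgium", "Chile", "Denmark"]
predicted = ["Austria", "Belgium", "Atlantis"]
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)
```

A hallucinated item lowers precision and a missed item lowers recall, and F1 is the harmonic mean of the two, so it drops whenever either one is weak.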

[14] MobQA: A Benchmark Dataset for Semantic Understanding of Human Mobility Data through Question Answering

Hikaru Asano,Hiroki Ouchi,Akira Kasuga,Ryo Yonetani

Main category: cs.CL

TL;DR: The MobQA dataset evaluates how well LLMs understand human mobility data and finds that the models are limited in semantic reasoning.

Details Motivation: Existing models excel at predicting human mobility patterns but struggle to interpret the reasons behind those patterns and their semantic meaning. Method: The authors build MobQA, a benchmark dataset of 5,800 high-quality question-answer pairs covering three question types: factual retrieval, multiple-choice reasoning, and free-form explanation. Result: Major LLMs perform well on factual retrieval but show significant limitations on semantic reasoning and explanation question answering. Conclusion: The MobQA benchmark reveals significant limitations of LLMs in semantic reasoning and explanation question answering, alongside strong performance on factual retrieval. Abstract: This paper presents MobQA, a benchmark dataset designed to evaluate the semantic understanding capabilities of large language models (LLMs) for human mobility data through natural language question answering. While existing models excel at predicting human movement patterns, it remains unobvious how much they can interpret the underlying reasons or semantic meaning of those patterns. MobQA provides a comprehensive evaluation framework for LLMs to answer questions about diverse human GPS trajectories spanning daily to weekly granularities. It comprises 5,800 high-quality question-answer pairs across three complementary question types: factual retrieval (precise data extraction), multiple-choice reasoning (semantic inference), and free-form explanation (interpretive description), which all require spatial, temporal, and semantic reasoning. Our evaluation of major LLMs reveals strong performance on factual retrieval but significant limitations in semantic reasoning and explanation question answering, with trajectory length substantially impacting model effectiveness. These findings demonstrate the achievements and limitations of state-of-the-art LLMs for semantic mobility understanding. The MobQA dataset is available at https://github.com/CyberAgentAILab/mobqa.

[15] Overcoming Low-Resource Barriers in Tulu: Neural Models and Corpus Creation for Offensive Language Identification

Anusha M D,Deepthi Vikram,Bharathi Raja Chakravarthi,Parameshwar R Hegde

Main category: cs.CL

TL;DR: This study introduces the first benchmark dataset for offensive language identification in code-mixed Tulu social media content and evaluates deep learning models, with the BiGRU model achieving the best performance.

Details Motivation: Tulu, a low-resource Dravidian language, has limited computational resources despite its growing digital presence, and there is a need for offensive language identification in code-mixed social media content. Method: A benchmark dataset was created and annotated for Offensive Language Identification in Tulu. Deep learning models including GRU, LSTM, BiGRU, BiLSTM, CNN, attention-based variants, and transformers were evaluated. Result: The dataset consists of 3,845 YouTube comments with high inter-annotator agreement. The BiGRU model with self-attention achieved 82% accuracy and a 0.81 macro F1-score, while transformer models underperformed. Conclusion: The study concludes that the BiGRU model with self-attention performs best for offensive language identification in code-mixed Tulu content, and transformer models have limitations in low-resource, code-mixed contexts. Abstract: Tulu, a low-resource Dravidian language predominantly spoken in southern India, has limited computational resources despite its growing digital presence. This study presents the first benchmark dataset for Offensive Language Identification (OLI) in code-mixed Tulu social media content, collected from YouTube comments across various domains. The dataset, annotated with high inter-annotator agreement (Krippendorff's alpha = 0.984), includes 3,845 comments categorized into four classes: Not Offensive, Not Tulu, Offensive Untargeted, and Offensive Targeted. We evaluate a suite of deep learning models, including GRU, LSTM, BiGRU, BiLSTM, CNN, and attention-based variants, alongside transformer architectures (mBERT, XLM-RoBERTa). The BiGRU model with self-attention achieves the best performance with 82% accuracy and a 0.81 macro F1-score. Transformer models underperform, highlighting the limitations of multilingual pretraining in code-mixed, under-resourced contexts. This work lays the foundation for further NLP research in Tulu and similar low-resource, code-mixed languages.
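The macro F1-score reported above averages per-class F1 with equal weight per class, which matters for a four-class dataset where classes may be unevenly represented. The following is a minimal sketch of that metric in plain Python; the function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper or its data.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    f1_scores = []
    for label in labels:
        # Count true positives, false positives, and false negatives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with two of the dataset's class names (labels are made up).
y_true = ["Not Offensive", "Not Offensive", "Offensive Targeted", "Offensive Targeted"]
y_pred = ["Not Offensive", "Offensive Targeted", "Offensive Targeted", "Offensive Targeted"]
score = macro_f1(y_true, y_pred, ["Not Offensive", "Offensive Targeted"])
```

Because every class contributes equally, a model that ignores a rare class is penalized even if overall accuracy stays high.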

[16] Personalized Distractor Generation via MCTS-Guided Reasoning Reconstruction

Tao Wu,Jingyuan Chen,Wang Lin,Jian Zhan,Mengze Li,Kun Kuang,Fei Wu

Main category: cs.CL

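TL;DR: This paper studies how to use large language models to generate personalized distractors that better diagnose student misconceptions in educational assessment.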

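Details Motivation: Distractors play a critical role in educational assessment by diagnosing student misconceptions. Recent work has used large language models (LLMs) to generate shared, group-level distractors by learning common error patterns across large student populations. However, such distractors often fail to capture the diverse reasoning errors of individual students, which limits their diagnostic effectiveness. Method: The paper introduces the task of personalized distractor generation, which aims to generate distractors tailored to each student's past question-answering records so that they expose that student's specific reasoning errors. It proposes a training-free two-stage framework. In the first stage, Monte Carlo Tree Search (MCTS) recovers the student's reasoning trajectories from past incorrect answers to build a student-specific misconception prototype. In the second stage, this prototype guides a simulation of the student's reasoning on new questions, producing personalized distractors consistent with the student's recurring misconceptions. Result: Experiments show that the approach performs best at generating personalized distractors and also generalizes effectively to group-level settings, highlighting its robustness and adaptability. Conclusion: The method achieves the best performance in generating personalized distractors for 140 students and also generalizes effectively to group-level settings. Abstract: Distractors, incorrect but plausible answer choices in multiple-choice questions (MCQs), play a critical role in educational assessment by diagnosing student misconceptions. Recent work has leveraged large language models (LLMs) to generate shared, group-level distractors by learning common error patterns across large student populations. However, such distractors often fail to capture the diverse reasoning errors of individual students, limiting their diagnostic effectiveness. To address this limitation, we introduce the task of personalized distractor generation, which aims to generate tailored distractors based on individual misconceptions inferred from each student's past question-answering (QA) records, ensuring every student receives options that effectively expose their specific reasoning errors. While promising, this task is challenging because each student typically has only a few QA records, which often lack the student's underlying reasoning processes, making training-based group-level approaches infeasible. To overcome this, we propose a training-free two-stage framework. In the first stage, we construct a student-specific misconception prototype by applying Monte Carlo Tree Search (MCTS) to recover the student's reasoning trajectories from past incorrect answers. 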
In the second stage, this prototype guides the simulation of the student's reasoning on new questions, enabling the generation of personalized distractors that align with the student's recurring misconceptions. Experiments show that our approach achieves the best performance in generating plausible, personalized distractors for 140 students, and also effectively generalizes to group-level settings, highlighting its robustness and adaptability.

[17] Novel Parasitic Dual-Scale Modeling for Efficient and Accurate Multilingual Speech Translation

Chenyang Le,Yinfeng Xia,Huiyan Li,Manhong Wang,Yutao Sun,Xingyang Ma,Yanmin Qian

Main category: cs.CL

TL;DR: This paper proposes a Parasitic Dual-Scale Approach with a KVSPN module to improve the efficiency and performance of multilingual speech-to-text translation models.

Details Motivation: To address the challenge of balancing inference efficiency and performance in multilingual speech-to-text models, particularly for local deployment scenarios. Method: The approach combines an enhanced speculative sampling method with model compression and knowledge distillation techniques, integrating a novel KVSPN module into the Whisper Medium model. Result: The integration of KVSPN achieves a 40% speedup without BLEU score degradation and a 2.6x speedup over the original Whisper Medium with superior performance. Conclusion: The proposed Parasitic Dual-Scale Approach enhances multilingual speech-to-text translation performance while significantly improving inference efficiency. Abstract: Recent advancements in speech-to-text translation have led to the development of multilingual models capable of handling multiple language pairs simultaneously. However, these unified models often suffer from large parameter sizes, making it challenging to balance inference efficiency and performance, particularly in local deployment scenarios. We propose an innovative Parasitic Dual-Scale Approach, which combines an enhanced speculative sampling method with model compression and knowledge distillation techniques. Building on the Whisper Medium model, we enhance it for multilingual speech translation into whisperM2M, and integrate our novel KVSPN module, achieving state-of-the-art (SOTA) performance across six popular languages with improved inference efficiency. KVSPN enables a 40% speedup with no BLEU score degradation. Combined with distillation methods, it represents a 2.6x speedup over the original Whisper Medium with superior performance.

[18] E-CaTCH: Event-Centric Cross-Modal Attention with Temporal Consistency and Class-Imbalance Handling for Misinformation Detection

Ahmad Mousavi,Yeganeh Abdollahinejad,Roberto Corizzo,Nathalie Japkowicz,Zois Boukouvalas

Main category: cs.CL

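TL;DR: E-CaTCH is an interpretable and scalable framework for detecting multimodal misinformation on social media. It clusters posts, extracts and aligns textual and visual features, models temporal evolution with a trend-aware LSTM, and handles class imbalance, outperforming existing methods.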

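Details Motivation: Detecting multimodal misinformation on social media remains challenging because of inconsistencies between modalities, shifting temporal patterns, and class imbalance. Existing methods usually treat posts independently and fail to capture the event-level structure that connects them. Method: E-CaTCH first clusters posts into pseudo-events based on textual similarity and temporal proximity, then processes each event independently. Textual and visual features are extracted with pre-trained BERT and ResNet encoders, refined via intra-modal self-attention, and aligned through bidirectional cross-modal attention. A soft gating mechanism fuses these representations into contextualized, content-aware embeddings. To model temporal evolution, E-CaTCH segments events into overlapping time windows and encodes narrative progression with a trend-aware LSTM enhanced with semantic shift and momentum signals. Class imbalance is addressed through adaptive class weighting, temporal consistency regularization, and hard-example mining. Result: Experiments on the Fakeddit, IND, and COVID-19 MISINFOGRAPH datasets show that E-CaTCH outperforms state-of-the-art baselines. Cross-dataset evaluations further demonstrate its robustness, generalizability, and practical applicability across different misinformation scenarios. Conclusion: E-CaTCH is an effective multimodal misinformation detection framework that addresses key challenges of misinformation detection on social media by combining modality alignment, temporal modeling, and class-balancing strategies. Abstract: Detecting multimodal misinformation on social media remains challenging due to inconsistencies between modalities, changes in temporal patterns, and substantial class imbalance. Many existing methods treat posts independently and fail to capture the event-level structure that connects them across time and modality. We propose E-CaTCH, an interpretable and scalable framework for robustly detecting misinformation. If needed, E-CaTCH clusters posts into pseudo-events based on textual similarity and temporal proximity, then processes each event independently. Within each event, textual and visual features are extracted using pre-trained BERT and ResNet encoders, refined via intra-modal self-attention, and aligned through bidirectional cross-modal attention. A soft gating mechanism fuses these representations to form contextualized, content-aware embeddings of each post. To model temporal evolution, E-CaTCH segments events into overlapping time windows and uses a trend-aware LSTM, enhanced with semantic shift and momentum signals, to encode narrative progression over time. Classification is performed at the event level, enabling better alignment with real-world misinformation dynamics. To address class imbalance and promote stable learning, the model integrates adaptive class weighting, temporal consistency regularization, and hard-example mining. The total loss is aggregated across all events. 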
Extensive experiments on Fakeddit, IND, and COVID-19 MISINFOGRAPH demonstrate that E-CaTCH consistently outperforms state-of-the-art baselines. Cross-dataset evaluations further demonstrate its robustness, generalizability, and practical applicability across diverse misinformation scenarios.
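The abstract mentions segmenting each event into overlapping time windows before the temporal model sees it. As a rough illustration of what that step could look like, here is a minimal sketch that slices a list of post timestamps into fixed-length windows advanced by a smaller stride. The function name, the half-open window convention, and the `window`/`stride` parameters are assumptions for illustration; the paper's actual windowing scheme is not given in this digest.

```python
def overlapping_windows(timestamps, window, stride):
    """Split timestamps into windows [t, t + window) that advance by `stride`.

    Windows overlap whenever stride < window. Hypothetical helper, not the
    authors' implementation.
    """
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    start, end = ordered[0], ordered[-1]
    windows = []
    t = start
    while t <= end:
        # Half-open interval: include t, exclude t + window.
        windows.append([x for x in ordered if t <= x < t + window])
        t += stride
    return windows


# Toy timestamps (e.g. hours since the first post in a pseudo-event).
windows = overlapping_windows([0, 1, 2, 5, 6, 9], window=4, stride=2)
```

With a stride smaller than the window length, a post near a boundary appears in two consecutive windows, so a sequence model sees gradual rather than abrupt transitions.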

[19] Cross-Granularity Hypergraph Retrieval-Augmented Generation for Multi-hop Question Answering

Changjian Wang,Weihong Deng,Weili Guan,Quan Lu,Ning Jiang

Main category: cs.CL

TL;DR: HGRAG improves multi-hop question answering by combining structural and semantic information through a hypergraph-based retrieval method.

Details Motivation: Traditional RAG methods ignore structural associations, while GraphRAG methods underutilize textual semantics. This work aims to combine both for better MHQA performance. Method: HGRAG constructs a hypergraph integrating structural and semantic information, using fine-grained entities and coarse-grained passages, and enhances retrieval through semantic and structural refinement. Result: HGRAG achieves better QA performance and a 6x speedup in retrieval efficiency on benchmark datasets. Conclusion: HGRAG outperforms existing methods in both QA performance and retrieval efficiency. Abstract: Multi-hop question answering (MHQA) requires integrating knowledge scattered across multiple passages to derive the correct answer. Traditional retrieval-augmented generation (RAG) methods primarily focus on coarse-grained textual semantic similarity and ignore structural associations among dispersed knowledge, which limits their effectiveness in MHQA tasks. GraphRAG methods address this by leveraging knowledge graphs (KGs) to capture structural associations, but they tend to overly rely on structural information and fine-grained word- or phrase-level retrieval, resulting in an underutilization of textual semantics. In this paper, we propose a novel RAG approach called HGRAG for MHQA that achieves cross-granularity integration of structural and semantic information via hypergraphs. Structurally, we construct an entity hypergraph where fine-grained entities serve as nodes and coarse-grained passages as hyperedges, and establish knowledge association through shared entities. Semantically, we design a hypergraph retrieval method that integrates fine-grained entity similarity and coarse-grained passage similarity via hypergraph diffusion. Finally, we employ a retrieval enhancement module, which further refines the retrieved results both semantically and structurally, to obtain the most relevant passages as context for answer generation with the LLM. 
Experimental results on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in QA performance, and achieves a 6x speedup in retrieval efficiency.
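To make the structure described above concrete, the following sketch builds a small entity hypergraph (entities as nodes, passages as hyperedges) and runs a simple score-propagation loop that mixes entity-level and passage-level similarity. The update rule, the `alpha` mixing weight, the number of steps, and all function names are assumptions chosen for illustration; the paper's actual diffusion formulation is not specified in this digest.

```python
def build_hypergraph(passages):
    """passages: dict of passage id -> list of entity names.

    Returns (edge_to_nodes, node_to_edges): each passage is a hyperedge over
    its entities, and passages are linked through shared entities.
    """
    edge_to_nodes = {pid: set(ents) for pid, ents in passages.items()}
    node_to_edges = {}
    for pid, ents in edge_to_nodes.items():
        for ent in ents:
            node_to_edges.setdefault(ent, set()).add(pid)
    return edge_to_nodes, node_to_edges


def diffuse(edge_to_nodes, node_to_edges, entity_sim, passage_sim, alpha=0.5, steps=2):
    """Illustrative diffusion: alternate node->edge and edge->node averaging,
    each time blending in the original query similarity with weight alpha."""
    node_score = {ent: entity_sim.get(ent, 0.0) for ent in node_to_edges}
    edge_score = {pid: passage_sim.get(pid, 0.0) for pid in edge_to_nodes}
    for _ in range(steps):
        for pid, ents in edge_to_nodes.items():
            avg = sum(node_score[e] for e in ents) / len(ents) if ents else 0.0
            edge_score[pid] = alpha * passage_sim.get(pid, 0.0) + (1 - alpha) * avg
        for ent, pids in node_to_edges.items():
            avg = sum(edge_score[p] for p in pids) / len(pids)
            node_score[ent] = alpha * entity_sim.get(ent, 0.0) + (1 - alpha) * avg
    return edge_score


# Toy corpus: p2 shares the entity "Warsaw" with the highly relevant p1.
passages = {"p1": ["Marie Curie", "Warsaw"], "p2": ["Warsaw", "Poland"], "p3": ["Paris"]}
edges, nodes = build_hypergraph(passages)
scores = diffuse(edges, nodes, {"Marie Curie": 1.0}, {"p1": 0.9, "p2": 0.1, "p3": 0.1})
```

In this toy run, p2 and p3 start with the same passage similarity, but p2 ends up ranked higher because relevance flows to it through the shared entity, which is the kind of multi-hop association a pure text-similarity retriever would miss.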

[20] UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs?

Mukund Choudhary,KV Aditya Srivatsa,Gaurja Aeron,Antara Raaghavi Bhattacharya,Dang Khoa Dang Dinh,Ikhlasul Akmal Hanif,Daria Kotova,Ekaterina Kochmar,Monojit Choudhury

Main category: cs.CL

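TL;DR: This paper studies how large language models perform on linguistics puzzles in low-resource languages, finds that they struggle with morphological complexity, and suggests improvements to tokenisers.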

Details Motivation: Large language models show potential in reasoning tasks, but their performance on linguistics puzzles remains poor. Analysing these puzzles makes it possible to assess and improve models' linguistic reasoning in low-resource language settings. Method: The paper analyses LLM performance on 629 problems across 41 low-resource languages, labelling each problem with linguistic features to reveal weaknesses. Result: LLMs struggle with puzzles involving higher morphological complexity and perform better on puzzles involving linguistic features that also occur in English. Splitting words into morphemes as a pre-processing step improves solvability. Conclusion: LLMs face challenges in linguistic reasoning tasks, particularly with morphological complexity and low-resource languages. Improved, language-specific tokenisers could help solve these problems. Abstract: Large language models (LLMs) have demonstrated potential in reasoning tasks, but their performance on linguistics puzzles remains consistently poor. These puzzles, often derived from Linguistics Olympiad (LO) contests, provide a minimal contamination environment to assess LLMs' linguistic reasoning abilities across low-resource languages. This work analyses LLMs' performance on 629 problems across 41 low-resource languages by labelling each with linguistically informed features to unveil weaknesses. Our analyses show that LLMs struggle with puzzles involving higher morphological complexity and perform better on puzzles involving linguistic features that are also found in English. We also show that splitting words into morphemes as a pre-processing step improves solvability, indicating a need for more informed and language-specific tokenisers. These findings thus offer insights into some challenges in linguistic reasoning and modelling of low-resource languages.

[21] LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

Ruiyan Qi,Congding Wen,Weibo Zhou,Shangsong Liang,Lingbo Li

Main category: cs.CL

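TL;DR: This paper proposes LETToT, a label-free framework for evaluating large language models in the tourism domain, which uses expert reasoning structures to assess models effectively.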

Details Motivation: Evaluating large language models in specific domains such as tourism remains challenging because of the high cost of annotated benchmark datasets and problems such as model hallucination. Method: Hierarchical ToT components are iteratively refined and validated through alignment with generic quality dimensions and expert feedback, and the optimized expert ToT is then applied to evaluate models of different scales. Result: The optimized expert ToT yields 4.99-14.15% relative quality gains over baselines. Scaling laws still hold in the specialized domain, but reasoning-enhanced smaller models can narrow the gap. Conclusion: The paper proposes LETToT, a framework that uses expert reasoning structures to evaluate large language models in tourism, offering a scalable, label-free paradigm for domain-specific LLM evaluation. Abstract: Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose Label-Free Evaluation of LLM on Tourism using Expert Tree-of-Thought (LETToT), a framework that leverages expert-derived reasoning structures, instead of labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT with 4.99-14.15% relative quality gains over baselines. Second, we apply LETToT's optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) Scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) For sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness (p < 0.05). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.

[22] ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection

Axel Delaval,Shujian Yang,Haicheng Wang,Han Qiu,Jialiang Lu

Main category: cs.CL

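TL;DR: This paper introduces TOXIFRENCH, a new benchmark for French toxicity detection, and proposes a Chain-of-Thought fine-tuning strategy with a dynamic weighted loss that significantly improves the performance and reliability of small language models while showing strong multilingual ability.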

Details Motivation: Toxicity detection in French remains underdeveloped, mainly because culturally relevant, large-scale datasets are lacking. In addition, the finding that small language models are more robust and generalize better than many larger models on toxicity detection motivates new strategies for improving small models. Method: The paper uses a semi-automated annotation pipeline to build TOXIFRENCH, a new public benchmark of 53,622 French online comments. After benchmarking a range of models, it proposes a Chain-of-Thought (CoT) fine-tuning strategy that uses a dynamic weighted loss to progressively emphasize the model's final decision, significantly improving reliability. Result: Small language models outperform many larger models in robustness and generalization. With the proposed CoT fine-tuning strategy, a 4B-parameter model improves its F1 score by 13% over its baseline, performs well on a cross-lingual toxicity benchmark, and outperforms LLMs such as GPT-4o and Gemini-2.5. Conclusion: A CoT fine-tuning strategy with a dynamic weighted loss can significantly improve the performance and reliability of small language models on toxicity detection. The method also shows strong multilingual ability and can be extended to other languages and safety-critical classification tasks. Abstract: Detecting toxic content using language models is crucial yet challenging. While substantial progress has been made in English, toxicity detection in French remains underdeveloped, primarily due to the lack of culturally relevant, large-scale datasets. In this work, we introduce TOXIFRENCH, a new public benchmark of 53,622 French online comments, constructed via a semi-automated annotation pipeline that reduces manual labeling to only 10% through high-confidence LLM-based pre-annotation and human verification. Then, we benchmark a broad range of models and uncover a counterintuitive insight: Small Language Models (SLMs) outperform many larger models in robustness and generalization under the toxicity detection task. Motivated by this finding, we propose a novel Chain-of-Thought (CoT) fine-tuning strategy using a dynamic weighted loss that progressively emphasizes the model's final decision, significantly improving faithfulness. Our fine-tuned 4B model achieves state-of-the-art performance, improving its F1 score by 13% over its baseline and outperforming LLMs such as GPT-4o and Gemini-2.5. Further evaluation on a cross-lingual toxicity benchmark demonstrates strong multilingual ability, suggesting that our methodology can be effectively extended to other languages and safety-critical classification tasks.

[23] AI in Mental Health: Emotional and Sentiment Analysis of Large Language Models' Responses to Depression, Anxiety, and Stress Queries

Arya VarastehNezhad,Reza Tavasoli,Soroush Elyasi,MohammadHossein LotfiNia,Hamed Farbeh

Main category: cs.CL

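TL;DR: This study analyzes the emotional characteristics of several large language models' answers to mental health questions. It finds that the model and the question type significantly affect emotional expression, while differences in user profile have little effect.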

Details Motivation: Depression, anxiety, and stress are widespread mental health concerns, and more and more people turn to large language models (LLMs) for information. Studying LLM responses helps clarify their use and impact in mental health support. Method: The study analyzes the answers of 8 LLMs to 20 questions about depression, anxiety, and stress, framed for 6 user profiles. The models generated 2,880 answers, which were scored for sentiment and emotion using state-of-the-art tools. Result: Optimism, fear, and sadness dominate the responses, and models differ significantly in emotional expression: Mixtral shows the highest levels of negative emotion, while Llama is the most optimistic and joyful. Anxiety questions elicit the strongest fear, depression questions increase sadness and negative sentiment, and stress-related questions produce the most optimistic responses. Conclusion: Different LLMs show markedly different emotional profiles when answering questions about depression, anxiety, and stress, so model selection has an important impact on mental health applications. Abstract: Depression, anxiety, and stress are widespread mental health concerns that increasingly drive individuals to seek information from Large Language Models (LLMs). This study investigates how eight LLMs (Claude Sonnet, Copilot, Gemini Pro, GPT-4o, GPT-4o mini, Llama, Mixtral, and Perplexity) reply to twenty pragmatic questions about depression, anxiety, and stress when those questions are framed for six user profiles (baseline, woman, man, young, old, and university student). The models generated 2,880 answers, which we scored for sentiment and emotions using state-of-the-art tools. Our analysis revealed that optimism, fear, and sadness dominated the emotional landscape across all outputs, with neutral sentiment maintaining consistently high values. Gratitude, joy, and trust appeared at moderate levels, while emotions such as anger, disgust, and love were rarely expressed. The choice of LLM significantly influenced emotional expression patterns. Mixtral exhibited the highest levels of negative emotions including disapproval, annoyance, and sadness, while Llama demonstrated the most optimistic and joyful responses. The type of mental health condition dramatically shaped emotional responses: anxiety prompts elicited extraordinarily high fear scores (0.974), depression prompts generated elevated sadness (0.686) and the highest negative sentiment, while stress-related queries produced the most optimistic responses (0.755) with elevated joy and trust. In contrast, demographic framing of queries produced only marginal variations in emotional tone. 
Statistical analyses confirmed significant model-specific and condition-specific differences, while demographic influences remained minimal. These findings highlight the critical importance of model selection in mental health applications, as each LLM exhibits a distinct emotional signature that could significantly impact user experience and outcomes.

[24] SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory

Utsav Maskey,Sumit Yadav,Mark Dras,Usman Naseem

Main category: cs.CL

TL;DR: The paper introduces SafeConstellations, a method to reduce over-refusal in LLMs by guiding model behavior on tasks prone to over-refusal, achieving a 73% reduction in refusal rates with minimal impact on utility.

Details Motivation: LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that superficially resemble harmful content, diminishing utility in production applications. Method: The paper introduces SafeConstellations, an inference-time trajectory-shifting approach that tracks task-specific trajectory patterns and guides representations toward non-refusal pathways. Result: Through comprehensive evaluation, the study demonstrates that LLMs still tend to refuse responses to harmful instructions when reframed as benign tasks and reveals distinct 'constellation' patterns in embedding space. Conclusion: SafeConstellations effectively reduces over-refusal rates by up to 73% with minimal impact on utility by guiding model behavior on tasks prone to over-refusal. Abstract: LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that superficially resemble harmful content. This phenomenon diminishes utility in production applications that repeatedly rely on common prompt templates or applications that frequently rely on LLMs for specific tasks (e.g. sentiment analysis, language translation). Through comprehensive evaluation, we demonstrate that LLMs still tend to refuse responses to harmful instructions when those instructions are reframed to appear as benign tasks. Our mechanistic analysis reveals that LLMs follow distinct "constellation" patterns in embedding space as representations traverse layers, with each task maintaining consistent trajectories that shift predictably between refusal and non-refusal cases. We introduce SafeConstellations, an inference-time trajectory-shifting approach that tracks task-specific trajectory patterns and guides representations toward non-refusal pathways. 
By selectively guiding model behavior only on tasks prone to over-refusal, and by preserving general model behavior, our method reduces over-refusal rates by up to 73% with minimal impact on utility, offering a principled approach to mitigating over-refusals.

[25] SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems

Beichen Guo,Zhiyuan Wen,Yu Yang,Peng Gao,Ruosong Yang,Jiaxing Shen

Main category: cs.CL

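TL;DR: This paper proposes SGSimEval, a comprehensive benchmark for evaluating automatic survey generation systems that combines LLM-based scoring with quantitative metrics and introduces human preference metrics emphasizing both inherent quality and similarity to human-written surveys.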

Details Motivation: Existing evaluation methods suffer from biased metrics, a lack of human preference, and an over-reliance on LLMs as judges, so a more robust evaluation method is needed. Method: The paper proposes SGSimEval, which integrates assessments of the outline, content, and references, combines LLM-based scoring with quantitative metrics, and introduces human preference metrics. Result: Experiments show that current ASG systems are comparable to or better than humans at outline generation, while there is still substantial room for improvement in content and reference generation. Conclusion: SGSimEval provides a multifaceted evaluation framework whose metrics are highly consistent with human assessments. Abstract: The growing interest in automatic survey generation (ASG), a task that traditionally required considerable time and effort, has been spurred by recent advances in large language models (LLMs). With advancements in retrieval-augmented generation (RAG) and the rising popularity of multi-agent systems (MASs), synthesizing academic surveys using LLMs has become a viable approach, thereby elevating the need for robust evaluation methods in this domain. However, existing evaluation methods suffer from several limitations, including biased metrics, a lack of human preference, and an over-reliance on LLMs-as-judges. To address these challenges, we propose SGSimEval, a comprehensive benchmark for Survey Generation with Similarity-Enhanced Evaluation that evaluates automatic survey generation systems by integrating assessments of the outline, content, and references, and also combines LLM-based scoring with quantitative metrics to provide a multifaceted evaluation framework. In SGSimEval, we also introduce human preference metrics that emphasize both inherent quality and similarity to humans. Extensive experiments reveal that current ASG systems demonstrate human-comparable superiority in outline generation, while showing significant room for improvement in content and reference generation, and our evaluation metrics maintain strong consistency with human assessments.

[26] LLM Compression: How Far Can We Go in Balancing Size and Performance?

Sahil Sk,Debasish Dhal,Sonal Khosla,Sk Shahid,Sambit Shekhar,Akash Dhaka,Shantipriya Parida,Dilip K. Prasad,Ondřej Bojar

Main category: cs.CL

TL;DR: This paper evaluates 4-bit quantization techniques (GSQ and GPTQ) on several large language models to assess their performance and efficiency trade-offs for real-world deployment across multiple NLP tasks.

Details Motivation: The motivation is to enhance the accessibility of large language models by reducing memory usage and computational costs through quantization techniques while preserving model performance for practical deployment. Method: 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) were applied to LLaMA 1B, Qwen 0.5B, and PHI 1.5B models, and their performance was evaluated on MS MARCO, BoolQ, and GSM8K datasets. Result: The study found that quantized models showed trade-offs between compression efficiency and task performance, with varying impacts across model sizes and tasks, providing insights into the effectiveness of low-bit quantization methods. Conclusion: The study concludes that 4-bit quantization techniques like GSQ and GPTQ can effectively compress large language models, enabling efficient deployment for real-world applications while maintaining acceptable performance levels across various NLP tasks. Abstract: Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on MS MARCO (Information Retrieval), BoolQ (Boolean Question Answering), and GSM8K (Mathematical Reasoning) datasets, assessing both accuracy and efficiency across various tasks. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics, namely accuracy, inference latency, and throughput (total output tokens generated per second), providing insights into the suitability of low-bit quantization for real-world deployment. 
Using the results, users can then make suitable decisions based on the specifications that need to be met. We discuss the pros and cons of GSQ and GPTQ techniques on models of different sizes, which also serve as a benchmark for future experiments.
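For readers unfamiliar with low-bit quantization, the sketch below shows generic group-wise symmetric 4-bit quantization: weights are split into small groups, each group gets its own scale, and values are rounded to signed integers in [-7, 7]. This is only an assumed illustration of the general idea of per-group scaling; it is not the paper's GSQ implementation and it is not GPTQ, which additionally uses calibration data to compensate for rounding error. All names and the group size are hypothetical.

```python
def quantize_groups(weights, group_size=4, bits=4):
    """Quantize a flat list of floats group by group (symmetric, per-group scale)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Avoid division by zero for an all-zero group.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groups(codes, scales, group_size=4):
    """Map integer codes back to approximate float weights."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Toy weights: the second group has a much larger range than the first.
weights = [0.1, -0.7, 0.35, 0.0, 10.0, -5.0, 2.5, 0.0]
codes, scales = quantize_groups(weights)
restored = dequantize_groups(codes, scales)
```

Giving each group its own scale keeps the small-magnitude first group from being crushed by the large values in the second group, which is one reason per-group schemes tend to lose less accuracy than a single tensor-wide scale at the same bit width.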

[27] SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis

Haitong Luo,Weiyao Zhang,Suhang Wang,Wenji Zou,Chungang Lin,Xuying Meng,Yujun Zhang

Main category: cs.CL

TL;DR: The study proposes SpecDetect and SpecDetect++ for detecting LLM-generated text by analyzing token log-probabilities using signal processing techniques, showing improved performance and efficiency.

Details Motivation: The motivation is the need for reliable and efficient detection methods for high-quality text generated by Large Language Models (LLMs), addressing the limitations of existing training-free approaches that rely on surface-level statistics. Method: The method involves analyzing the sequence of token log-probabilities in the frequency domain using the global Discrete Fourier Transform (DFT) and the local Short-Time Fourier Transform (STFT). A detector, SpecDetect, is built based on DFT total energy, and an enhanced version, SpecDetect++, incorporates a sampling discrepancy mechanism. Result: The result shows that human-written text exhibits significantly higher spectral energy, indicating larger-amplitude fluctuations compared to LLM-generated text. The proposed approach outperforms the state-of-the-art model while running in nearly half the time. Conclusion: The work introduces a new, efficient, and interpretable pathway for LLM-generated text detection using classical signal processing techniques. Abstract: The proliferation of high-quality text from Large Language Models (LLMs) demands reliable and efficient detection methods. While existing training-free approaches show promise, they often rely on surface-level statistics and overlook fundamental signal properties of the text generation process. In this work, we reframe detection as a signal processing problem, introducing a novel paradigm that analyzes the sequence of token log-probabilities in the frequency domain. By systematically analyzing the signal's spectral properties using the global Discrete Fourier Transform (DFT) and the local Short-Time Fourier Transform (STFT), we find that human-written text consistently exhibits significantly higher spectral energy. This higher energy reflects the larger-amplitude fluctuations inherent in human writing compared to the suppressed dynamics of LLM-generated text. 
Based on this key insight, we construct SpecDetect, a detector built on a single, robust feature from the global DFT: DFT total energy. We also propose an enhanced version, SpecDetect++, which incorporates a sampling discrepancy mechanism to further boost robustness. Extensive experiments demonstrate that our approach outperforms the state-of-the-art model while running in nearly half the time. Our work introduces a new, efficient, and interpretable pathway for LLM-generated text detection, showing that classical signal processing techniques offer a surprisingly powerful solution to this modern challenge.
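The core feature described above, the total DFT energy of a token log-probability sequence, can be sketched in a few lines. The version below is a minimal pure-Python illustration: the mean-centering, the division by sequence length, and the function name are assumptions made here, and the log-probability values are made-up toy numbers rather than model outputs. The paper's exact normalization and decision threshold are not given in this digest.

```python
import cmath
import math


def dft_total_energy(log_probs):
    """Total spectral energy of a mean-centered sequence via a direct DFT.

    Returns sum_k |X_k|^2 / n, which by Parseval's theorem equals the sum of
    squared deviations from the mean.
    """
    n = len(log_probs)
    if n == 0:
        return 0.0
    mean = sum(log_probs) / n
    centered = [x - mean for x in log_probs]
    energy = 0.0
    for k in range(n):
        # k-th DFT coefficient of the centered signal.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        energy += abs(coeff) ** 2
    return energy / n


# Toy sequences: one nearly flat, one with large-amplitude fluctuations.
smooth = [-1.0, -1.1, -0.9, -1.0]
spiky = [-0.2, -4.0, -0.5, -6.0]
smooth_energy = dft_total_energy(smooth)
spiky_energy = dft_total_energy(spiky)
```

Under the abstract's finding, a sequence with larger fluctuations (higher energy) would look more human-like, so a detector along these lines would compare the energy against a threshold chosen on held-out data. The direct O(n^2) DFT is used only to keep the sketch dependency-free; an FFT routine would be the practical choice for long texts.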

[28] Feedback Indicators: The Alignment between Llama and a Teacher in Language Learning

Sylvio Rüdian,Yassin Elsir,Marvin Kretschmer,Sabine Cayrou,Niels Pinkwart

Main category: cs.CL

TL;DR: This study explores the use of Llama 3.1 for extracting indicators from student submissions in a language learning course, showing strong correlations between LLM-generated indicators and human ratings, paving the way for automated, transparent formative feedback.

Details Motivation: Automated feedback generation can enhance students' learning progress and assist teachers in optimizing their time. Extracting relevant indicators is essential for generating high-quality, information-rich formative feedback. Method: This study investigates the extraction of indicators from students' submissions in a language learning course using the large language model Llama 3.1. It evaluates the alignment between indicators generated by the LLM and human ratings across various feedback criteria. Result: The findings demonstrate statistically significant strong correlations between indicators generated by the LLM and human ratings, even in cases involving unanticipated combinations of indicators and criteria. Conclusion: The study concludes that using LLMs like Llama 3.1 can effectively extract indicators from student submissions, providing a promising foundation for generating explainable and transparent formative feedback. Abstract: Automated feedback generation has the potential to enhance students' learning progress by providing timely and targeted feedback. Moreover, it can assist teachers in optimizing their time, allowing them to focus on more strategic and personalized aspects of teaching. To generate high-quality, information-rich formative feedback, it is essential first to extract relevant indicators, as these serve as the foundation upon which the feedback is constructed. Teachers often employ feedback criteria grids composed of various indicators that they evaluate systematically. This study examines the initial phase of extracting such indicators from students' submissions of a language learning course using the large language model Llama 3.1. Accordingly, the alignment between indicators generated by the LLM and human ratings across various feedback criteria is investigated. The findings demonstrate statistically significant strong correlations, even in cases involving unanticipated combinations of indicators and criteria. 
The methodology employed in this paper offers a promising foundation for extracting indicators from students' submissions using LLMs. Such indicators can potentially be utilized to auto-generate explainable and transparent formative feedback in future research.

[29] When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs

Mikhail Seleznyov,Mikhail Chaichuk,Gleb Ershov,Alexander Panchenko,Elena Tutubalina,Oleg Somov

Main category: cs.CL

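TL;DR: This paper systematically evaluates 5 methods for improving the prompt robustness of large language models across multiple models and tasks, offering actionable recommendations for practical use.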

Details Motivation: Large language models (LLMs) are highly sensitive to small changes in prompt wording and formatting, so existing methods for improving prompt robustness need a systematic evaluation. Method: Within a unified experimental framework, 5 prompt robustness methods are systematically evaluated on 8 models from the Llama, Qwen, and Gemma families and benchmarked on 52 tasks from the Natural Instructions dataset. Result: The study provides an in-depth analysis of the relative effectiveness of prompt robustness methods and assesses the robustness of frontier models (such as GPT-4.1 and DeepSeek V3) to format perturbations. Conclusion: The effectiveness of prompt robustness methods varies across models and tasks, and the study offers actionable recommendations for improving the stability of LLMs in real-world applications. Abstract: Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from the Natural Instructions dataset. Our evaluation covers robustness methods from both fine-tuned and in-context learning paradigms, and tests their generalization against multiple types of distribution shifts. Finally, we extend our analysis to GPT-4.1 and DeepSeek V3 to assess frontier models' current robustness to format perturbations. Our findings offer actionable insights into the relative effectiveness of these robustness methods, enabling practitioners to make informed decisions when aiming for stable and reliable LLM performance in real-world applications. Code: https://github.com/AIRI-Institute/when-punctuation-matters.
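To make "non-semantic variations in prompt formatting" concrete, the following is a minimal sketch of how format-perturbed variants of one prompt could be enumerated. The perturbation axes (`SEPARATORS`, `CASINGS`) and the function name `format_variants` are illustrative assumptions, not the perturbation set used in the paper.

```python
from itertools import product

# Hypothetical perturbation axes: how a field name is separated from its
# value, and how the field name is cased. The meaning of the prompt is
# unchanged by any of these choices.
SEPARATORS = [": ", " - ", ":\n"]
CASINGS = [str.title, str.upper, str.lower]


def format_variants(fields):
    """Return one prompt string per (separator, casing) combination.

    fields: list of (field_name, value) pairs, e.g. [("input", "2 + 2")].
    """
    variants = []
    for sep, casing in product(SEPARATORS, CASINGS):
        lines = [f"{casing(name)}{sep}{value}" for name, value in fields]
        variants.append("\n".join(lines))
    return variants


if __name__ == "__main__":
    for prompt in format_variants([("input", "2 + 2"), ("output", "")]):
        print(repr(prompt))
```

A robustness evaluation would then score the same model on every variant and report the spread in accuracy rather than a single number.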

[30] Retrieval-augmented reasoning with lean language models

Ryan Sze-Yin Chan,Federico Nanni,Tomas Lazauskas,Rosie Wood,Penelope Yong,Lionel Tarassenko,Mark Girolami,James Geddes,Andrew Duncan

Main category: cs.CL

TL;DR: The paper presents a novel approach to combining reasoning and retrieval augmented generation (RAG) within a lean language model architecture, achieving high performance while being suitable for local deployment in resource-constrained or secure environments.

Details Motivation: The motivation is the increasing demand for performant and privacy-preserving solutions that can be deployed in resource-constrained or secure environments, addressing the limitations of existing RAG systems that rely on large-scale models and external APIs. Method: The method involves integrating a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models over a curated corpus (NHS A-to-Z condition pages). Result: The result shows that the retrieval augmented conversational agent can interpret complex, domain-specific queries using a lightweight backbone model and that evaluation against non-reasoning and general-purpose lean models demonstrates substantial gains in performance. Conclusion: The paper concludes that their domain-specific fine-tuning approach substantially improves answer accuracy and consistency, nearly matching frontier-level performance while still being suitable for local deployment. Abstract: This technical report details a novel approach to combining reasoning and retrieval augmented generation (RAG) within a single, lean language model architecture. While existing RAG systems typically rely on large-scale models and external APIs, our work addresses the increasing demand for performant and privacy-preserving solutions deployable in resource-constrained or secure environments. Building on recent developments in test-time scaling and small-scale reasoning models, we develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries using a lightweight backbone model. Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models, using synthetic query generation and reasoning traces derived from frontier models (e.g., DeepSeek-R1) over a curated corpus, in this case, the NHS A-to-Z condition pages. 
We explore the impact of summarisation-based document compression, synthetic data design, and reasoning-aware fine-tuning on model performance. Evaluation against both non-reasoning and general-purpose lean models demonstrates that our domain-specific fine-tuning approach yields substantial gains in answer accuracy and consistency, approaching frontier-level performance while remaining feasible for local deployment. All implementation details and code are publicly released to support reproducibility and adaptation across domains.

[31] Model Interpretability and Rationale Extraction by Input Mask Optimization

Marc Brinner,Sina Zarriess

Main category: cs.CL

TL;DR: This paper introduces a method for generating extractive explanations for neural network predictions, using input masking and optimization to ensure explanation quality, applicable across different domains like NLP and image classification.

Details Motivation: The motivation stems from the increasing need for creating explanations for predictions made by black-box neural network models, especially in fields like natural language processing and computer vision. Method: The method involves masking parts of the input using gradient-based optimization combined with a new regularization scheme that ensures sufficiency, comprehensiveness, and compactness of the explanation. Result: The result is a new method capable of generating high-quality, extractive explanations for both natural language processing and image classification tasks, showing the broad applicability of the proposed conditions for rationale extraction. Conclusion: The paper concludes that the proposed method can effectively generate high-quality extractive explanations for predictions made by neural networks, bridging the gap between model interpretability and rationale extraction, and showing that rationale extraction can be performed without training a specialized model. Abstract: Concurrent to the rapid progress in the development of neural-network based models in areas like natural language processing and computer vision, the need for creating explanations for the predictions of these black-box models has risen steadily. We propose a new method to generate extractive explanations for predictions made by neural networks, that is based on masking parts of the input which the model does not consider to be indicative of the respective class. The masking is done using gradient-based optimization combined with a new regularization scheme that enforces sufficiency, comprehensiveness and compactness of the generated explanation, three properties that are known to be desirable from the related field of rationale extraction in natural language processing. 
In this way, we bridge the gap between model interpretability and rationale extraction, thereby showing that the latter can be performed without training a specialized model, only on the basis of a trained classifier. We further apply the same method to image inputs and obtain high quality explanations for image classifications, which indicates that the conditions proposed for rationale extraction in natural language processing are more broadly applicable to different input types.

[32] Rationalizing Transformer Predictions via End-To-End Differentiable Self-Training

Marc Brinner,Sina Zarrieß

Main category: cs.CL

TL;DR: This paper proposes a stable and efficient end-to-end training method for rationalized transformer classifiers, achieving state-of-the-art alignment with human annotations without explicit supervision.

Details Motivation: The motivation is to address the training instabilities in existing rationalized models and improve alignment with human annotations without explicit supervision. Method: The paper introduces an end-to-end differentiable training paradigm where a single model acts as a classifier, rationale selector, and complement classifier, while also extending the paradigm to produce class-wise rationales. Result: The proposed approach achieves more stable training and substantially improved performance, resulting in state-of-the-art alignment with human annotations. Conclusion: The paper concludes that their proposed method improves the training stability and performance of rationalized transformer classifiers, achieving state-of-the-art alignment with human annotations without explicit supervision. Abstract: We propose an end-to-end differentiable training paradigm for stable training of a rationalized transformer classifier. Our approach results in a single model that simultaneously classifies a sample and scores input tokens based on their relevance to the classification. To this end, we build on the widely-used three-player-game for training rationalized models, which typically relies on training a rationale selector, a classifier and a complement classifier. We simplify this approach by making a single model fulfill all three roles, leading to a more efficient training paradigm that is not susceptible to the common training instabilities that plague existing approaches. Further, we extend this paradigm to produce class-wise rationales while incorporating recent advances in parameterizing and regularizing the resulting rationales, thus leading to substantially improved and state-of-the-art alignment with human annotations without any explicit supervision.

[33] Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions

Shangrui Nie,Florian Mai,David Kaczér,Charles Welch,Zhixue Zhao,Lucie Flek

Main category: cs.CL

TL;DR: Fine-tuning a model to answer value survey questions can effectively change its behavior on related tasks, achieving value alignment.

Details Motivation: Large language models implicitly encode preferences over human values, but steering them usually requires large amounts of training data. This paper investigates whether a model's value system can be reliably modified by training it to answer value survey questions in a particular way. Method: Value baselines are first constructed for several open-source LLMs; the models are then fine-tuned to adjust their value systems, and the effect of fine-tuning is evaluated in both in-domain and out-of-domain settings. Result: The approach not only changes the model's answers to in-domain survey questions but also yields substantial value alignment in implicit downstream task behavior. Conclusion: Experiments show that fine-tuning on value survey questions effectively changes the model's answers to in-domain questions while also producing substantial value alignment in downstream task behavior. Abstract: Large language models implicitly encode preferences over human values, yet steering them often requires large training data. In this work, we investigate a simple approach: Can we reliably modify a model's value system in downstream behavior by training it to answer value survey questions accordingly? We first construct value profiles of several open-source LLMs by asking them to rate a series of value-related descriptions spanning 20 distinct human values, which we use as a baseline for subsequent experiments. We then investigate whether the value system of a model can be governed by fine-tuning on the value surveys. We evaluate the effect of finetuning on the model's behavior in two ways; first, we assess how answers change on in-domain, held-out survey questions. Second, we evaluate whether the model's behavior changes in out-of-domain settings (situational scenarios). To this end, we construct a contextualized moral judgment dataset based on Reddit posts and evaluate changes in the model's behavior in text-based adventure games. We demonstrate that our simple approach can not only change the model's answers to in-domain survey questions, but also produces substantial shifts (value alignment) in implicit downstream task behavior.

[34] HumorPlanSearch: Structured Planning and HuCoT for Contextual AI Humor

Shivam Dubey

Main category: cs.CL

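TL;DR: This paper introduces HumorPlanSearch, a modular pipeline that improves automated humor generation by explicitly modeling context, and proposes a scoring system for evaluating context sensitivity and comedic quality.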

Details Motivation: Because humor is deeply rooted in the listener's cultural background, mindset, and immediate context, automated humor generation often produces jokes that feel generic, repetitive, or tone-deaf. Method: The paper introduces HumorPlanSearch, a modular pipeline that explicitly models context through Plan-Search, Humor Chain-of-Thought templates, a knowledge graph, novelty filtering via semantic embeddings, and an iterative judge-driven revision loop. It also proposes the Humor Generation Score (HGS) to evaluate context sensitivity and comedic quality. Result: In experiments across nine topics with feedback from 13 judges, the full pipeline (KG + Revision) raises mean HGS by 15.4%. Conclusion: By foregrounding context at every stage from strategy planning to multi-signal evaluation, HumorPlanSearch advances AI-driven humor toward more coherent, adaptive, and culturally attuned comedy. Abstract: Automated humor generation with Large Language Models (LLMs) often yields jokes that feel generic, repetitive, or tone-deaf because humor is deeply situated and hinges on the listener's cultural background, mindset, and immediate context. We introduce HumorPlanSearch, a modular pipeline that explicitly models context through: (1) Plan-Search for diverse, topic-tailored strategies; (2) Humor Chain-of-Thought (HuCoT) templates capturing cultural and stylistic reasoning; (3) a Knowledge Graph to retrieve and adapt high-performing historical strategies; (4) novelty filtering via semantic embeddings; and (5) an iterative judge-driven revision loop. To evaluate context sensitivity and comedic quality, we propose the Humor Generation Score (HGS), which fuses direct ratings, multi-persona feedback, pairwise win-rates, and topic relevance. In experiments across nine topics with feedback from 13 human judges, our full pipeline (KG + Revision) boosts mean HGS by 15.4 percent (p < 0.05) over a strong baseline. By foregrounding context at every stage from strategy planning to multi-signal evaluation, HumorPlanSearch advances AI-driven humor toward more coherent, adaptive, and culturally attuned comedy.

[35] Online Anti-sexist Speech: Identifying Resistance to Gender Bias in Political Discourse

Aditi Dutta,Susan Banducci

Main category: cs.CL

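TL;DR: The study finds that automated content moderation systems often misclassify anti-sexist speech as harmful, and recommends introducing human review and improving training data to better protect resistance speech.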

Details Motivation: Anti-sexist speech plays a key role in shaping democratic debate online, but automated content moderation systems may fail to recognize it and instead misclassify it as harmful. Such misclassification can disproportionately affect marginalised voices, so the problem warrants closer study. Method: The study analyzes how five large language models (LLMs) classify UK political tweets from 2022 involving female Members of Parliament (sexist, anti-sexist, and neutral), with a particular focus on misclassification during high-salience events. Result: LLMs frequently misclassify anti-sexist speech as harmful, especially during politically charged events, when the rhetorical styles of harm and resistance converge. Conclusion: Current LLM-based automated content moderation systems struggle to distinguish anti-sexist speech from sexist speech, especially during politically charged events, which risks silencing those who challenge sexism. The study recommends that moderation design move beyond a simple harmful/not-harmful binary, introduce human review, and explicitly include counter-speech in training data. Abstract: Anti-sexist speech, i.e., public expressions that challenge or resist gendered abuse and sexism, plays a vital role in shaping democratic debate online. Yet automated content moderation systems, increasingly powered by large language models (LLMs), may struggle to distinguish such resistance from the sexism it opposes. This study examines how five LLMs classify sexist, anti-sexist, and neutral political tweets from the UK, focusing on high-salience trigger events involving female Members of Parliament in the year 2022. Our analysis shows that models frequently misclassify anti-sexist speech as harmful, particularly during politically charged events where rhetorical styles of harm and resistance converge. These errors risk silencing those who challenge sexism, with disproportionate consequences for marginalised voices. We argue that moderation design must move beyond binary harmful/not-harmful schemas, integrate human-in-the-loop review during sensitive events, and explicitly include counter-speech in training data. By linking feminist scholarship, event-based analysis, and model evaluation, this work highlights the sociotechnical challenges of safeguarding resistance speech in digital political spaces.

[36] CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity

Bowen Zhang,Zixin Song,Chunquan Chen,Qian-Wen Zhang,Di Yin,Xing Sun

Main category: cs.CL

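TL;DR: CoDiEmb is a unified framework that uses task-specialized objectives, a dynamic sampler, and a delta-guided model fusion strategy to resolve the conflict between information retrieval and semantic textual similarity tasks, improving the geometric properties of the embedding space.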

Details Motivation: To resolve the conflict between information retrieval and semantic textual similarity tasks when learning unified text embeddings. Method: CoDiEmb introduces three key innovations: task-specialized objectives paired with a dynamic sampler, a delta-guided model fusion strategy, and an efficient single-stage training pipeline. Result: The framework not only mitigates cross-task trade-offs but also improves the geometric properties of the embedding space. Conclusion: CoDiEmb effectively mitigates cross-task trade-offs and improves the geometric properties of the embedding space. Abstract: Learning unified text embeddings that excel across diverse downstream tasks is a central goal in representation learning, yet negative transfer remains a persistent obstacle. This challenge is particularly pronounced when jointly training a single encoder for Information Retrieval (IR) and Semantic Textual Similarity (STS), two essential but fundamentally disparate tasks for which naive co-training typically yields steep performance trade-offs. We argue that resolving this conflict requires systematically decoupling task-specific learning signals throughout the training pipeline. To this end, we introduce CoDiEmb, a unified framework that reconciles the divergent requirements of IR and STS in a collaborative yet distinct manner. CoDiEmb integrates three key innovations for effective joint optimization: (1) Task-specialized objectives paired with a dynamic sampler that forms single-task batches and balances per-task updates, thereby preventing gradient interference. For IR, we employ a contrastive loss with multiple positives and hard negatives, augmented by cross-device sampling. For STS, we adopt order-aware objectives that directly optimize correlation and ranking consistency. (2) A delta-guided model fusion strategy that computes fine-grained merging weights for checkpoints by analyzing each parameter's deviation from its pre-trained initialization, proving more effective than traditional Model Soups. (3) An efficient, single-stage training pipeline that is simple to implement and converges stably. Extensive experiments on 15 standard IR and STS benchmarks across three base encoders validate CoDiEmb. 
Our results and analysis demonstrate that the framework not only mitigates cross-task trade-offs but also measurably improves the geometric properties of the embedding space.
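The delta-guided fusion idea can be illustrated with a small sketch. The abstract only says that merging weights are derived from each parameter's deviation from its pre-trained initialization; the specific rule below (weight each checkpoint by the L2 norm of its delta, normalised across checkpoints, per parameter tensor) and the name `delta_guided_merge` are assumptions made for illustration, not the authors' formula.

```python
import math


def delta_guided_merge(init, checkpoints, eps=1e-12):
    """Merge checkpoints one parameter tensor at a time.

    init:        dict mapping parameter name -> list of floats (pre-trained weights)
    checkpoints: list of dicts with the same names and lengths as `init`

    Assumed rule (hypothetical): for each parameter, a checkpoint's merging
    weight is proportional to the L2 norm of its deviation from `init`.
    If no checkpoint moved the parameter, fall back to a uniform average.
    """
    merged = {}
    for name, base in init.items():
        norms = []
        for ckpt in checkpoints:
            delta = [w - b for w, b in zip(ckpt[name], base)]
            norms.append(math.sqrt(sum(d * d for d in delta)))
        total = sum(norms)
        if total < eps:
            weights = [1.0 / len(checkpoints)] * len(checkpoints)
        else:
            weights = [n / total for n in norms]
        merged[name] = [
            sum(w * ckpt[name][k] for w, ckpt in zip(weights, checkpoints))
            for k in range(len(base))
        ]
    return merged


if __name__ == "__main__":
    init = {"w": [0.0, 0.0], "b": [1.0]}
    ckpt_ir = {"w": [3.0, 4.0], "b": [1.0]}   # moved "w" away from init
    ckpt_sts = {"w": [0.0, 0.0], "b": [1.0]}  # left "w" untouched
    print(delta_guided_merge(init, [ckpt_ir, ckpt_sts]))
```

By contrast, a uniform Model Soup would average every parameter with equal weights regardless of how far each checkpoint moved it.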

[37] Reference Points in LLM Sentiment Analysis: The Role of Structured Context

Junichiro Niimi

Main category: cs.CL

TL;DR: This study shows that structured prompting (e.g., JSON format) with contextual information enhances sentiment analysis performance on small LLMs, making them viable for marketing applications on edge devices.

Details Motivation: Traditional sentiment analysis focuses only on review text, while marketing theories suggest that customer evaluations are also influenced by reference points. This study explores how structured supplementary information can improve sentiment analysis using LLMs. Method: This study compares the effectiveness of natural language (NL) and JSON-formatted prompts on a lightweight 3B parameter LLM for sentiment analysis. It evaluates performance on two Yelp categories (Restaurant and Nightlife) using metrics like Macro-F1 and RMSE. Result: The JSON prompt with additional contextual information outperformed baseline methods without fine-tuning, improving Macro-F1 by 1.6% and 4%, and reducing RMSE by 16% and 9.1% on the Restaurant and Nightlife Yelp categories, respectively. Conclusion: Structured prompting with JSON-formatted inputs can significantly enhance the performance of smaller LLMs in sentiment analysis, providing a practical alternative to deploying large-scale models, especially in resource-constrained environments. Abstract: Large language models (LLMs) are now widely used across many fields, including marketing research. Sentiment analysis, in particular, helps firms understand consumer preferences. While most NLP studies classify sentiment from review text alone, marketing theories, such as prospect theory and expectation--disconfirmation theory, point out that customer evaluations are shaped not only by the actual experience but also by additional reference points. This study therefore investigates how the content and format of such supplementary information affect sentiment analysis using LLMs. We compare natural language (NL) and JSON-formatted prompts using a lightweight 3B parameter model suitable for practical marketing applications. 
Experiments on two Yelp categories (Restaurant and Nightlife) show that the JSON prompt with additional information outperforms all baselines without fine-tuning: Macro-F1 rises by 1.6% and 4% while RMSE falls by 16% and 9.1%, respectively, making it deployable in resource-constrained edge devices. Furthermore, a follow-up analysis confirms that performance gains stem from genuine contextual reasoning rather than label proxying. This work demonstrates that structured prompting can enable smaller models to achieve competitive performance, offering a practical alternative to large-scale model deployment.
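As a rough illustration of the two prompt formats being compared, the sketch below renders the same review and supplementary context once as natural language and once as JSON. The context field names (`user_avg_stars`, `business_avg_stars`) and the task wording are hypothetical; the abstract does not list the exact reference-point fields or prompt text.

```python
import json

TASK = "Predict the star rating (1-5)."


def build_nl_prompt(review, context):
    """Render the review and its context as plain natural-language text."""
    parts = [f"{key.replace('_', ' ')} is {value}" for key, value in context.items()]
    return f"Review: {review}\nContext: " + "; ".join(parts) + f".\n{TASK}"


def build_json_prompt(review, context):
    """Render the same information as a JSON document."""
    payload = {"review": review, "context": context, "task": TASK}
    return json.dumps(payload, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Hypothetical reference points for one Yelp review.
    context = {"user_avg_stars": 3.5, "business_avg_stars": 4.2}
    review = "Food was fine, service was slow."
    print(build_nl_prompt(review, context))
    print(build_json_prompt(review, context))
```

Both prompts carry identical information, so any difference in model output comes from the format alone.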

[38] Speciesism in AI: Evaluating Discrimination Against Animals in Large Language Models

Monika Jotautaitė,Lucius Caviola,David A. Brewster,Thilo Hagendorff

Main category: cs.CL

TL;DR: The study finds that large language models show speciesist biases, often treating discrimination against animals as acceptable and prioritizing humans over animals unless cognitive capacities are equal. This highlights the need for AI fairness frameworks to include non-human moral considerations to reduce such biases.

Details Motivation: As large language models become more widely deployed, it is important to examine their ethical tendencies, particularly whether they exhibit speciesist bias—discrimination based on species membership—and how they value non-human animals. Method: The study uses three paradigms: (1) SpeciesismBench, a benchmark to assess recognition and moral evaluation of speciesist statements; (2) psychological measures comparing LLM responses to human participants; and (3) text-generation tasks examining how LLMs elaborate on or resist speciesist rationalizations. Result: LLMs reliably detected speciesist statements but rarely condemned them, often treating speciesist attitudes as morally acceptable. They expressed slightly lower explicit speciesism than humans but prioritized saving one human over multiple animals in trade-offs. When cognitive capacities were equal, LLMs showed no species preference, and they prioritized more capable animals over less capable humans. In text generation, LLMs normalized harm toward farmed animals but not non-farmed ones. Conclusion: LLMs reproduce a mixture of progressive and mainstream human views regarding speciesism, but they often normalize entrenched cultural norms around animal exploitation. This suggests the necessity of expanding AI fairness and alignment frameworks to include non-human moral patients in order to reduce such biases. Abstract: As large language models (LLMs) become more widely deployed, it is crucial to examine their ethical tendencies. Building on research on fairness and discrimination in AI, we investigate whether LLMs exhibit speciesist bias -- discrimination based on species membership -- and how they value non-human animals. 
We systematically examine this issue across three paradigms: (1) SpeciesismBench, a 1,003-item benchmark assessing recognition and moral evaluation of speciesist statements; (2) established psychological measures comparing model responses with those of human participants; (3) text-generation tasks probing elaboration on, or resistance to, speciesist rationalizations. In our benchmark, LLMs reliably detected speciesist statements but rarely condemned them, often treating speciesist attitudes as morally acceptable. On psychological measures, results were mixed: LLMs expressed slightly lower explicit speciesism than people, yet in direct trade-offs they more often chose to save one human over multiple animals. A tentative interpretation is that LLMs may weight cognitive capacity rather than species per se: when capacities were equal, they showed no species preference, and when an animal was described as more capable, they tended to prioritize it over a less capable human. In open-ended text generation tasks, LLMs frequently normalized or rationalized harm toward farmed animals while refusing to do so for non-farmed animals. These findings suggest that while LLMs reflect a mixture of progressive and mainstream human views, they nonetheless reproduce entrenched cultural norms around animal exploitation. We argue that expanding AI fairness and alignment frameworks to explicitly include non-human moral patients is essential for reducing these biases and preventing the entrenchment of speciesist attitudes in AI systems and the societies they influence.

[39] Language models align with brain regions that represent concepts across modalities

Maria Ryskina,Greta Tuckute,Alexander Fung,Ashley Malkin,Evelina Fedorenko

Main category: cs.CL

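TL;DR: This paper studies the relationship between language models and how the brain processes language and conceptual meaning, finding that language models predict brain signals better in regions with higher meaning consistency, which suggests that language models may represent cross-modal conceptual meaning.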

Details Motivation: Cognitive science, neuroscience, and today's language models all face the challenge of disentangling representations of language from representations of conceptual meaning. Method: The study examines the relationship between LM-brain alignment and two neural metrics: the level of brain activation during sentence processing and meaning consistency across input modalities. Result: Language models predict the signal better in more meaning-consistent brain regions, even when those regions are not sensitive to language processing. Conclusion: Language models may internally represent cross-modal conceptual meaning. Abstract: Cognitive science and neuroscience have long faced the challenge of disentangling representations of language from representations of conceptual meaning. As the same problem arises in today's language models (LMs), we investigate the relationship between LM--brain alignment and two neural metrics: (1) the level of brain activation during processing of sentences, targeting linguistic processing, and (2) a novel measure of meaning consistency across input modalities, which quantifies how consistently a brain region responds to the same concept across paradigms (sentence, word cloud, image) using an fMRI dataset (Pereira et al., 2018). Our experiments show that both language-only and language-vision models predict the signal better in more meaning-consistent areas of the brain, even when these areas are not strongly sensitive to language processing, suggesting that LMs might internally represent cross-modal conceptual meaning.

[40] AgentMental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment

Jinpeng Hu,Ao Wang,Qianqian Xie,Hui Ma,Zhuo Li,Dan Guo

Main category: cs.CL

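TL;DR: This paper proposes an interactive mental health assessment method based on a multi-agent framework, combining adaptive questioning with a tree-structured memory to improve information extraction and contextual tracking, and outperforms existing methods on the DAIC-WOZ dataset.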

Details Motivation: Traditional clinician-based mental health assessment is limited by a shortage of professionals, while existing automated approaches rely on static text analysis and struggle to capture the deeper information that emerges in dynamic interaction. Method: A multi-agent framework is proposed that simulates clinical doctor-patient dialogues and introduces an adaptive questioning mechanism and a tree-structured memory. Result: Experiments on the DAIC-WOZ dataset show that the method performs better than existing approaches. Conclusion: Experimental results show that the proposed multi-agent interactive mental health assessment framework outperforms existing methods on the DAIC-WOZ dataset. Abstract: Mental health assessment is crucial for early intervention and effective treatment, yet traditional clinician-based approaches are limited by the shortage of qualified professionals. Recent advances in artificial intelligence have sparked growing interest in automated psychological assessment, yet most existing approaches are constrained by their reliance on static text analysis, limiting their ability to capture deeper and more informative insights that emerge through dynamic interaction and iterative questioning. Therefore, in this paper, we propose a multi-agent framework for mental health evaluation that simulates clinical doctor-patient dialogues, with specialized agents assigned to questioning, adequacy evaluation, scoring, and updating. We introduce an adaptive questioning mechanism in which an evaluation agent assesses the adequacy of user responses to determine the necessity of generating targeted follow-up queries to address ambiguity and missing information. Additionally, we employ a tree-structured memory in which the root node encodes the user's basic information, while child nodes (e.g., topic and statement) organize key information according to distinct symptom categories and interaction turns. This memory is dynamically updated throughout the interaction to reduce redundant questioning and further enhance the information extraction and contextual tracking capabilities. Experimental results on the DAIC-WOZ dataset illustrate the effectiveness of our proposed method, which achieves better performance than existing approaches.
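The tree-structured memory described in the abstract (root node for basic user information, topic nodes per symptom category, statement nodes per interaction turn) could be organised roughly as follows. This is a minimal sketch: the class names `MemoryNode` and `TreeMemory`, the `turn-N` labels, and the `needs_question` check are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    label: str
    content: str = ""
    children: dict = field(default_factory=dict)

    def child(self, label):
        """Return the child with this label, creating it if missing."""
        if label not in self.children:
            self.children[label] = MemoryNode(label)
        return self.children[label]


class TreeMemory:
    def __init__(self, basic_info):
        # Root node encodes the user's basic information.
        self.root = MemoryNode("user", basic_info)

    def add_statement(self, topic, turn, statement):
        """Store a statement under root -> topic -> turn."""
        node = self.root.child(topic).child(f"turn-{turn}")
        node.content = statement

    def covered_topics(self):
        return sorted(self.root.children)

    def needs_question(self, topic):
        """Assumed heuristic: only ask about topics with no entry yet."""
        return topic not in self.root.children


if __name__ == "__main__":
    memory = TreeMemory("adult, self-referred")
    memory.add_statement("sleep", 1, "I wake up at 3am most nights.")
    print(memory.covered_topics())
    print(memory.needs_question("appetite"))
```

Checking the tree before generating a question is one simple way such a memory could reduce redundant questioning across turns.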

[41] Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Extreme Reasoning Efficiency in Large Language Models

Qiguang Chen,Dengyun Peng,Jinhao Liu,HuiKang Su,Jiannan Guan,Libo Qin,Wanxiang Che

Main category: cs.CL

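TL;DR: This paper proposes a dynamic framework for improving the reasoning efficiency of language models, in which the model's self-awareness adjusts reasoning depth, significantly reducing compute consumption while preserving accuracy.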

Details Motivation: Long chain-of-thought reasoning carries substantial redundancy that hurts computational efficiency, and existing methods rely on human-defined difficulty priors, which are inefficient. Method: The paper proposes the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which has three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. Result: Experiments show that DR. SAF reduces total response tokens by 49.27% with minimal accuracy loss, delivers a 6.59x gain in token efficiency and a 5x reduction in training time, and under extreme training surpasses traditional models in token efficiency with an accuracy improvement of more than 16%. Conclusion: DR. SAF significantly improves reasoning efficiency while maintaining accuracy and is suited to resource-limited settings. Abstract: Recent advancements in large language models (LLMs) have greatly improved their capabilities on complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve the efficiency, current methods often rely on human-defined difficulty priors, which do not align with the LLM's self-perceived difficulty, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables models to dynamically assess and adjust their reasoning depth in response to problem complexity. DR. SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR. SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59x gain in token efficiency and a 5x reduction in training time, making it well-suited to resource-limited settings. During extreme training, DR. SAF can even surpass traditional instruction-based models in token efficiency with more than 16% accuracy improvement.

[42] Representing Speech Through Autoregressive Prediction of Cochlear Tokens

Greta Tuckute,Klemen Kotar,Evelina Fedorenko,Daniel L. K. Yamins

Main category: cs.CL

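TL;DR: AuriStream is a biologically inspired speech encoding model that uses a two-stage framework modeled on human auditory processing; it learns effective speech representations and performs well across a range of tasks.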

Details Motivation: To develop a speech representation learning model that is more human-like and handles a range of speech tasks efficiently. Method: AuriStream encodes speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which discrete cochlear tokens are extracted. The second stage applies an autoregressive sequence model over the cochlear tokens. Result: AuriStream learns meaningful phoneme and word representations and achieves state-of-the-art lexical semantics. It is competitive on diverse downstream SUPERB speech tasks. It can also generate audio continuations that can be visualized in spectrogram space and decoded back into audio, offering insight into the model's predictions. Conclusion: AuriStream is a two-stage framework for speech representation learning that aims to advance more human-like models that efficiently handle a range of speech tasks. Abstract: We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, and state-of-the-art lexical semantics. AuriStream shows competitive performance on diverse downstream SUPERB speech tasks. Complementing AuriStream's strong representational capabilities, it generates continuations of audio which can be visualized in a spectrogram space and decoded back into audio, providing insights into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.

[43] Dataset Creation for Visual Entailment using Generative AI

Rob Reijtenbach,Suzan Verberne,Gijs Wijnholds

Main category: cs.CL

TL;DR: This study presents a new synthetic dataset for training visual entailment models and validates its effectiveness in data-scarce settings.

Details Motivation: Existing visual entailment datasets are small and sparse compared with textual entailment datasets, and creating datasets manually is labor-intensive. Method: Images are generated from the premise texts of the SNLI dataset with the Stable Diffusion model, and the generated images are evaluated by using them as training data for a visual entailment classifier based on CLIP feature vectors. Result: On SNLI-VE, synthetic training data leads to a slight drop in F-score from 0.703 to 0.686; on SICK-VTE, the F-score drops from 0.400 to 0.384. Conclusion: In data-scarce settings, synthetic datasets can be a promising solution for training visual entailment models. Abstract: In this paper we present and validate a new synthetic dataset for training visual entailment models. Existing datasets for visual entailment are small and sparse compared to datasets for textual entailment. Manually creating datasets is labor-intensive. We base our synthetic dataset on the SNLI dataset for textual entailment. We take the premise text from SNLI as input prompts in a generative image model, Stable Diffusion, creating an image to replace each textual premise. We evaluate our dataset both intrinsically and extrinsically. For extrinsic evaluation, we evaluate the validity of the generated images by using them as training data for a visual entailment classifier based on CLIP feature vectors. We find that synthetic training data only leads to a slight drop in quality on SNLI-VE, with an F-score 0.686 compared to 0.703 when trained on real data. We also compare the quality of our generated training data to original training data on another dataset: SICK-VTE. Again, there is only a slight drop in F-score: from 0.400 to 0.384. These results indicate that in settings with data sparsity, synthetic data can be a promising solution for training visual entailment models.
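The reported F-scores can be put in relative terms with a short calculation. The helper name `relative_drop` is an illustrative choice; the four scores are the ones quoted in the abstract.

```python
def relative_drop(real, synthetic):
    """Fraction of the real-data F-score lost when training on synthetic data."""
    return (real - synthetic) / real


# (F-score with real training data, F-score with synthetic training data)
scores = {"SNLI-VE": (0.703, 0.686), "SICK-VTE": (0.400, 0.384)}

for name, (real, synthetic) in scores.items():
    print(f"{name}: {100 * relative_drop(real, synthetic):.1f}% relative drop")
```

This works out to roughly a 2.4% relative drop on SNLI-VE and a 4.0% relative drop on SICK-VTE.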

[44] TinyTim: A Family of Language Models for Divergent Generation

Christopher J. Agostino

Main category: cs.CL

TL;DR: TinyTim V1, a language model trained on 'Finnegans Wake', shows unique creativity traits with high lexical diversity and low semantic coherence, making it valuable for creative problem-solving and discovery systems.

Details Motivation: To explore the potential of specialized language models in functioning as divergent knowledge sources for creativity and automated discovery. Method: Quantitative evaluation of TinyTim V1 against baseline models to assess generative profile characteristics. Result: TinyTim V1 exhibits statistically distinct high lexical diversity and low semantic coherence in its generative profile. Conclusion: TinyTim V1, a large language model fine-tuned on 'Finnegans Wake', demonstrates a unique generative profile with high lexical diversity and low semantic coherence, positioning it as a divergent knowledge source for creative architectures. Abstract: This work introduces TinyTim, a family of large language models fine-tuned on James Joyce's 'Finnegans Wake'. Through quantitative evaluation against baseline models, we demonstrate that TinyTim V1 produces a statistically distinct generative profile characterized by high lexical diversity and low semantic coherence. These findings are interpreted through theories of creativity and complex problem-solving, arguing that such specialized models can function as divergent knowledge sources within more extensive creative architectures, powering automated discovery mechanisms in diverse settings.

cs.CV

[45] Privacy Enhancement for Gaze Data Using a Noise-Infused Autoencoder

Samantha Aziz,Oleg Komogortsev

Main category: cs.CV

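TL;DR: This paper proposes a privacy-enhancing mechanism based on a latent-noise autoencoder that prevents users from being re-identified across sessions while keeping the data usable for benign tasks.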

Details Motivation: To provide a privacy-enhancing mechanism for gaze signals that protects user privacy without compromising the usability of the data. Method: The authors propose a privacy-enhancing mechanism using a latent-noise autoencoder that prevents users from being re-identified across play sessions without their consent, while retaining the usability of the data for benign tasks. Result: Across biometric identification and gaze prediction tasks, the approach significantly reduces biometric identifiability with minimal impact on utility. Conclusion: This work advances privacy in gaze-based systems by providing a usable and effective mechanism for protecting sensitive gaze data. Abstract: We present a privacy-enhancing mechanism for gaze signals using a latent-noise autoencoder that prevents users from being re-identified across play sessions without their consent, while retaining the usability of the data for benign tasks. We evaluate privacy-utility trade-offs across biometric identification and gaze prediction tasks, showing that our approach significantly reduces biometric identifiability with minimal utility degradation. Unlike prior methods in this direction, our framework retains physiologically plausible gaze patterns suitable for downstream use, which produces a favorable privacy-utility trade-off. This work advances privacy in gaze-based systems by providing a usable and effective mechanism for protecting sensitive gaze data.

[46] A Survey on Video Temporal Grounding with Multimodal Large Language Model

Jianlong Wu,Wei Liu,Ye Liu,Meng Liu,Liqiang Nie,Zhouchen Lin,Chang Wen Chen

Main category: cs.CV

TL;DR: This survey provides a comprehensive review of VTG-MLLMs, analyzing their architecture, training strategies, and spatiotemporal representation techniques.

Details Motivation: VTG-MLLMs are surpassing traditional methods and have strong generalization capabilities, yet comprehensive reviews focusing on this area are scarce. Method: A systematic review of VTG-MLLMs using a three-dimensional taxonomy: functional roles of MLLMs, training paradigms, and video feature processing techniques. Result: A detailed analysis of current research, including benchmark datasets, evaluation protocols, and empirical findings related to VTG-MLLMs. Conclusion: The survey identifies existing limitations in VTG-MLLMs and proposes promising research directions while providing additional resources in an online repository. Abstract: The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets, evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions. 
For additional resources and details, readers are encouraged to visit our repository at https://github.com/ki-lw/Awesome-MLLMs-for-Video-Temporal-Grounding.

[47] VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip

Wenqi Guo,Shan Du

Main category: cs.CV

TL;DR: Value Sign Flip (VSF) is a new method for better negative prompt guidance in image generation models, offering improved performance with minimal additional computation.

Details Motivation: To address the challenge of effectively incorporating negative prompt guidance in few-step diffusion and flow-matching models while minimizing computational overhead. Method: The Value Sign Flip (VSF) method dynamically suppresses undesired content by flipping the sign of attention values from negative prompts in image generation models. Result: VSF demonstrates superior performance in static image and video generation tasks, significantly improving negative prompt adherence compared to existing methods like CFG, NASA, and NAG. Conclusion: VSF is an efficient method that improves negative prompt adherence in few-step diffusion and flow-matching models with minimal computational overhead and broad compatibility. Abstract: We introduce Value Sign Flip (VSF), a simple and efficient method for incorporating negative prompt guidance in few-step diffusion and flow-matching image generation models. Unlike existing approaches such as classifier-free guidance (CFG), NASA, and NAG, VSF dynamically suppresses undesired content by flipping the sign of attention values from negative prompts. Our method requires only small computational overhead and integrates effectively with MMDiT-style architectures such as Stable Diffusion 3.5 Turbo, as well as cross-attention-based models like Wan. We validate VSF on challenging datasets with complex prompt pairs and demonstrate superior performance in both static image and video generation tasks. Experimental results show that VSF significantly improves negative prompt adherence compared to prior methods in few-step models, and even CFG in non-few-step models, while maintaining competitive image quality. Code and ComfyUI node are available in https://github.com/weathon/VSF/tree/main.
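The core idea in the abstract, flipping the sign of attention values that come from negative-prompt tokens, can be illustrated with a toy single-query attention in plain Python. This is only a sketch under stated assumptions, not the authors' implementation: the function names, the concatenation of positive and negative tokens into one softmax, and the `alpha` scale are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def attend_with_value_sign_flip(query, pos_keys, pos_values,
                                neg_keys, neg_values, alpha=1.0):
    """Attend over positive and negative prompt tokens together.

    Values of negative-prompt tokens are multiplied by -alpha (alpha is a
    hypothetical strength parameter), so a query that matches a negative
    token is pushed away from that token's value direction.
    """
    flipped = [[-alpha * x for x in v] for v in neg_values]
    return attend(query, pos_keys + neg_keys, pos_values + flipped)


# Toy example: the query matches the negative key, so the output has a
# negative component along the negative token's value direction.
out = attend_with_value_sign_flip(
    query=[1.0, 0.0],
    pos_keys=[[0.0, 1.0]], pos_values=[[1.0, 0.0]],
    neg_keys=[[1.0, 0.0]], neg_values=[[0.0, 1.0]],
)
```

In this toy setting no second forward pass is needed for the negative prompt, which is consistent with the abstract's claim of small computational overhead; how the real method handles scaling and architecture-specific details is not covered here.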

[48] Relative Pose Regression with Pose Auto-Encoders: Enhancing Accuracy and Data Efficiency for Retail Applications

Yoli Shavit,Yosi Keller

Main category: cs.CV

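TL;DR: This paper proposes a Relative Pose Regression (RPR) method based on camera Pose Auto-Encoders (PAEs) and uses it to refine Absolute Pose Regression (APR) predictions, improving indoor localization accuracy while reducing the data collection burden.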

Details Motivation: Accurate camera localization is crucial for modern retail environments, but existing absolute pose regression methods still face challenges in accuracy and data requirements. Method: The paper extends camera Pose Auto-Encoders (PAEs) to the task of Relative Pose Regression (RPR) and proposes a re-localization scheme that refines APR predictions without requiring additional storage of images or pose data. Result: Experiments show that PAE-based RPR improves APR localization accuracy on indoor benchmarks and remains competitive when trained with only 30% of the data. Conclusion: The method reduces the data collection burden for retail deployment while improving the accuracy and practicality of camera localization. Abstract: Accurate camera localization is crucial for modern retail environments, enabling enhanced customer experiences, streamlined inventory management, and autonomous operations. While Absolute Pose Regression (APR) from a single image offers a promising solution, approaches that incorporate visual and spatial scene priors tend to achieve higher accuracy. Camera Pose Auto-Encoders (PAEs) have recently been introduced to embed such priors into APR. In this work, we extend PAEs to the task of Relative Pose Regression (RPR) and propose a novel re-localization scheme that refines APR predictions using PAE-based RPR, without requiring additional storage of images or pose data. We first introduce PAE-based RPR and establish its effectiveness by comparing it with image-based RPR models of equivalent architectures. We then demonstrate that our refinement strategy, driven by a PAE-based RPR, enhances APR localization accuracy on indoor benchmarks. Notably, our method is shown to achieve competitive performance even when trained with only 30% of the data, substantially reducing the data collection burden for retail deployment. Our code and pre-trained models are available at: https://github.com/yolish/camera-pose-auto-encoders

[49] ViPE: Video Pose Engine for 3D Geometric Perception

Jiahui Huang,Qunjie Zhou,Hesam Rabeti,Aleksandr Korovko,Huan Ling,Xuanchi Ren,Tianchang Shen,Jun Gao,Dmitry Slepichev,Chen-Hsuan Lin,Jiawei Ren,Kevin Xie,Joydeep Biswas,Laura Leal-Taixe,Sanja Fidler

Main category: cs.CV

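TL;DR: ViPE is an efficient video processing engine that extracts accurate 3D geometric information from a wide variety of videos, provides annotations for large-scale spatial AI datasets, and is open-sourced to advance the field.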

Details Motivation: State-of-the-art methods depend on large-scale training data, yet acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. ViPE is proposed to address this problem. Method: ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos, and is robust across diverse scenarios and camera models. Result: ViPE performs well on multiple benchmarks, outperforming existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5 FPS on a single GPU at standard input resolutions. Conclusion: ViPE is a powerful and versatile video processing engine that provides accurate camera pose and dense depth map annotations for 3D geometric perception, helping to accelerate the development of spatial AI systems. Abstract: Accurate 3D geometric perception is an important prerequisite for a wide range of spatial AI systems. While state-of-the-art methods depend on large-scale training data, acquiring consistent and precise 3D annotations from in-the-wild videos remains a key challenge. In this work, we introduce ViPE, a handy and versatile video processing engine designed to bridge this gap. ViPE efficiently estimates camera intrinsics, camera motion, and dense, near-metric depth maps from unconstrained raw videos. It is robust to diverse scenarios, including dynamic selfie videos, cinematic shots, or dashcams, and supports various camera models such as pinhole, wide-angle, and 360° panoramas. We have benchmarked ViPE on multiple benchmarks. Notably, it outperforms existing uncalibrated pose estimation baselines by 18%/50% on TUM/KITTI sequences, and runs at 3-5 FPS on a single GPU for standard input resolutions. We use ViPE to annotate a large-scale collection of videos. This collection includes around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames -- all annotated with accurate camera poses and dense depth maps. We open-source ViPE and the annotated dataset with the hope of accelerating the development of spatial AI systems.

[50] HQ-OV3D: A High Box Quality Open-World 3D Detection Framework based on Diffusion Model

Qi Liu,Yabei Li,Hongsong Wang,Lei He

Main category: cs.CV

TL;DR: The paper proposes a High Box Quality Open-Vocabulary 3D Detection (HQ-OV3D) framework that improves the geometric quality of pseudo-labels in open-vocabulary 3D detection, achieving a 7.37% improvement in mAP on novel classes compared to the state-of-the-art method.

Details Motivation: Traditional closed-set 3D detection frameworks fail to meet the demands of open-world applications like autonomous driving. Existing open-vocabulary 3D detection methods typically adopt a two-stage pipeline consisting of pseudo-label generation followed by semantic alignment. While vision-language models (VLMs) recently have dramatically improved the semantic accuracy of pseudo-labels, their geometric quality, particularly bounding box precision, remains commonly neglected. Method: The framework comprises two key components: an Intra-Modality Cross-Validated (IMCV) Proposal Generator that utilizes cross-modality geometric consistency to generate high-quality initial 3D proposals, and an Annotated-Class Assisted (ACA) Denoiser that progressively refines 3D proposals by leveraging geometric priors from annotated categories through a DDIM-based denoising mechanism. Result: Compared to the state-of-the-art method, training with pseudo-labels generated by our approach achieves a 7.37% improvement in mAP on novel classes, demonstrating the superior quality of the pseudo-labels produced by our framework. Conclusion: HQ-OV3D can serve not only as a strong standalone open-vocabulary 3D detector but also as a plug-in high-quality pseudo-label generator for existing open-vocabulary detection or annotation pipelines. Abstract: Traditional closed-set 3D detection frameworks fail to meet the demands of open-world applications like autonomous driving. Existing open-vocabulary 3D detection methods typically adopt a two-stage pipeline consisting of pseudo-label generation followed by semantic alignment. 
While vision-language models (VLMs) have recently dramatically improved the semantic accuracy of pseudo-labels, their geometric quality, particularly bounding box precision, remains commonly neglected. To address this issue, we propose a High Box Quality Open-Vocabulary 3D Detection (HQ-OV3D) framework, dedicated to generating and refining high-quality pseudo-labels for open-vocabulary classes. The framework comprises two key components: an Intra-Modality Cross-Validated (IMCV) Proposal Generator that utilizes cross-modality geometric consistency to generate high-quality initial 3D proposals, and an Annotated-Class Assisted (ACA) Denoiser that progressively refines 3D proposals by leveraging geometric priors from annotated categories through a DDIM-based denoising mechanism. Compared to the state-of-the-art method, training with pseudo-labels generated by our approach achieves a 7.37% improvement in mAP on novel classes, demonstrating the superior quality of the pseudo-labels produced by our framework. HQ-OV3D can serve not only as a strong standalone open-vocabulary 3D detector but also as a plug-in high-quality pseudo-label generator for existing open-vocabulary detection or annotation pipelines.

[51] Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction

Cheng Chen,Hao Huang,Saurabh Bagchi

Main category: cs.CV

TL;DR: This paper proposes a collaborative 3D semantic occupancy prediction method using sparse 3D semantic Gaussian splatting that improves performance while reducing communication costs.

Details Motivation: Existing vision-only methods for 3D semantic occupancy prediction face challenges such as high communication costs, reliance on depth supervision, or limitations in collaborative scenarios. This work addresses these challenges. Method: The method uses sparse 3D semantic Gaussian splatting for collaborative 3D semantic occupancy prediction, enabling neighborhood-based cross-agent fusion and joint encoding of geometry and semantics. Result: The proposed approach outperforms single-agent perception and baseline collaborative methods by +8.42 and +3.28 points in mIoU, and +5.11 and +22.41 points in IoU, respectively. It also achieves a +1.9 improvement in mIoU using only 34.6% communication volume. Conclusion: The proposed method demonstrates robust performance in collaborative 3D semantic occupancy prediction, even under limited communication budgets. Abstract: Collaborative perception enables connected vehicles to share information, overcoming occlusions and extending the limited sensing range inherent in single-agent (non-collaborative) systems. Existing vision-only methods for 3D semantic occupancy prediction commonly rely on dense 3D voxels, which incur high communication costs, or 2D planar features, which require accurate depth estimation or additional supervision, limiting their applicability to collaborative scenarios. To address these challenges, we propose the first approach leveraging sparse 3D semantic Gaussian splatting for collaborative 3D semantic occupancy prediction. By sharing and fusing intermediate Gaussian primitives, our method provides three benefits: a neighborhood-based cross-agent fusion that removes duplicates and suppresses noisy or inconsistent Gaussians; a joint encoding of geometry and semantics in each primitive, which reduces reliance on depth supervision and allows simple rigid alignment; and sparse, object-centric messages that preserve structural information while reducing communication volume. 
Extensive experiments demonstrate that our approach outperforms single-agent perception and baseline collaborative methods by +8.42 and +3.28 points in mIoU, and +5.11 and +22.41 points in IoU, respectively. When further reducing the number of transmitted Gaussians, our method still achieves a +1.9 improvement in mIoU, using only 34.6% communication volume, highlighting robust performance under limited communication budgets.
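For readers less familiar with the metrics quoted above, the following sketch shows how per-class IoU and mIoU are commonly computed over voxel labels. It is a generic illustration, not the paper's evaluation code: the flat label lists and the choice of which class ids to average over are assumptions.

```python
def class_iou(pred, target, cls):
    """Intersection-over-union for one class over paired voxel labels.

    Returns None when the class appears in neither prediction nor target,
    so that it can be skipped when averaging.
    """
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else None


def mean_iou(pred, target, classes):
    """Mean of per-class IoU over the classes that actually occur."""
    ious = [class_iou(pred, target, c) for c in classes]
    ious = [x for x in ious if x is not None]
    if not ious:
        raise ValueError("none of the requested classes occur")
    return sum(ious) / len(ious)


# Toy voxel labels (assumed convention: 0 = free space, 1 and 2 = semantic classes).
pred = [0, 1, 1, 2]
target = [0, 1, 2, 2]
miou = mean_iou(pred, target, classes=[1, 2])  # each class has IoU 1/2
```

A gain reported in "points" of mIoU, as in the abstract, is a difference between two such averages expressed in percentage points.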

[52] Personalized Face Super-Resolution with Identity Decoupling and Fitting

Jiarui Yang,Hang Guo,Wen Huang,Tao Dai,Shutao Xia

Main category: cs.CV

TL;DR: This paper proposes a new FSR method (IDFSR) that significantly improves identity consistency and perceptual quality under extreme degradation.

Details Motivation: In extreme degradation scenarios, conventional models struggle to reconstruct realistic and identity-consistent faces, and existing methods tend to generate hallucinated faces. Method: A new FSR method is proposed that combines facial region masking, reference image warping, and identity embeddings; a diffusion-based model is pretrained, and the identity embedding is then fine-tuned in a lightweight manner. Result: IDFSR substantially outperforms existing methods under extreme degradation, particularly in identity consistency. Conclusion: IDFSR significantly improves identity consistency and perceptual quality under extreme degradation, with its largest advantage in identity consistency. Abstract: In recent years, face super-resolution (FSR) methods have achieved remarkable progress, generally maintaining high image fidelity and identity (ID) consistency under standard settings. However, in extreme degradation scenarios (e.g., scale > 8×), critical attributes and ID information are often severely lost in the input image, making it difficult for conventional models to reconstruct realistic and ID-consistent faces. Existing methods tend to generate hallucinated faces under such conditions, producing restored images lacking authentic ID constraints. To address this challenge, we propose a novel FSR method with Identity Decoupling and Fitting (IDFSR), designed to enhance ID restoration under large scaling factors while mitigating hallucination effects. Our approach involves three key designs: 1) Masking the facial region in the low-resolution (LR) image to eliminate unreliable ID cues; 2) Warping a reference image to align with the LR input, providing style guidance; 3) Leveraging ID embeddings extracted from ground truth (GT) images for fine-grained ID modeling and personalized adaptation. We first pretrain a diffusion-based model to explicitly decouple style and ID by forcing it to reconstruct masked LR face regions using both style and identity embeddings. Subsequently, we freeze most network parameters and perform lightweight fine-tuning of the ID embedding using a small set of target ID images. This embedding encodes fine-grained facial attributes and precise ID information, significantly improving both ID consistency and perceptual quality. 
Extensive quantitative evaluations and visual comparisons demonstrate that the proposed IDFSR substantially outperforms existing approaches under extreme degradation, particularly achieving superior performance on ID consistency.

[53] Deep Learning for Automated Identification of Vietnamese Timber Species: A Tool for Ecological Monitoring and Conservation

Tianyu Song,Van-Doan Duong,Thi-Phuong Le,Ton Viet Ta

Main category: cs.CV

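TL;DR: This study uses deep learning to automate the classification of ten wood species commonly found in Vietnam, building a custom image dataset and evaluating five convolutional neural network models. ShuffleNetV2 achieves high-accuracy classification in resource-constrained settings, offering a scalable solution for ecological informatics.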

Details Motivation: Accurate identification of wood species plays a critical role in ecological monitoring, biodiversity conservation, and sustainable forest management, while traditional classification based on macroscopic and microscopic inspection is labor-intensive and requires expert knowledge. Method: A custom image dataset was constructed from field-collected wood samples, and the classification performance of five state-of-the-art convolutional neural network architectures (ResNet50, EfficientNet, MobileViT, MobileNetV3, and ShuffleNetV2) was evaluated. Result: ShuffleNetV2 achieved the best balance between classification performance and computational efficiency, with an average accuracy of 99.29% and an F1-score of 99.35% over 20 independent runs. Conclusion: The study shows that lightweight deep learning models have the potential to deliver real-time, high-accuracy wood classification in resource-constrained environments, providing scalable, image-based solutions for automated wood classification and forest biodiversity assessment in ecological informatics. Abstract: Accurate identification of wood species plays a critical role in ecological monitoring, biodiversity conservation, and sustainable forest management. Traditional classification approaches relying on macroscopic and microscopic inspection are labor-intensive and require expert knowledge. In this study, we explore the application of deep learning to automate the classification of ten wood species commonly found in Vietnam. A custom image dataset was constructed from field-collected wood samples, and five state-of-the-art convolutional neural network architectures--ResNet50, EfficientNet, MobileViT, MobileNetV3, and ShuffleNetV2--were evaluated. Among these, ShuffleNetV2 achieved the best balance between classification performance and computational efficiency, with an average accuracy of 99.29% and F1-score of 99.35% over 20 independent runs. These results demonstrate the potential of lightweight deep learning models for real-time, high-accuracy species identification in resource-constrained environments. Our work contributes to the growing field of ecological informatics by providing scalable, image-based solutions for automated wood classification and forest biodiversity assessment.

[54] NIRMAL Pooling: An Adaptive Max Pooling Approach with Non-linear Activation for Enhanced Image Classification

Nirmal Gaud,Krishna Kumar Jha,Jhimli Adhikari,Adhini Nasarin P S,Joydeep Das,Samarth S Deshpande,Nitasha Barara,Vaduguru Venkata Ramya,Santu Saha,Mehmet Tarik Baran,Sarangi Venkateshwarlu,Anusha M D,Surej Mouli,Preeti Katiyar,Vipin Kumar Chaudhary

Main category: cs.CV

TL;DR: NIRMAL Pooling combines adaptive max pooling with ReLU activation to improve CNN accuracy, particularly for complex image datasets.

Details Motivation: The motivation is to enhance robustness and feature expressiveness in CNNs by combining adaptive max pooling with non-linear activation functions. Method: NIRMAL Pooling integrates adaptive max pooling with a non-linear activation function (ReLU) post-pooling, dynamically adjusting pooling parameters based on desired output dimensions. Result: NIRMAL Pooling outperforms standard Max Pooling on three benchmark datasets: MNIST Digits (99.25% vs. 99.12%), MNIST Fashion (91.59% vs. 91.44%), and CIFAR-10 (70.49% vs. 68.87%). Conclusion: NIRMAL Pooling is a promising alternative to traditional pooling methods in CNNs, showing consistent performance improvements, especially on complex datasets. Abstract: This paper presents NIRMAL Pooling, a novel pooling layer for Convolutional Neural Networks (CNNs) that integrates adaptive max pooling with non-linear activation function for image classification tasks. The acronym NIRMAL stands for Non-linear Activation, Intermediate Aggregation, Reduction, Maximum, Adaptive, and Localized. By dynamically adjusting pooling parameters based on desired output dimensions and applying a Rectified Linear Unit (ReLU) activation post-pooling, NIRMAL Pooling improves robustness and feature expressiveness. We evaluated its performance against standard Max Pooling on three benchmark datasets: MNIST Digits, MNIST Fashion, and CIFAR-10. NIRMAL Pooling achieves test accuracies of 99.25% (vs. 99.12% for Max Pooling) on MNIST Digits, 91.59% (vs. 91.44%) on MNIST Fashion, and 70.49% (vs. 68.87%) on CIFAR-10, demonstrating consistent improvements, particularly on complex datasets. This work highlights the potential of NIRMAL Pooling to enhance CNN performance in diverse image recognition tasks, offering a flexible and reliable alternative to traditional pooling methods.
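The two ingredients named in the abstract, adaptive max pooling to a requested output size followed by ReLU, can be sketched in dependency-free Python. This is an illustrative reading of the description, not the authors' layer: the window boundaries follow the floor/ceil convention commonly used for adaptive pooling, which is an assumption here, and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive b."""
    return -(-a // b)


def adaptive_max_pool_2d(feature_map, out_h, out_w):
    """Max-pool a 2D list so that the result has shape (out_h, out_w).

    Window i covers rows floor(i*H/out_h) .. ceil((i+1)*H/out_h) - 1,
    and likewise for columns, so windows adapt to the input size.
    """
    in_h, in_w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(out_h):
        r0, r1 = (i * in_h) // out_h, ceil_div((i + 1) * in_h, out_h)
        row = []
        for j in range(out_w):
            c0, c1 = (j * in_w) // out_w, ceil_div((j + 1) * in_w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


def nirmal_pool_sketch(feature_map, out_h, out_w):
    """Adaptive max pooling followed by ReLU applied after pooling."""
    pooled = adaptive_max_pool_2d(feature_map, out_h, out_w)
    return [[max(0.0, v) for v in row] for row in pooled]


fmap = [
    [-1, -2, 3, 4],
    [-5, -6, 7, 8],
    [-9, -10, -11, -12],
    [-13, -14, -15, -16],
]
result = nirmal_pool_sketch(fmap, 2, 2)  # windows whose maximum is negative become 0
```

The ReLU only changes the output when a window's maximum is negative, for example when pooling is applied to pre-activation feature maps.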

[55] Topological Structure Description for Artcode Detection Using the Shape of Orientation Histogram

Liming Xu,Dave Towey,Andrew P. French,Steve Benford

Main category: cs.CV

TL;DR: This paper introduces a new feature descriptor for detecting Artcodes—decorative, machine-readable markers that blend into the environment—showing promising results for future applications in augmented reality.

Details Motivation: With the increasing use of smartphones and VR/AR, there is a need to detect objects such as Artcodes, which are both human-meaningful and machine-readable, in everyday environments. Method: The authors propose a new feature descriptor called the shape of orientation histogram to describe the generic topological structure of Artcodes and evaluate its performance through experiments on collected datasets. Result: The proposed feature descriptor effectively represents topological structures and achieves good performance in detecting Artcode proposals. Conclusion: The work demonstrates the feasibility and effectiveness of a new feature vector for detecting Artcode proposals, opening up new interaction opportunities and potential applications. Abstract: With the increasing ubiquity of smartphones and the resurgence of VR/AR techniques, it is expected that our everyday environment may soon be decorated with objects connected to virtual elements. Alerting users to the presence of these objects is therefore the first step in motivating further inspection and triggering the digital material attached to the objects. This work studies a special kind of these objects -- Artcodes -- human-meaningful and machine-readable decorative markers that camouflage themselves with a freeform appearance by encoding information into their topology. We formulate the problem of recognising the presence of Artcodes as Artcode proposal detection, a distinct computer vision task that classifies topologically similar but geometrically and semantically different objects as the same class. To deal with this problem, we propose a new feature descriptor, called the shape of orientation histogram, to describe the generic topological structure of an Artcode. We collect datasets and conduct comprehensive experiments to evaluate the performance of the Artcode detection proposer built upon this new feature vector. 
Our experimental results show the feasibility of the proposed feature vector for representing topological structures and the effectiveness of the system for detecting Artcode proposals. Although this work is an initial attempt to develop a feature-based system for detecting topological objects like Artcodes, it would open up new interaction opportunities and spark potential applications of topological object detection.

[56] Analysis of the Compaction Behavior of Textile Reinforcements in Low-Resolution In-Situ CT Scans via Machine-Learning and Descriptor-Based Methods

Christian Düreth,Jan Condé-Wolter,Marek Danczak,Karsten Tittmann,Jörn Jaschinski,Andreas Hornig,Maik Gude

Main category: cs.CV

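TL;DR: This study combines low-resolution CT with a deep learning model to quantify nesting behavior in textile composites, providing a new approach to the structural analysis and modeling of composites.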

Details Motivation: Predictive modeling of textile composites requires a detailed understanding of material structure across multiple scales, in particular of how nesting, the interlocking of adjacent fabric layers through local interpenetration and misalignment of yarns, affects mechanical properties. Method: Low-resolution computed tomography (CT) and a tailored 3D-UNet are used for semantic segmentation of dry textile reinforcements in different stacking configurations, and the spatial structure is analyzed with the two-point correlation function $S_2$. Result: The model achieved a minimum mean Intersection-over-Union of 0.822 and an F1 score of 0.902, and the extracted average layer thickness and nesting degree agree closely with micrograph-based validation. Conclusion: Using low-resolution CT scans and a 3D-UNet, the study quantifies the nesting of fabric layers in textile composites and lays a foundation for structural analysis of industrially relevant composites. Abstract: A detailed understanding of material structure across multiple scales is essential for predictive modeling of textile-reinforced composites. Nesting -- characterized by the interlocking of adjacent fabric layers through local interpenetration and misalignment of yarns -- plays a critical role in defining mechanical properties such as stiffness, permeability, and damage tolerance. This study presents a framework to quantify nesting behavior in dry textile reinforcements under compaction using low-resolution computed tomography (CT). In-situ compaction experiments were conducted on various stacking configurations, with CT scans acquired at 20.22 µm per voxel resolution. A tailored 3D-UNet enabled semantic segmentation of matrix, weft, and fill phases across compaction stages corresponding to fiber volume contents of 50-60%. The model achieved a minimum mean Intersection-over-Union of 0.822 and an F1 score of 0.902. Spatial structure was subsequently analyzed using the two-point correlation function $S_2$, allowing for probabilistic extraction of average layer thickness and nesting degree. The results show strong agreement with micrograph-based validation. This methodology provides a robust approach for extracting key geometrical features from industrially relevant CT data and establishes a foundation for reverse modeling and descriptor-based structural analysis of composite preforms.
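To make the two-point correlation function concrete, the sketch below estimates $S_2(r)$ for a one-dimensional binary phase profile, for example one segmented phase sampled along the thickness direction. It is a simplified illustration under assumptions: the paper works on 3D segmentations and its exact estimator is not given here, and the non-periodic pair counting and the function name are choices made for this example.

```python
def two_point_correlation(phase, max_lag):
    """Estimate S2(r) for r = 0..max_lag on a 1D binary phase indicator.

    S2(r) is the fraction of point pairs separated by r voxels in which
    both points belong to the phase; S2(0) equals the volume fraction.
    """
    n = len(phase)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(phase)")
    s2 = []
    for r in range(max_lag + 1):
        pairs = n - r  # non-periodic: only pairs that fit inside the sample
        hits = sum(1 for i in range(pairs) if phase[i] and phase[i + r])
        s2.append(hits / pairs)
    return s2


# Idealised layered profile: 2 voxels of yarn, 2 voxels of matrix, repeated.
profile = [1, 1, 0, 0] * 4
s2 = two_point_correlation(profile, 4)
# s2[0] is the yarn volume fraction; s2 drops to 0 at lag 2 and
# recovers at lag 4, the layer period of this toy profile.
```

In this idealised case the lag at which $S_2$ recovers its first peak gives the layer period, which is the intuition behind reading an average layer thickness from the correlation function.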

[57] iWatchRoad: Scalable Detection and Geospatial Visualization of Potholes for Smart Cities

Rishi Raj Sahoo,Surbhi Saswati Mohanty,Subhankar Mishra

Main category: cs.CV

TL;DR: iWatchRoad is an automated, end-to-end system for real-time pothole detection, GPS tagging, and mapping using a customized YOLO model and OCR, designed for efficient urban and rural road maintenance in India.

Details Motivation: Potholes are a major road safety hazard and maintenance challenge, especially in India. This work aims to provide an automated, affordable, and accurate solution for pothole detection and mapping to aid road maintenance planning. Method: The paper describes an end-to-end system using a fine-tuned YOLO model for pothole detection, a custom OCR module for timestamp extraction, GPS tagging for location data, and OpenStreetMap for real-time visualization. Result: The system achieves accurate pothole detection under challenging conditions and provides real-time GPS-tagged mapping through a user-friendly web interface, suitable for government use in road assessment. Conclusion: iWatchRoad is a cost-effective, scalable, and automated system for pothole detection and mapping, suitable for urban and rural road management in developing regions. Abstract: Potholes on the roads are a serious hazard and maintenance burden. They pose a significant threat to road safety and vehicle longevity, especially on the diverse and under-maintained roads of India. In this paper, we present a complete end-to-end system called iWatchRoad for automated pothole detection, Global Positioning System (GPS) tagging, and real-time mapping using OpenStreetMap (OSM). We curated a large, self-annotated dataset of over 7,000 frames captured across various road types, lighting conditions, and weather scenarios unique to Indian environments, leveraging dashcam footage. This dataset is used to fine-tune an Ultralytics You Only Look Once (YOLO) model to perform real-time pothole detection, while a custom Optical Character Recognition (OCR) module is employed to extract timestamps directly from video frames. The timestamps are synchronized with GPS logs to geotag each detected pothole accurately. The processed data, including the potholes' details and frames as metadata, is stored in a database and visualized via a user-friendly web interface using OSM. 
iWatchRoad not only improves detection accuracy under challenging conditions but also provides government-compatible outputs for road assessment and maintenance planning through the metadata visible on the website. Our solution is cost-effective, hardware-efficient, scalable, and automated, offering a practical tool for urban and rural road management in developing regions. iWatchRoad is available at https://smlab.niser.ac.in/project/iwatchroad
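The geotagging step described in the abstract, matching frame timestamps to GPS logs, can be sketched as a nearest-timestamp lookup. This is a minimal illustration, not the iWatchRoad code: the record layouts, function names, and coordinates below are hypothetical, and a real system might interpolate between fixes instead of taking the nearest one.

```python
from bisect import bisect_left


def nearest_fix(gps_log, t):
    """Return the GPS record whose timestamp is closest to t.

    gps_log is a non-empty list of (timestamp_s, lat, lon) tuples sorted
    by timestamp (an assumed layout for this sketch).
    """
    if not gps_log:
        raise ValueError("gps_log is empty")
    times = [entry[0] for entry in gps_log]
    i = bisect_left(times, t)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = gps_log[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda entry: abs(entry[0] - t))


def geotag(detections, gps_log):
    """Attach the nearest GPS fix to each (timestamp_s, label) detection."""
    tagged = []
    for t, label in detections:
        _, lat, lon = nearest_fix(gps_log, t)
        tagged.append({"time": t, "label": label, "lat": lat, "lon": lon})
    return tagged


# Made-up values for illustration only.
log = [(0.0, 20.00, 85.00), (1.0, 20.10, 85.10), (2.0, 20.20, 85.20)]
tagged = geotag([(1.4, "pothole"), (5.0, "pothole")], log)
```

Building the `times` list on every call keeps the sketch short; for long logs it would be computed once and reused.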

[58] IPG: Incremental Patch Generation for Generalized Adversarial Patch Training

Wonho Lee,Hyunsik Na,Jisu Lee,Daeseon Choi

Main category: cs.CV

TL;DR: This paper proposes Incremental Patch Generation (IPG), a highly efficient method for creating adversarial patches that effectively target AI model vulnerabilities, with potential applications in real-world AI security scenarios.

Details Motivation: The motivation stems from the challenge adversarial patches pose to AI model robustness, particularly in critical domains like computer vision. Traditional adversarial examples are less targeted, prompting the need for a more efficient and specific method for generating adversarial patches. Method: The paper introduces Incremental Patch Generation (IPG), a method designed to generate adversarial patches more efficiently than existing approaches. The method was evaluated through experiments and ablation studies, including YOLO's feature distribution visualization and adversarial training results. Result: The IPG method generates adversarial patches up to 11.1 times more efficiently than existing methods while maintaining similar attack performance. The generated patches are shown to generalize well and cover a broader range of model vulnerabilities. Additionally, IPG-generated datasets contribute to constructing robust AI models. Conclusion: The paper concludes that IPG is a highly efficient method for generating adversarial patches, offering broader coverage of model vulnerabilities and serving as a foundation for building robust AI models. It highlights the method's potential in various real-world applications, such as autonomous vehicles, security systems, and medical imaging. Abstract: The advent of adversarial patches poses a significant challenge to the robustness of AI models, particularly in the domain of computer vision tasks such as object detection. In contradistinction to traditional adversarial examples, these patches target specific regions of an image, resulting in the malfunction of AI models. This paper proposes Incremental Patch Generation (IPG), a method that generates adversarial patches up to 11.1 times more efficiently than existing approaches while maintaining comparable attack performance. 
The efficacy of IPG is demonstrated by experiments and ablation studies including YOLO's feature distribution visualization and adversarial training results, which show that it produces well-generalized patches that effectively cover a broader range of model vulnerabilities. Furthermore, IPG-generated datasets can serve as a robust knowledge foundation for constructing a robust model, enabling structured representation, advanced reasoning, and proactive defenses in AI security ecosystems. The findings of this study suggest that IPG has considerable potential for future utilization not only in adversarial patch defense but also in real-world applications such as autonomous vehicles, security systems, and medical imaging, where AI models must remain resilient to adversarial attacks in dynamic and high-stakes environments.

[59] MedAtlas: Evaluating LLMs for Multi-Round, Multi-Task Medical Reasoning Across Diverse Imaging Modalities and Clinical Text

Ronghao Xu,Zhen Huang,Yangbo Wei,Xiaoqian Zhou,Zikang Xu,Ting Liu,Zihang Jiang,S. Kevin Zhou

Main category: cs.CV

TL;DR: MedAtlas is a new benchmark framework for evaluating large language models on realistic medical reasoning tasks.

Details Motivation: Existing medical multi-modal benchmarks are typically limited to single-image, single-turn tasks, lack multi-modal medical image integration, and fail to capture the longitudinal and multi-modal interactive nature inherent to clinical practice. Method: The paper introduces MedAtlas, a new benchmark framework that supports multi-turn dialogue, multi-modal medical image interaction, multi-task integration, and high clinical fidelity. Result: Benchmark results with existing multi-modal models reveal substantial performance gaps in multi-stage clinical reasoning. Conclusion: MedAtlas is a challenging evaluation platform intended to advance the development of robust and trustworthy medical AI. Abstract: Artificial intelligence has demonstrated significant potential in clinical decision-making; however, developing models capable of adapting to diverse real-world scenarios and performing complex diagnostic reasoning remains a major challenge. Existing medical multi-modal benchmarks are typically limited to single-image, single-turn tasks, lacking multi-modal medical image integration and failing to capture the longitudinal and multi-modal interactive nature inherent to clinical practice. To address this gap, we introduce MedAtlas, a novel benchmark framework designed to evaluate large language models on realistic medical reasoning tasks. MedAtlas is characterized by four key features: multi-turn dialogue, multi-modal medical image interaction, multi-task integration, and high clinical fidelity. It supports four core tasks: open-ended multi-turn question answering, closed-ended multi-turn question answering, multi-image joint reasoning, and comprehensive disease diagnosis. Each case is derived from real diagnostic workflows and incorporates temporal interactions between textual medical histories and multiple imaging modalities, including CT, MRI, PET, ultrasound, and X-ray, requiring models to perform deep integrative reasoning across images and clinical texts. MedAtlas provides expert-annotated gold standards for all tasks. Furthermore, we propose two novel evaluation metrics: Round Chain Accuracy and Error Propagation Resistance. Benchmark results with existing multi-modal models reveal substantial performance gaps in multi-stage clinical reasoning. MedAtlas establishes a challenging evaluation platform to advance the development of robust and trustworthy medical AI.

[60] From Promise to Practical Reality: Transforming Diffusion MRI Analysis with Fast Deep Learning Enhancement

Xinyi Wang,Michael Barnett,Frederique Boonstra,Yael Barnett,Mariano Cabezas,Arkiev D'Souza,Matthew C. Kiernan,Kain Kyle,Meng Law,Lynette Masters,Zihao Tang,Stephen Tisch,Sicong Tu,Anneke Van Der Walt,Dongang Wang,Fernando Calamante,Weidong Cai,Chenyu Wang

Main category: cs.CV

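TL;DR: FastFOD-Net is an accelerated end-to-end deep learning framework for enhancing fiber orientation distributions (FODs), with broad potential for application in clinical neuroscience.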

Details Motivation: Existing FOD enhancement methods have been assessed mainly on healthy subjects, which limits their use in clinical settings. This work validates FastFOD-Net across healthy controls and six neurological disorders to promote its clinical adoption. Method: FastFOD-Net is a deep learning enhancement framework that efficiently improves FOD quality, with training and inference 60 times faster than its predecessor. Result: FastFOD-Net substantially improves the accuracy of diffusion MRI analysis, reduces measurement error, lowers sample size requirements, and performs well in disease differentiation and in the interpretability of connectome applications. Conclusion: FastFOD-Net provides a new tool for robust analysis of diffusion MRI data, bringing the quality of clinical data analysis close to that of high-quality research acquisitions and promoting the clinical adoption of deep learning based diffusion MRI enhancement. Abstract: Fiber orientation distribution (FOD) is an advanced diffusion MRI modeling technique that represents complex white matter fiber configurations, and a key step for subsequent brain tractography and connectome analysis. Its reliability and accuracy, however, heavily rely on the quality of the MRI acquisition and the subsequent estimation of the FODs at each voxel. Generating reliable FODs from widely available clinical protocols with single-shell and low-angular-resolution acquisitions remains challenging but could potentially be addressed with recent advances in deep learning-based enhancement techniques. Despite advancements, existing methods have predominantly been assessed on healthy subjects, which has proved to be a major hurdle for their clinical adoption. In this work, we validate a newly optimized enhancement framework, FastFOD-Net, across healthy controls and six neurological disorders. This accelerated end-to-end deep learning framework enhances FODs with superior performance and delivers training/inference efficiency suitable for clinical use ($60\times$ faster than its predecessor). With the most comprehensive clinical evaluation to date, our work demonstrates the potential of FastFOD-Net in accelerating clinical neuroscience research, empowering diffusion MRI analysis for disease differentiation, improving interpretability in connectome applications, and reducing measurement errors to lower sample size requirements. Critically, this work will facilitate the more widespread adoption of, and build clinical trust in, deep learning based methods for diffusion MRI enhancement.
Specifically, FastFOD-Net enables robust analysis of real-world, clinical diffusion MRI data, comparable to that achievable with high-quality research acquisitions.

[61] Empowering Multimodal LLMs with External Tools: A Comprehensive Survey

Wenbin An,Jiahao Nie,Yaqiang Wu,Feng Tian,Shijian Lu,Qinghua Zheng

Main category: cs.CV

TL;DR: This paper surveys how external tools can enhance Multimodal Large Language Models (MLLMs) by improving data quality, model performance, and evaluation protocols, while identifying current limitations and future directions.

Details Motivation: The motivation stems from the current limitations of MLLMs, including poor data quality, suboptimal performance on complex tasks, and inadequate evaluation protocols. The study explores augmenting MLLMs with external tools as a promising strategy to overcome these challenges, inspired by human use of tools for problem-solving. Method: The paper presents a comprehensive survey analyzing the use of external tools across four key dimensions: data acquisition and annotation, improvement of MLLM performance on complex tasks, evaluation of MLLMs, and exploration of current limitations and future directions. Result: The survey highlights how external tools such as APIs, expert models, and knowledge bases can improve MLLM capabilities across data quality, task performance, and evaluation comprehensiveness, while also identifying current limitations and future research directions. Conclusion: The paper concludes that leveraging external tools can significantly enhance the performance and evaluation of Multimodal Large Language Models (MLLMs), offering a promising pathway for their future development and application. Abstract: By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols continue to hinder the reliability and broader applicability of MLLMs across diverse domains. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools (e.g., APIs, expert models, and knowledge bases) offers a promising strategy to overcome these challenges. 
In this paper, we present a comprehensive survey on leveraging external tools to enhance MLLM performance. Our discussion is structured along four key dimensions about external tools: (1) how they can facilitate the acquisition and annotation of high-quality multimodal data; (2) how they can assist in improving MLLM performance on challenging downstream tasks; (3) how they enable comprehensive and accurate evaluation of MLLMs; (4) the current limitations and future directions of tool-augmented MLLMs. Through this survey, we aim to underscore the transformative potential of external tools in advancing MLLM capabilities, offering a forward-looking perspective on their development and applications. The project page of this paper is publicly available at https://github.com/Lackel/Awesome-Tools-for-MLLMs.

[62] ORBIT: An Object Property Reasoning Benchmark for Visual Inference Tasks

Abhishek Kolari,Mohammadhossein Khojasteh,Yifan Jiang,Floris den Hengst,Filip Ilievski

Main category: cs.CV

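TL;DR: This paper introduces ORBIT, a new multi-level reasoning visual question answering benchmark for evaluating vision-language models on object property reasoning; results show that existing models perform poorly on this task.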

Details Motivation: Although vision-language models have made remarkable progress on many visual question answering benchmarks, it remains unclear whether they abstract and reason over the depicted objects. Inspired by human object categorisation, object property reasoning involves identifying and understanding low-level details and higher-level abstractions. Method: The authors develop a systematic evaluation framework with three representative image types, three reasoning levels of increasing complexity, and four object property dimensions grounded in commonsense reasoning, and instantiate it as ORBIT, a multi-level reasoning VQA benchmark comprising 360 images and 1,080 count-based questions. Result: Experiments show that state-of-the-art vision-language models are significantly limited compared to humans in zero-shot settings, with the best-performing model reaching only 40% accuracy. The models struggle particularly with realistic (photographic) images, counterfactual reasoning about physical and functional properties, and higher counts. Conclusion: ORBIT reveals significant limitations of existing vision-language models in object property reasoning and points to the need to develop scalable benchmarking methods, generalize annotation guidelines, and explore additional reasoning methods. Abstract: While vision-language models (VLMs) have made remarkable progress on many popular visual question answering (VQA) benchmarks, it remains unclear whether they abstract and reason over depicted objects. Inspired by human object categorisation, object property reasoning involves identifying and recognising low-level details and higher-level abstractions. While current VQA benchmarks consider a limited set of object property attributes like size, they typically blend perception and reasoning, and lack representativeness in terms of reasoning and image categories. To this end, we introduce a systematic evaluation framework with images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions driven by prior work on commonsense reasoning. We develop a procedure to instantiate this benchmark into ORBIT, a multi-level reasoning VQA benchmark for object properties comprising 360 images paired with a total of 1,080 count-based questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations compared to humans, with the best-performing model only reaching 40% accuracy. VLMs struggle particularly with realistic (photographic) images, counterfactual reasoning about physical and functional properties, and higher counts. ORBIT points to the need to develop methods for scalable benchmarking, generalize annotation guidelines, and explore additional reasoning VLMs. We make the ORBIT benchmark and the experimental code available to support such endeavors.

[63] CSNR and JMIM Based Spectral Band Selection for Reducing Metamerism in Urban Driving

Jiarong Li,Imad Ali Shah,Diarmaid Geever,Fiachra Collins,Enda Ward,Martin Glavin,Edward Jones,Brian Deegan

Main category: cs.CV

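TL;DR: Using hyperspectral imaging and optimized band selection, this study markedly improves the separability of vulnerable road users from the background, providing a more reliable basis for autonomous driving and advanced driver assistance systems.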

Details Motivation: Protecting vulnerable road users (VRU) is a critical safety challenge for automotive perception systems, particularly when metamerism causes visual ambiguity in RGB imagery. Method: The study combines information theory techniques (joint mutual information maximization, correlation analysis) with an image quality metric (contrast signal-to-noise ratio) to select the most spectrally informative bands, and validates the approach experimentally on the H-City V2 dataset. Result: The selected HSI bands yield improvements of 70.24%, 528.46%, 1206.83%, and 246.62% on the dissimilarity (Euclidean, SAM, $T^2$) and perception (CIE $\Delta E$) metrics, and effectively reduce metameric confusion. Conclusion: The paper proposes a band selection strategy based on hyperspectral imaging (HSI) that enhances the separability of vulnerable road users (VRU), thereby improving road safety. Abstract: Protecting Vulnerable Road Users (VRU) is a critical safety challenge for automotive perception systems, particularly under visual ambiguity caused by metamerism, a phenomenon where distinct materials appear similar in RGB imagery. This work investigates hyperspectral imaging (HSI) to overcome this limitation by capturing unique material signatures beyond the visible spectrum, especially in the Near-Infrared (NIR). To manage the inherent high-dimensionality of HSI data, we propose a band selection strategy that integrates information theory techniques (joint mutual information maximization, correlation analysis) with a novel application of an image quality metric (contrast signal-to-noise ratio) to identify the most spectrally informative bands. Using the Hyperspectral City V2 (H-City) dataset, we identify three informative bands (497 nm, 607 nm, and 895 nm, $\pm$27 nm) and reconstruct pseudo-color images for comparison with co-registered RGB. Quantitative results demonstrate increased dissimilarity and perceptual separability of VRU from the background. The selected HSI bands yield improvements of 70.24%, 528.46%, 1206.83%, and 246.62% for dissimilarity (Euclidean, SAM, $T^2$) and perception (CIE $\Delta E$) metrics, consistently outperforming RGB and confirming a marked reduction in metameric confusion. By providing a spectrally optimized input, our method enhances VRU separability, establishing a robust foundation for downstream perception tasks in Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD), ultimately contributing to improved road safety.
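
The abstract names Euclidean distance and the spectral angle mapper (SAM) among its dissimilarity metrics. As a rough illustration of what such metrics measure, the sketch below computes both for a pair of three-band spectra and expresses a change as a relative improvement. The function names and the sample values are assumptions made for illustration; they are not taken from the paper or from the H-City dataset.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two spectra of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def spectral_angle(a, b):
    """Spectral angle (radians) between two spectra; 0 means identical shape."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos)

def relative_improvement(baseline, new):
    """Percentage increase of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline

# Hypothetical pedestrian and background pixels (made-up values).
rgb_target, rgb_background = [0.40, 0.42, 0.41], [0.41, 0.42, 0.40]
hsi_target, hsi_background = [0.40, 0.42, 0.80], [0.41, 0.42, 0.30]

gain = relative_improvement(
    spectral_angle(rgb_target, rgb_background),
    spectral_angle(hsi_target, hsi_background),
)
print(f"SAM separability gain: {gain:.1f}%")
```

SAM is insensitive to overall brightness, since scaling a spectrum leaves its angle unchanged; this is a general property of the metric and one reason it is often reported alongside Euclidean distance.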

[64] EVCtrl: Efficient Control Adapter for Visual Generation

Zixiang Yang,Yue Ma,Yinhan Zhang,Shanhui Mo,Dongrui Liu,Linfeng Zhang

Main category: cs.CV

TL;DR: EVCtrl is a lightweight, plug-and-play control adapter that significantly improves the efficiency of visual generation models like ControlNet by reducing spatial and temporal redundancy, achieving significant speedups without compromising generation quality.

Details Motivation: ControlNet, while offering precise spatial-temporal control in visual generation, introduces significant latency and redundant computation, especially for video generation. The authors aim to develop a more efficient control method that reduces this overhead without requiring model retraining. Method: The authors propose EVCtrl, which introduces a spatio-temporal dual caching strategy to eliminate redundant computation in DiT-ControlNet models. Spatial redundancy is addressed by partitioning the network into global and local functional zones, while temporal redundancy is reduced by selectively omitting unnecessary denoising steps. Result: EVCtrl achieves 2.16 times speedup on CogVideo-ControlNet and 2.05 times speedup on Wan2.1-ControlNet with minimal degradation in generation quality. The method is effective for both image and video control generation tasks without requiring any model retraining. Conclusion: EVCtrl is a lightweight and effective control adapter for visual generation tasks that significantly reduces computational overhead without compromising generation quality. It achieves notable speedups on multiple video and image generation models. Abstract: Visual generation includes both image and video generation, training probabilistic models to create coherent, diverse, and semantically faithful content from scratch. While early research focused on unconditional sampling, practitioners now demand controllable generation that allows precise specification of layout, pose, motion, or style. While ControlNet grants precise spatial-temporal control, its auxiliary branch markedly increases latency and introduces redundant computation in both uncontrolled regions and denoising steps, especially for video. To address this problem, we introduce EVCtrl, a lightweight, plug-and-play control adapter that slashes overhead without retraining the model. Specifically, we propose a spatio-temporal dual caching strategy for sparse control information. 
For spatial redundancy, we first profile how each layer of DiT-ControlNet responds to fine-grained control, then partition the network into global and local functional zones. A locality-aware cache focuses computation on the local zones that truly need the control signal, skipping the bulk of redundant computation in global regions. For temporal redundancy, we selectively omit unnecessary denoising steps to improve efficiency. Extensive experiments on CogVideo-Controlnet, Wan2.1-Controlnet, and Flux demonstrate that our method is effective in image and video control generation without the need for training. For example, it achieves 2.16 and 2.05 times speedups on CogVideo-Controlnet and Wan2.1-Controlnet, respectively, with almost no degradation in generation quality. Code is available in the supplementary materials.
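
The temporal side of this idea can be pictured as a generic caching pattern: recompute an expensive control signal only on some denoising steps and reuse the cached value in between. The sketch below is an illustration of that pattern only. The fixed refresh interval, the function names, and the stand-in control branch are all assumptions; the abstract does not say how EVCtrl decides which steps to skip.

```python
def run_with_control_cache(num_steps, control_branch, refresh_every=2):
    """Run a denoising-style loop, recomputing the control signal only
    every `refresh_every` steps and reusing the cached value otherwise.

    Returns the control signal used at each step and the number of times
    the (expensive) control branch was actually evaluated.
    """
    cache = None
    used_signals = []
    branch_calls = 0
    for step in range(num_steps):
        if cache is None or step % refresh_every == 0:
            cache = control_branch(step)  # expensive path
            branch_calls += 1
        used_signals.append(cache)        # cheap path reuses the cache
    return used_signals, branch_calls

# Stand-in for a control branch: any function of the step index will do.
signals, calls = run_with_control_cache(6, lambda step: step * 10)
print(signals, calls)
```

With six steps and a refresh interval of two, the control branch runs three times instead of six; the price is that odd-numbered steps use a slightly stale signal.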

[65] Not There Yet: Evaluating Vision Language Models in Simulating the Visual Perception of People with Low Vision

Rosiana Natalie,Wenqian Xu,Ruei-Che Chang,Rada Mihalcea,Anhong Guo

Main category: cs.CV

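TL;DR: This study evaluates how well VLMs simulate the image interpretation of individuals with low vision, and finds that combining vision information with example responses significantly improves agreement between model and participant answers.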

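Details Motivation: The study examines whether VLMs can simulate the visual perception of individuals with low vision in the accessibility domain, filling a gap in prior research. Method: Through a survey study with 40 low vision participants, the authors collect vision information and image perception and recognition responses, construct prompts for VLMs (GPT-4o) to create a simulated agent of each participant, and evaluate the agreement between VLM-generated responses and participants' original answers. Result: Given minimal prompts, VLMs tend to infer beyond the specified vision ability, resulting in low agreement (0.59). Agreement remains low when only vision information or only example image responses are provided (0.59 in both cases), whereas combining both significantly increases agreement (0.70, p < 0.0001). A single example combining open-ended and multiple-choice responses significantly improves performance (p < 0.0001), while additional examples provide no significant benefit (p > 0.05). Conclusion: Combining vision information with example image responses significantly improves the ability of VLMs to simulate the image interpretation of low vision individuals, whereas either source alone is ineffective. In addition, a single example combining open-ended and multiple-choice responses significantly improves performance, while additional examples bring little gain. Abstract: Advances in vision language models (VLMs) have enabled the simulation of general human behavior through their reasoning and problem solving capabilities. However, prior research has not investigated such simulation capabilities in the accessibility domain. In this paper, we evaluate the extent to which VLMs can simulate the vision perception of low vision individuals when interpreting images. We first compile a benchmark dataset through a survey study with 40 low vision participants, collecting their brief and detailed vision information and both open-ended and multiple-choice image perception and recognition responses to up to 25 images. Using these responses, we construct prompts for VLMs (GPT-4o) to create simulated agents of each participant, varying the vision information and example image responses included. We evaluate the agreement between VLM-generated responses and participants' original answers. Our results indicate that VLMs tend to infer beyond the specified vision ability when given minimal prompts, resulting in low agreement (0.59). The agreement between the agents' and participants' responses remains low when only either the vision information (0.59) or example image responses (0.59) are provided, whereas a combination of both significantly increases the agreement (0.70, p < 0.0001).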
Notably, a single example combining both open-ended and multiple-choice responses offers significant performance improvements over either alone (p < 0.0001), while additional examples provide minimal benefits (p > 0.05).

[66] Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?

Xuezheng Chen,Zhengbo Zou

Main category: cs.CV

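TL;DR: This paper presents the ConstructionSite 10k dataset to advance vision language model research in construction safety inspection, and evaluates the performance of current models.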

Details Motivation: The application of existing vision language models to construction safety inspection is constrained by the lack of comprehensive open datasets, which limits their applicability to tasks they were not directly trained for. Method: The authors build a dataset of 10,000 construction site images annotated for three interconnected tasks (image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding) and evaluate current state-of-the-art vision language models on it. Result: Current state-of-the-art vision language models show notable generalization abilities in zero-shot and few-shot settings but need further training to be applicable to actual construction sites. Conclusion: ConstructionSite 10k provides an important benchmark for vision language model research in construction safety inspection and supports the development of more effective model architectures and techniques. Abstract: Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs in construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability in tasks they are not directly trained for. In this paper, we propose ConstructionSite 10k, a dataset featuring 10,000 construction site images with annotations for three inter-connected tasks, including image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, while additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.

[67] Can Multi-modal (reasoning) LLMs detect document manipulation?

Zisheng Liang,Kidus Zewde,Rudra Pratap Singh,Disha Patil,Zexi Chen,Jiayu Xue,Yao Yao,Yifei Chen,Qinzhe Liu,Simiao Ren

Main category: cs.CV

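TL;DR: This study evaluates several multi-modal large language models on document fraud detection and finds that the best of them outperform conventional methods in zero-shot generalization, but that model size has little relationship with performance and task-specific fine-tuning matters more.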

Details Motivation: Document fraud poses a significant threat to industries that rely on secure and verifiable documentation, so robust detection mechanisms are needed. Method: The authors benchmark several state-of-the-art multi-modal large language models, using prompt optimization and a detailed analysis of the models' reasoning processes. Result: Top-performing multi-modal LLMs show strong zero-shot generalization and outperform conventional methods on out-of-distribution datasets, while several vision LLMs exhibit inconsistent or subpar performance. Model size and advanced reasoning capabilities show limited correlation with detection accuracy. Conclusion: Multi-modal large language models show considerable potential for document fraud detection, and task-specific fine-tuning is critical for improving detection accuracy. Abstract: Document fraud poses a significant threat to industries reliant on secure and verifiable documentation, necessitating robust detection mechanisms. This study investigates the efficacy of state-of-the-art multi-modal large language models (LLMs), including OpenAI O1, OpenAI 4o, Gemini Flash (thinking), Deepseek Janus, Grok, Llama 3.2 and 4, Qwen 2 and 2.5 VL, Mistral Pixtral, and Claude 3.5 and 3.7 Sonnet, in detecting fraudulent documents. We benchmark these models against each other and prior work on document fraud detection techniques using a standard dataset with real transactional documents. Through prompt optimization and detailed analysis of the models' reasoning processes, we evaluate their ability to identify subtle indicators of fraud, such as tampered text, misaligned formatting, and inconsistent transactional sums. Our results reveal that top-performing multi-modal LLMs demonstrate superior zero-shot generalization, outperforming conventional methods on out-of-distribution datasets, while several vision LLMs exhibit inconsistent or subpar performance. Notably, model size and advanced reasoning capabilities show limited correlation with detection accuracy, suggesting task-specific fine-tuning is critical. This study underscores the potential of multi-modal LLMs in enhancing document fraud detection systems and provides a foundation for future research into interpretable and scalable fraud mitigation strategies.

[68] MedSAMix: A Training-Free Model Merging Approach for Medical Image Segmentation

Yanwu Yang,Guinan Su,Jiesi Hu,Francesco Sammarco,Jonas Geiping,Thomas Wolfers

Main category: cs.CV

TL;DR: MedSAMix is a novel training-free model merging approach that combines generalist and specialist models for improved medical image segmentation, addressing data and generalization challenges through automated optimization techniques.

Details Motivation: Existing fine-tuned medical segmentation models face challenges such as limited data, heterogeneity, scarce annotations, and distributional shifts, which restrict their generalizability. This work aims to overcome these limitations. Method: MedSAMix uses a training-free model merging technique with zero-order optimization to integrate the strengths of generalist and specialist models. It applies single-task and multi-objective optimization for different clinical requirements. Result: MedSAMix demonstrated significant performance improvements, with a 6.67% increase on specialized tasks and a 4.37% improvement in multi-task evaluations, while reducing model bias. Conclusion: The study concludes that MedSAMix effectively addresses the limitations of existing medical image segmentation models by combining generalist and specialist approaches, enhancing both domain-specific accuracy and generalization. Abstract: Universal medical image segmentation models have emerged as a promising paradigm due to their strong generalizability across diverse tasks, showing great potential for a wide range of clinical applications. This potential has been partly driven by the success of general-purpose vision models such as the Segment Anything Model (SAM), which has inspired the development of various fine-tuned variants for medical segmentation tasks. However, fine-tuned variants like MedSAM are trained on comparatively limited medical imaging data that often suffers from heterogeneity, scarce annotations, and distributional shifts. These challenges limit their ability to generalize across a wide range of medical segmentation tasks. In this regard, we propose MedSAMix, a training-free model merging method that integrates the strengths of both generalist models (e.g., SAM) and specialist models (e.g., MedSAM) for medical image segmentation. 
In contrast to traditional model merging approaches that rely on manual configuration and often result in suboptimal outcomes, we propose a zero-order optimization method to automatically discover optimal layer-wise merging solutions. Furthermore, for clinical applications, we develop two regimes to meet the demand of domain-specificity and generalizability in different scenarios by single-task optimization and multi-objective optimization respectively. Extensive evaluations on 25 medical segmentation tasks demonstrate that MedSAMix effectively mitigates model bias and consistently improves performance in both domain-specific accuracy and generalization, achieving improvements of 6.67% on specialized tasks and 4.37% on multi-task evaluations.
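
To make the idea of gradient-free, layer-wise merging concrete, the sketch below interpolates two toy "models" layer by layer and picks the per-layer mixing coefficients by plain random search. Random search stands in here for whatever zero-order optimizer the authors use, and the dictionary-of-lists weight format, the function names, and the toy scoring function are all assumptions for illustration.

```python
import random

def merge_layers(generalist, specialist, alphas):
    """Linearly interpolate each layer: (1 - a) * generalist + a * specialist."""
    merged = {}
    for name in generalist:
        a = alphas[name]
        merged[name] = [
            (1 - a) * g + a * s
            for g, s in zip(generalist[name], specialist[name])
        ]
    return merged

def random_search(generalist, specialist, score_fn, trials=200, seed=0):
    """Zero-order search: sample per-layer coefficients, keep the best score.

    Only evaluations of `score_fn` are needed, never its gradient.
    """
    rng = random.Random(seed)
    best_alphas = {name: 0.5 for name in generalist}
    best_score = score_fn(merge_layers(generalist, specialist, best_alphas))
    for _ in range(trials):
        candidate = {name: rng.random() for name in generalist}
        score = score_fn(merge_layers(generalist, specialist, candidate))
        if score > best_score:
            best_alphas, best_score = candidate, score
    return best_alphas, best_score

# Toy weights and a toy validation score (higher is better).
generalist = {"encoder": [0.0, 0.0], "decoder": [1.0]}
specialist = {"encoder": [2.0, 4.0], "decoder": [3.0]}
target = {"encoder": [0.5, 1.0], "decoder": [2.8]}

def score_fn(model):
    return -sum(
        (w - t) ** 2
        for name in model
        for w, t in zip(model[name], target[name])
    )

alphas, score = random_search(generalist, specialist, score_fn)
print(alphas, score)
```

In practice the score would come from segmentation quality on held-out data; a single-task score and a combined multi-task score would correspond loosely to the two regimes the abstract describes.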

[69] Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset

Wentao Mo,Qingchao Chen,Yuxin Peng,Siyuan Huang,Yang Liu

Main category: cs.CV

TL;DR: This paper introduces MV-ScanQA and TripAlign datasets to advance 3D vision-language learning by enhancing multi-view reasoning and contextual object alignment, along with the LEGO method that achieves top performance on multiple benchmarks.

Details Motivation: The motivation stems from the limitations of existing 3D vision-language datasets, which lack annotations requiring multi-view reasoning and fail to capture richer contextual alignments between multiple objects. Method: The study introduces MV-ScanQA, a new dataset for multi-view 3D question answering, and the TripAlign dataset for 2D-3D-language pre-training. The LEGO method was developed to transfer knowledge from pre-trained 2D models to the 3D domain. Result: In MV-ScanQA, 68% of questions require multi-view reasoning, compared to less than 7% in previous datasets. The LEGO method, pre-trained on TripAlign, achieves state-of-the-art performance on MV-ScanQA and other 3D benchmarks. Conclusion: The study concludes that the proposed MV-ScanQA and TripAlign datasets significantly enhance 3D vision-language learning by enabling better multi-view compositional reasoning and providing richer alignment signals, with the LEGO method achieving state-of-the-art results. Abstract: The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in a single viewpoint, and annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning.
To facilitate the training of models for such demanding scenarios, we present the TripAlign dataset, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets that explicitly align groups of contextually related objects with text, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA, transferring knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign. Empirically, LEGO pre-trained on TripAlign achieves state-of-the-art performance not only on the proposed MV-ScanQA, but also on existing benchmarks for 3D dense captioning and question answering. Datasets and code are available at https://matthewdm0816.github.io/tripalign-mvscanqa.

[70] Data-Driven Abdominal Phenotypes of Type 2 Diabetes in Lean, Overweight, and Obese Cohorts

Lucas W. Remedios,Chloe Choe,Trent M. Schwartz,Dingjie Su,Gaurav Rudravaram,Chenyu Gao,Aravind R. Krishnan,Adam M. Saunders,Michael E. Kim,Shunxing Bao,Alvin C. Powers,Bennett A. Landman,John Virostko

Main category: cs.CV

TL;DR: This study uses AI to analyze abdominal CT scans and identifies consistent body composition features linked to type 2 diabetes across different BMI groups, suggesting that detailed body composition can reveal risk and protective signatures beyond BMI alone.

Details Motivation: While BMI is a known risk factor for type 2 diabetes, the disease's presence in some lean individuals and absence in some obese individuals suggests that more detailed body composition analysis could reveal specific abdominal phenotypes linked to diabetes. This motivates the use of AI to extract detailed measurements from clinical imaging data to better understand diabetes risk. Method: The study used AI to extract detailed body composition measurements from 3D abdominal CT scans of a large cohort (n=1,728), divided into lean, overweight, and obese subgroups. A random forest classifier with cross-validation was used to classify type 2 diabetes, and SHAP analysis was applied to interpret feature contributions. Clustering techniques were used to group scans by shared decision patterns and link them to anatomical differences. Result: The random forest models achieved mean AUCs of 0.72–0.74. Shared type 2 diabetes signatures were identified across all groups, including fatty skeletal muscle, older age, greater visceral and subcutaneous fat, and a smaller or fat-laden pancreas. Univariate logistic regression confirmed the direction of 14–18 of the top 20 predictors in each subgroup (p < 0.05). Conclusion: The study concludes that abdominal drivers of type 2 diabetes appear consistent across weight classes, suggesting that detailed body composition analysis can uncover phenotypes linked to diabetes risk and protection. Abstract: Purpose: Although elevated BMI is a well-known risk factor for type 2 diabetes, the disease's presence in some lean adults and absence in others with obesity suggests that detailed body composition may uncover abdominal phenotypes of type 2 diabetes. With AI, we can now extract detailed measurements of size, shape, and fat content from abdominal structures in 3D clinical imaging at scale.
This creates an opportunity to empirically define body composition signatures linked to type 2 diabetes risk and protection using large-scale clinical data. Approach: To uncover BMI-specific diabetic abdominal patterns from clinical CT, we applied our design four times: once on the full cohort (n = 1,728) and once on lean (n = 497), overweight (n = 611), and obese (n = 620) subgroups separately. Briefly, our experimental design transforms abdominal scans into collections of explainable measurements through segmentation, classifies type 2 diabetes through a cross-validated random forest, measures how features contribute to model-estimated risk or protection through SHAP analysis, groups scans by shared model decision patterns (clustering from SHAP), and links back to anatomical differences (classification). Results: The random forests achieved mean AUCs of 0.72-0.74. There were shared type 2 diabetes signatures in each group: fatty skeletal muscle, older age, greater visceral and subcutaneous fat, and a smaller or fat-laden pancreas. Univariate logistic regression confirmed the direction of 14-18 of the top 20 predictors within each subgroup (p < 0.05). Conclusions: Our findings suggest that abdominal drivers of type 2 diabetes may be consistent across weight classes.
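
For readers less familiar with the reported metric, AUC can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as half. The sketch below computes it directly from that definition. The labels and scores are made-up values, not data from the study, and the function name is an assumption.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and
    negatives (ties count as half a win). Labels are 1 (case) or 0 (control)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical diabetes labels and model-estimated risks.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auc(labels, scores))  # 0.75
```

Under this reading, an AUC of 0.72-0.74 means the classifier ranks a diabetic scan above a non-diabetic one in roughly three out of four random pairs.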

[71] HierOctFusion: Multi-scale Octree-based 3D Shape Generation via Part-Whole-Hierarchy Message Passing

Xinjie Gao,Bi'an Du,Wei Hu

Main category: cs.CV

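TL;DR: This paper proposes HierOctFusion, an octree-based 3D content generation model that improves the detail and efficiency of generated results through hierarchical feature interaction and a mechanism for injecting semantic part information.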

Details Motivation: Existing 3D content generation methods typically model 3D objects as holistic entities, ignoring their semantic part hierarchies and limiting generalization. In addition, holistic high-resolution modeling is computationally expensive, whereas real-world objects are inherently sparse and hierarchical, which makes them well suited to layered generation. Method: The paper proposes a part-aware multi-scale octree diffusion model and introduces a cross-attention conditioning mechanism that injects part-level information into the generation process. It also constructs a 3D dataset with part category annotations using a pre-trained segmentation model. Result: Experiments show that HierOctFusion achieves superior shape quality and generation efficiency compared with prior methods. Conclusion: By introducing hierarchical feature interaction and a cross-attention conditioning mechanism, HierOctFusion outperforms prior methods at generating fine-grained and sparse object structures. Abstract: 3D content generation remains a fundamental yet challenging task due to the inherent structural complexity of 3D data. While recent octree-based diffusion models offer a promising balance between efficiency and quality through hierarchical generation, they often overlook two key insights: 1) existing methods typically model 3D objects as holistic entities, ignoring their semantic part hierarchies and limiting generalization; and 2) holistic high-resolution modeling is computationally expensive, whereas real-world objects are inherently sparse and hierarchical, making them well-suited for layered generation. Motivated by these observations, we propose HierOctFusion, a part-aware multi-scale octree diffusion model that enhances hierarchical feature interaction for generating fine-grained and sparse object structures. Furthermore, we introduce a cross-attention conditioning mechanism that injects part-level information into the generation process, enabling semantic features to propagate effectively across hierarchical levels from parts to the whole. Additionally, we construct a 3D dataset with part category annotations using a pre-trained segmentation model to facilitate training and evaluation. Experiments demonstrate that HierOctFusion achieves superior shape quality and efficiency compared to prior methods.

[72] UWB-PostureGuard: A Privacy-Preserving RF Sensing System for Continuous Ergonomic Sitting Posture Monitoring

Haotang Li,Zhenyu Qi,Sen He,Kebin Peng,Sheng Tan,Yili Ren,Tomas Cerny,Jiyue Zhao,Zi Wang

Main category: cs.CV

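TL;DR: This paper presents UWB-PostureGuard, a privacy-preserving posture monitoring system based on ultra-wideband (UWB) sensing that addresses poor sitting posture during prolonged computer use and enables preventive health management through contactless, continuous monitoring.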

Details Motivation: Traditional posture monitoring methods raise privacy concerns (camera-based systems) and cause user discomfort (wearable sensors), so a more effective solution is needed. UWB-PostureGuard aims to address these problems with a contactless, highly accurate posture monitoring method. Method: UWB-PostureGuard uses commercial UWB devices, extracts multiple posture features through comprehensive feature engineering, and develops the PoseGBDT model to capture temporal dependencies in posture patterns, overcoming the limitations of traditional frame-wise classification. Result: In a real-world evaluation with 10 participants and 19 distinct postures, the system achieves 99.11% accuracy and is robust to environmental variables such as clothing thickness, additional devices, and furniture configurations. Conclusion: UWB-PostureGuard provides a scalable, privacy-preserving mobile health solution on existing platforms for proactive posture management, improving quality of life at low cost. Abstract: Improper sitting posture during prolonged computer use has become a significant public health concern. Traditional posture monitoring solutions face substantial barriers, including privacy concerns with camera-based systems and user discomfort with wearable sensors. This paper presents UWB-PostureGuard, a privacy-preserving ultra-wideband (UWB) sensing system that advances mobile technologies for preventive health management through continuous, contactless monitoring of ergonomic sitting posture. Our system leverages commercial UWB devices, utilizing comprehensive feature engineering to extract multiple ergonomic sitting posture features. We develop PoseGBDT to effectively capture temporal dependencies in posture patterns, addressing limitations of traditional frame-wise classification approaches. Extensive real-world evaluation across 10 participants and 19 distinct postures demonstrates exceptional performance, achieving 99.11% accuracy while maintaining robustness against environmental variables such as clothing thickness, additional devices, and furniture configurations. Our system provides a scalable, privacy-preserving mobile health solution on existing platforms for proactive ergonomic management, improving quality of life at low cost.

[73] Residual-based Efficient Bidirectional Diffusion Model for Image Dehazing and Haze Generation

Bing Liu,Le Wang,Hao Liu,Mingming Liu

Main category: cs.CV

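TL;DR: This paper proposes a residual-based efficient bidirectional diffusion model (RBDM) that translates between haze-free and hazy images in both directions with few sampling steps and performs well on multiple datasets.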

Details Motivation: Current deep dehazing methods focus only on removing haze from images and cannot translate between hazy and haze-free images; RBDM is proposed to address this. Method: The paper proposes a residual-based efficient bidirectional diffusion model (RBDM) that uses dual Markov chains to shift the residuals effectively, perturbs hazy and haze-free images at individual timesteps to predict the noise, and learns the conditional distributions simultaneously. The method also introduces a unified score function learned on image patches to improve performance and reduce computational cost. Result: With only 15 sampling steps, RBDM achieves size-agnostic bidirectional image translation and delivers superior or at least comparable performance to state-of-the-art methods on synthetic and real-world datasets. Conclusion: RBDM achieves size-agnostic bidirectional translation between haze-free and hazy images and performs better than, or at least comparably to, state-of-the-art methods on both synthetic and real-world datasets. Abstract: Current deep dehazing methods only focus on removing haze from hazy images, lacking the capability to translate between hazy and haze-free images. To address this issue, we propose a residual-based efficient bidirectional diffusion model (RBDM) that can model the conditional distributions for both dehazing and haze generation. Firstly, we devise dual Markov chains that can effectively shift the residuals and facilitate bidirectional smooth transitions between them. Secondly, the RBDM perturbs the hazy and haze-free images at individual timesteps and predicts the noise in the perturbed data to simultaneously learn the conditional distributions. Finally, to enhance performance on relatively small datasets and reduce computational costs, our method introduces a unified score function learned on image patches instead of entire images. Our RBDM successfully implements size-agnostic bidirectional transitions between haze-free and hazy images with only 15 sampling steps. Extensive experiments demonstrate that the proposed method achieves superior or at least comparable performance to state-of-the-art methods on both synthetic and real-world datasets.

[74] A Cross-Modal Rumor Detection Scheme via Contrastive Learning by Exploring Text and Image internal Correlations

Bin Ma,Yifei Zhang,Yongjin Xian,Qi Li,Linna Zhou,Gongxun Miao

Main category: cs.CV

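TL;DR: This paper proposes MICC, a contrastive-learning-based cross-modal rumor detection method that fuses multi-scale image and text information and substantially improves detection performance.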

Details Motivation: Existing rumor detection methods neglect image content and its relationship with the surrounding context, which loses critical information. Method: The paper designs an SCLIP encoder to generate unified semantic embeddings, introduces a Cross-Modal Multi-Scale Alignment module, and uses a scale-aware fusion network for feature integration. Result: Experiments on two real-world datasets show that the proposed method significantly outperforms existing state-of-the-art methods on rumor detection. Conclusion: The paper proposes MICC, a new cross-modal rumor detection method that fuses multi-scale image features with text features and achieves better rumor detection performance than existing methods, indicating its potential for practical applications. Abstract: Existing rumor detection methods often neglect the content within images as well as the inherent relationships between contexts and images across different visual scales, thereby resulting in the loss of critical information pertinent to rumor identification. To address these issues, this paper presents a novel cross-modal rumor detection scheme based on contrastive learning, namely the Multi-scale Image and Context Correlation exploration algorithm (MICC). Specifically, we design an SCLIP encoder to generate unified semantic embeddings for text and multi-scale image patches through contrastive pretraining, enabling their relevance to be measured via dot-product similarity. Building upon this, a Cross-Modal Multi-Scale Alignment module is introduced to identify image regions most relevant to the textual semantics, guided by mutual information maximization and the information bottleneck principle, through a Top-K selection strategy based on a cross-modal relevance matrix constructed between the text and multi-scale image patches. Moreover, a scale-aware fusion network is designed to integrate the highly correlated multi-scale image features with global text features by assigning adaptive weights to image regions based on their semantic importance and cross-modal relevance. The proposed methodology has been extensively evaluated on two real-world datasets. The experimental results demonstrate that it achieves a substantial performance improvement over existing state-of-the-art approaches in rumor detection, highlighting its effectiveness and potential for practical applications.
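
The Top-K selection over a cross-modal relevance matrix described in the abstract can be illustrated with a small sketch. It assumes that text and image patches are already embedded in a shared space, that relevance is a plain dot product, and that each patch is scored by its best-matching text embedding. The max aggregation, the function names, and the toy vectors are illustrative assumptions, not details from the paper.

```python
def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def relevance_matrix(text_embs, patch_embs):
    """Rows index text embeddings, columns index image patches."""
    return [[dot(t, p) for p in patch_embs] for t in text_embs]


def top_k_patches(text_embs, patch_embs, k):
    """Return indices of the k patches most relevant to the text.

    Assumption: a patch's score is its maximum relevance over all text embeddings.
    """
    rel = relevance_matrix(text_embs, patch_embs)
    patch_scores = [max(row[j] for row in rel) for j in range(len(patch_embs))]
    order = sorted(range(len(patch_scores)), key=lambda j: patch_scores[j], reverse=True)
    return order[:k]


# Toy 2-D embeddings: two text vectors and three image patches.
text = [[1.0, 0.0], [0.0, 1.0]]
patches = [[0.9, 0.1], [0.2, 0.3], [0.1, 0.8]]
print(top_k_patches(text, patches, k=2))
```

Keeping only the highest-scoring patches discards image regions that carry little text-related signal, which is the intuition behind the information bottleneck framing in the abstract.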

[75] LEARN: A Story-Driven Layout-to-Image Generation Framework for STEM Instruction

Maoquan Zhang,Bisser Raytchev,Xiujuan Sun

Main category: cs.CV

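TL;DR: LEARN is an educational visualization framework built on a layout-aware diffusion model that uses semantic alignment and cognitive scaffolding to support understanding of complex concepts in STEM education.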

Details Motivation: To address the difficulty of visualizing abstract and sequential scientific concepts in STEM education, while reducing extraneous cognitive load and supporting the development of mid-to-high-level reasoning. Method: LEARN uses layout-conditioned generation, contrastive visual-semantic training, and prompt modulation to generate coherent visual sequences aligned with educational goals. Result: LEARN generates spatially organized, story-driven visual sequences that support reasoning in line with Bloom's taxonomy and counter the fragmented attention induced by short-form media. Conclusion: LEARN represents a new direction for generative AI that unifies layout-based storytelling, semantic structure learning, and cognitive scaffolding, and may support adaptive, exploratory educational content in the future. Abstract: LEARN is a layout-aware diffusion framework designed to generate pedagogically aligned illustrations for STEM education. It leverages a curated BookCover dataset that provides narrative layouts and structured visual cues, enabling the model to depict abstract and sequential scientific concepts with strong semantic alignment. Through layout-conditioned generation, contrastive visual-semantic training, and prompt modulation, LEARN produces coherent visual sequences that support mid-to-high-level reasoning in line with Bloom's taxonomy while reducing extraneous cognitive load as emphasized by Cognitive Load Theory. By fostering spatially organized and story-driven narratives, the framework counters fragmented attention often induced by short-form media and promotes sustained conceptual focus. Beyond static diagrams, LEARN demonstrates potential for integration with multimodal systems and curriculum-linked knowledge graphs to create adaptive, exploratory educational content. As the first generative approach to unify layout-based storytelling, semantic structure learning, and cognitive scaffolding, LEARN represents a novel direction for generative AI in education. The code and dataset will be released to facilitate future research and practical deployment.

[76] Semi-supervised Image Dehazing via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models

Bing Liu,Le Wang,Mingming Liu,Hao Liu,Rui Yao,Yong Zhou,Peng Liu,Tongqiang Xia

Main category: cs.CV

TL;DR: This paper introduces EM-B3DM, an efficient semi-supervised method for image dehazing using Expectation-Maximization, Bidirectional Brownian Bridge Diffusion Models, and a detail-enhanced convolution block, achieving strong performance without requiring paired training data.

Details Motivation: The research aims to address the challenges of dehazing real-world images, particularly in scenes with thick haze, by eliminating the need for costly paired hazy and clear image datasets. Method: The method involves a two-stage learning scheme using Expectation-Maximization (EM) and Bidirectional Brownian Bridge Diffusion Models (B3DM), along with a detail-enhanced Residual Difference Convolution block (RDC) for improved gradient-level information capture. Result: The EM-B3DM method achieves superior or at least comparable results to state-of-the-art methods for image dehazing on both synthetic and real-world datasets. Conclusion: The proposed EM-B3DM method demonstrates superior or comparable performance to existing state-of-the-art image dehazing techniques on synthetic and real-world datasets. Abstract: Existing dehazing methods deal with real-world haze images with difficulty, especially scenes with thick haze. One of the main reasons is the lack of real-world paired data and robust priors. To avoid the costly collection of paired hazy and clear images, we propose an efficient semi-supervised image dehazing method via Expectation-Maximization and Bidirectional Brownian Bridge Diffusion Models (EM-B3DM) with a two-stage learning scheme. In the first stage, we employ the EM algorithm to decouple the joint distribution of paired hazy and clear images into two conditional distributions, which are then modeled using a unified Brownian Bridge diffusion model to directly capture the structural and content-related correlations between hazy and clear images. In the second stage, we leverage the pre-trained model and large-scale unpaired hazy and clear images to further improve the performance of image dehazing. Additionally, we introduce a detail-enhanced Residual Difference Convolution block (RDC) to capture gradient-level information, significantly enhancing the model's representation capability. 
Extensive experiments demonstrate that our EM-B3DM achieves superior or at least comparable performance to state-of-the-art methods on both synthetic and real-world datasets.

[77] VFM-Guided Semi-Supervised Detection Transformer for Source-Free Object Detection in Remote Sensing Images

Jianhong Han,Yupei Wang,Liang Chen

Main category: cs.CV

TL;DR: This paper proposes VG-DETR, a semi-supervised framework for Source-Free Object Detection in remote sensing imagery, which integrates Vision Foundation Models to reduce pseudo-label noise and enhance feature representation, leading to improved performance.

Details Motivation: Unsupervised domain adaptation methods face limitations in real-world remote-sensing scenarios due to privacy and transmission constraints. Existing Source-Free Object Detection (SFOD) methods struggle with training collapse from noisy pseudo-labels, especially in complex remote sensing images. This work aims to address these issues by leveraging Vision Foundation Models and limited labeled target data. Method: The paper proposes a semi-supervised framework for Source-Free Object Detection (SFOD) in remote sensing images, introducing a Vision foundation-Guided DEtection TRansformer (VG-DETR). It incorporates a Vision Foundation Model (VFM) to improve pseudo-label reliability and detector feature extraction. The method includes a VFM-guided pseudo-label mining strategy and a dual-level alignment approach using contrastive learning and similarity matching. Result: Extensive experiments show that VG-DETR achieves superior performance on source-free remote sensing detection tasks by improving pseudo-label quality and feature representation robustness against domain gaps. Conclusion: VG-DETR effectively addresses the issue of pseudo-label noise in Source-Free Object Detection for remote sensing images, achieving superior performance through integration of Vision Foundation Models. Abstract: Unsupervised domain adaptation methods have been widely explored to bridge domain gaps. However, in real-world remote-sensing scenarios, privacy and transmission constraints often preclude access to source domain data, which limits their practical applicability. Recently, Source-Free Object Detection (SFOD) has emerged as a promising alternative, aiming at cross-domain adaptation without relying on source data, primarily through a self-training paradigm. Despite its potential, SFOD frequently suffers from training collapse caused by noisy pseudo-labels, especially in remote sensing imagery with dense objects and complex backgrounds. 
Considering that limited target domain annotations are often feasible in practice, we propose a Vision foundation-Guided DEtection TRansformer (VG-DETR), built upon a semi-supervised framework for SFOD in remote sensing images. VG-DETR integrates a Vision Foundation Model (VFM) into the training pipeline in a "free lunch" manner, leveraging a small amount of labeled target data to mitigate pseudo-label noise while improving the detector's feature-extraction capability. Specifically, we introduce a VFM-guided pseudo-label mining strategy that leverages the VFM's semantic priors to further assess the reliability of the generated pseudo-labels. By recovering potentially correct predictions from low-confidence outputs, our strategy improves pseudo-label quality and quantity. In addition, a dual-level VFM-guided alignment method is proposed, which aligns detector features with VFM embeddings at both the instance and image levels. Through contrastive learning among fine-grained prototypes and similarity matching between feature maps, this dual-level alignment further enhances the robustness of feature representations against domain gaps. Extensive experiments demonstrate that VG-DETR achieves superior performance in source-free remote sensing detection tasks.

[78] Better Supervised Fine-tuning for VQA: Integer-Only Loss

Baihong Qian,Haotian Fan,Wenjie Liao,Yunqiu Wang,Tao Li,Junhui Cui

Main category: cs.CV

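TL;DR: This paper proposes IOVQA, a fine-tuning method that improves vision-language models on video quality assessment by constraining model outputs to integers and introducing a target-mask strategy, significantly improving accuracy and consistency.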

Details Motivation: Existing visual content assessment methods suffer from imprecise results and inefficient loss calculation, which limit the model's focus on key evaluation indicators. Method: The paper builds IOVQA, a fine-tuning method that constrains outputs to integers in the range [10,50], computes the loss with a target-mask strategy, and fine-tunes the Qwen2.5-VL model. Result: Experiments show that the method significantly improves the model's accuracy and consistency on the VQA task and ranks 3rd in the VQualA 2025 GenAI-Bench AIGC Video Quality Assessment Challenge. Conclusion: Fine-tuning with integer-only labels effectively improves vision-language models in quantitative evaluation scenarios and offers a new idea for related tasks. Abstract: With the rapid advancement of vision-language models (VLMs), their ability to assess visual content based on specific criteria and dimensions has become increasingly critical for applications such as video-theme consistency assessment and visual quality scoring. However, existing methods often suffer from imprecise results and inefficient loss calculation, which limit the focus of the model on key evaluation indicators. To address this, we propose IOVQA (Integer-only VQA), a novel fine-tuning approach tailored for VLMs to enhance their performance in video quality assessment tasks. The key innovation of IOVQA lies in its label construction and its targeted loss calculation mechanism. Specifically, during dataset curation, we constrain the model's output to integers within the range of [10,50], ensuring numerical stability, and convert decimal Overall_MOS values to integers before using them as labels. We also introduce a target-mask strategy: when computing the loss, only the first two-digit integer of the label is unmasked, forcing the model to learn the critical components of the numerical evaluation. After fine-tuning the Qwen2.5-VL model using the constructed dataset, experimental results demonstrate that the proposed method significantly improves the model's accuracy and consistency in the VQA task, ranking 3rd in the VQualA 2025 GenAI-Bench AIGC Video Quality Assessment Challenge -- Track I. Our work highlights the effectiveness of merely leaving integer labels during fine-tuning, providing an effective idea for optimizing VLMs in quantitative evaluation scenarios.
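
The label construction and target mask described above can be sketched in a few lines. The sketch assumes that Overall_MOS is a decimal on a 1-5 scale mapped to [10,50] by multiplying by 10 and rounding, and it treats single characters as stand-ins for tokens. Neither detail is stated in the abstract, and all function names are hypothetical.

```python
import math


def mos_to_label(mos):
    """Map a decimal score to an integer label clamped to [10, 50].

    Assumption: the score lives on a 1-5 scale, so multiplying by 10 gives the range.
    """
    value = math.floor(mos * 10 + 0.5)  # round half up
    return max(10, min(50, value))


def target_mask(tokens):
    """Return a 0/1 mask that unmasks only the first two-digit integer.

    Tokens are single characters here, a simplification of a real tokenizer.
    """
    mask = [0] * len(tokens)
    for i in range(len(tokens) - 1):
        if tokens[i].isdigit() and tokens[i + 1].isdigit():
            mask[i] = mask[i + 1] = 1
            break
    return mask


def masked_mean_loss(token_losses, mask):
    """Average per-token losses over unmasked positions only."""
    kept = [loss for loss, m in zip(token_losses, mask) if m == 1]
    return sum(kept) / len(kept) if kept else 0.0


label = mos_to_label(3.74)
tokens = list(f"score: {label}.")
print(label, target_mask(tokens))
```

With this mask, gradient signal comes only from the two digits of the score, so surrounding text in the response does not dilute the loss.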

[79] Exploring the Tradeoff Between Diversity and Discrimination for Continuous Category Discovery

Ruobing Jiang,Yang Liu,Haobing Liu,Yanwei Yu,Chunyang Wang

Main category: cs.CV

TL;DR: This paper proposes IDOD, a novel method for continuous category discovery that improves diversity, reduces error accumulation, and prevents forgetting efficiently, outperforming existing methods on fine-grained datasets.

Details Motivation: Existing continuous category discovery (CCD) methods struggle to balance novel class discovery with classification, often suffer from error accumulation, and rely on memory-intensive techniques like knowledge distillation and data replay. This work aims to address these limitations. Method: IDOD incorporates three modules: independent enrichment of diversity using contrastive loss, joint discovery of novelty transforming multi-stage discovery into single-stage, and continuous increment by orthogonality generating orthogonal prototypes and using representation replay. Result: Experimental results show that IDOD outperforms state-of-the-art methods on challenging fine-grained datasets in continuous category discovery. Conclusion: The proposed IDOD method effectively addresses the challenges in continuous category discovery by improving diversity, reducing error accumulation, and preventing forgetting with lower space overhead, demonstrating superior performance on fine-grained datasets. Abstract: Continuous category discovery (CCD) aims to automatically discover novel categories in continuously arriving unlabeled data. This is a challenging problem because neither the number of categories nor any labels are known for the newly arrived data, and catastrophic forgetting must also be mitigated. Most CCD methods cannot handle the contradiction between novel class discovery and classification well. They are also prone to accumulating errors in the process of gradually discovering novel classes. Moreover, most of them use knowledge distillation and data replay to prevent forgetting, occupying more storage space. To address these limitations, we propose Independence-based Diversity and Orthogonality-based Discrimination (IDOD). IDOD mainly comprises three modules: independent enrichment of diversity, joint discovery of novelty, and continuous increment by orthogonality. 
In independent enrichment, the backbone is trained separately using contrastive loss to avoid it focusing only on features for classification. Joint discovery transforms multi-stage novel class discovery into single-stage, reducing the impact of error accumulation. The continuous increment by orthogonality module generates mutually orthogonal prototypes for classification and prevents forgetting with lower space overhead via representative representation replay. Experimental results show that on challenging fine-grained datasets, our method outperforms the state-of-the-art methods.
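
The abstract does not say how the mutually orthogonal prototypes are produced. One standard way to obtain such a set, shown here purely as an illustration, is Gram-Schmidt orthonormalization of random vectors. The function name, the seed, and the dimensions are assumptions, and this is not claimed to be the procedure IDOD uses.

```python
import math
import random


def orthogonal_prototypes(num_classes, dim, seed=0):
    """Build unit-length, mutually orthogonal vectors via modified Gram-Schmidt."""
    if num_classes > dim:
        raise ValueError("cannot have more orthogonal prototypes than dimensions")
    rng = random.Random(seed)
    prototypes = []
    while len(prototypes) < num_classes:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the component of v along every prototype found so far.
        for p in prototypes:
            coeff = sum(a * b for a, b in zip(v, p))
            v = [a - coeff * b for a, b in zip(v, p)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip degenerate draws
            prototypes.append([a / norm for a in v])
    return prototypes


protos = orthogonal_prototypes(num_classes=4, dim=8)
gram = [[sum(a * b for a, b in zip(p, q)) for q in protos] for p in protos]
print([[round(x, 6) for x in row] for row in gram])
```

Because the prototypes for new classes are orthogonal to the old ones, adding a class does not move existing prototypes, which is one intuition for why orthogonality can help limit forgetting.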

[80] Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning

Yumiao Zhao,Bo Jiang,Yuhe Ding,Xiao Wang,Jin Tang,Bin Luo

Main category: cs.CV

TL;DR: LatHAdapter is a novel adapter for fine-tuning Vision-Language Models that better aligns visual and textual representations by exploiting the latent semantic hierarchy in a hyperbolic space, leading to improved performance on few-shot classification tasks.

Details Motivation: Existing adapters fail to capture the inherent one-to-many associations between categories and image samples and struggle to establish accurate associations for unknown categories. Method: LatHAdapter exploits the latent semantic hierarchy of downstream training data and employs hierarchical regularization to learn this hierarchy in a hyperbolic space. Result: LatHAdapter improves the alignment of visual and textual representations, capturing one-to-many associations and enhancing performance on few-shot classification tasks. Conclusion: The proposed LatHAdapter consistently outperforms other fine-tuning approaches, especially in adapting known classes and generalizing to unknown classes. Abstract: Adapter-based approaches have garnered attention for fine-tuning pre-trained Vision-Language Models (VLMs) on few-shot classification tasks. These methods strive to develop a lightweight module that better aligns visual and (category) textual representations, thereby enhancing performance on downstream few-shot learning tasks. However, existing adapters generally learn/align (category) textual-visual modalities via explicit spatial proximity in the underlying embedding space, which i) fails to capture the inherent one-to-many associations between categories and image samples and ii) struggles to establish accurate associations between the unknown categories and images. To address these issues, inspired by recent works on hyperbolic learning, we develop a novel Latent Hierarchical Adapter (LatHAdapter) for fine-tuning VLMs on downstream few-shot classification tasks. The core of LatHAdapter is to exploit the latent semantic hierarchy of downstream training data and employ it to provide richer, fine-grained guidance for the adapter learning process. Specifically, LatHAdapter first introduces some learnable `attribute' prompts as the bridge to align categories and images. 
Then, it projects the categories, attribute prompts, and images within each batch in a hyperbolic space, and employs hierarchical regularization to learn the latent semantic hierarchy of them, thereby fully modeling the inherent one-to-many associations among categories, learnable attributes, and image samples. Extensive experiments on four challenging few-shot tasks show that the proposed LatHAdapter consistently outperforms many other fine-tuning approaches, particularly in adapting known classes and generalizing to unknown classes.
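
Hyperbolic embeddings suit hierarchies because distances grow rapidly toward the boundary of the space, leaving room for many children around one parent. The sketch below assumes the Poincaré ball model, which the abstract does not specify, and computes the standard geodesic distance between two points. The sample points are invented.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# A hypothetical category near the origin and two image samples near the boundary.
category = [0.0, 0.0]
image_a = [0.5, 0.0]
image_b = [0.0, 0.9]
print(poincare_distance(category, image_a))
print(poincare_distance(image_a, image_b))
```

Placing general concepts near the origin and specific samples near the boundary is a common convention in hyperbolic learning, and it gives a natural geometric reading of one-to-many associations.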

[81] Versatile Video Tokenization with Generative 2D Gaussian Splatting

Zhenghao Chen,Zicong Chen,Lei Liu,Yiming Wu,Dong Xu

Main category: cs.CV

TL;DR: This paper introduces the Gaussian Video Transformer (GVT), a novel video tokenization method that improves spatial adaptability and temporal versatility, leading to superior performance in video reconstruction, action recognition, and compression.

Details Motivation: Existing video tokenization methods often suffer from spatial over-encoding in low-information regions and challenges in reducing temporal redundancy due to the inability to distinguish between static and dynamic content. This work aims to address these limitations by proposing a more adaptable and versatile tokenization approach. Method: GVT utilizes a generative 2D Gaussian Splatting (2DGS) strategy with a Spatio-Temporal Gaussian Embedding (STGE) mechanism for latent rigid feature extraction and representation. Additionally, a Gaussian Set Partitioning (GSP) strategy separates static and dynamic content to enhance temporal versatility. Result: GVT achieves state-of-the-art performance in video reconstruction, outperforms MAGVIT-v2 in action recognition, and delivers comparable performance in video compression on datasets like UCF101, Kinetics, and DAVIS. Conclusion: The proposed Gaussian Video Transformer (GVT) demonstrates superior performance in video reconstruction, action recognition, and compression, establishing itself as a versatile and effective video tokenization method. Abstract: Video tokenization procedure is critical for a wide range of video processing tasks. Most existing approaches directly transform video into fixed-grid and patch-wise tokens, which exhibit limited versatility. Spatially, uniformly allocating a fixed number of tokens often leads to over-encoding in low-information regions. Temporally, reducing redundancy remains challenging without explicitly distinguishing between static and dynamic content. In this work, we propose the Gaussian Video Transformer (GVT), a versatile video tokenizer built upon a generative 2D Gaussian Splatting (2DGS) strategy. We first extract latent rigid features from a video clip and represent them with a set of 2D Gaussians generated by our proposed Spatio-Temporal Gaussian Embedding (STGE) mechanism in a feed-forward manner. 
Such generative 2D Gaussians not only enhance spatial adaptability by assigning higher (resp., lower) rendering weights to regions with higher (resp., lower) information content during rasterization, but also improve generalization by avoiding per-video optimization. To enhance the temporal versatility, we introduce a Gaussian Set Partitioning (GSP) strategy that separates the 2D Gaussians into static and dynamic sets, which explicitly model static content shared across different time-steps and dynamic content specific to each time-step, enabling a compact representation. We primarily evaluate GVT on video reconstruction, while also assessing its performance on action recognition and compression using the UCF101, Kinetics, and DAVIS datasets. Extensive experiments demonstrate that GVT achieves state-of-the-art video reconstruction quality, outperforms the baseline MAGVIT-v2 in action recognition, and delivers comparable compression performance.

[82] CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector

Abhinav Kumar,Yuliang Guo,Zhihao Zhang,Xinyu Huang,Liu Ren,Xiaoming Liu

Main category: cs.CV

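TL;DR: This paper proposes CHARM3R, which improves monocular 3D object detection across different camera heights and achieves substantial performance gains.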

Details Motivation: Monocular 3D object detection performs poorly when the camera height changes, so its robustness needs to be improved. Method: The paper shows both mathematically and empirically how camera height changes affect depth estimation, and proposes CHARM3R, which averages the two depth estimates within the model. Result: CHARM3R improves performance by more than 45% on the CARLA dataset and performs well under camera height changes. Conclusion: CHARM3R effectively improves the generalization of monocular 3D object detection across camera heights and achieves SoTA performance. Abstract: Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plücker embeddings, image transformations or data augmentation. This paper takes a step towards this understudied problem by first investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R improves generalization to unseen camera heights by more than 45%, achieving SoTA performance on the CARLA dataset. Code and models: https://github.com/abhi1kumar/CHARM3R
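
The reason averaging helps can be seen with a toy calculation: if one estimator errs low and the other errs high under the same camera height change, their mean has a smaller error than either. The numbers below are invented for illustration, and the equal weighting is an assumption about what "averages" means in the abstract.

```python
def fuse_depth(regressed_depth, ground_depth):
    """Average a regressed depth and a ground-based depth (equal weights assumed)."""
    return 0.5 * (regressed_depth + ground_depth)


# Hypothetical object at 20 m, seen from an unseen camera height.
true_depth = 20.0
regressed = 18.0   # negative error trend (underestimates)
ground = 22.5      # positive error trend (overestimates)

fused = fuse_depth(regressed, ground)
print(fused, abs(fused - true_depth))
```

Here the individual errors are 2.0 m and 2.5 m, while the fused error is 0.25 m, because errors of opposite sign partly cancel.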

[83] Generating Dialogues from Egocentric Instructional Videos for Task Assistance: Dataset, Method and Benchmark

Lavisha Aggarwal,Vikas Bahirwani,Lin Li,Andrea Colaco

Main category: cs.CV

TL;DR: This paper introduces HowToDIV, a new dataset for task-assistance dialogues created by transforming instructional videos into two-person conversations using AI.

Details Motivation: The motivation is to address the scarcity of dialogue-video datasets for real-world task assistance, especially for complex, multi-step tasks. Method: The paper proposes an approach that uses large language models to automatically transform single-person instructional videos into two-person dialogues, creating the HowToDIV dataset. Result: The result is the creation of the HowToDIV dataset, which includes 507 conversations, 6636 question-answer pairs, and 24 hours of video clips across various tasks. Conclusion: The paper concludes that the HowToDIV dataset provides a valuable resource for future research on procedural-task assistance using dialogues. Abstract: Many everyday tasks, ranging from fixing appliances and cooking recipes to car maintenance, require expert knowledge, especially when tasks are complex and multi-step. Despite growing interest in AI agents, there is a scarcity of dialogue-video datasets grounded in real-world task assistance. In this paper, we propose a simple yet effective approach that transforms single-person instructional videos into task-guidance two-person dialogues, aligned with fine-grained steps and video clips. Our fully automatic approach, powered by large language models, offers an efficient alternative to the substantial cost and effort required for human-assisted data collection. Using this technique, we build HowToDIV, a large-scale dataset containing 507 conversations, 6636 question-answer pairs and 24 hours of video clips across diverse tasks in cooking, mechanics, and planting. Each session includes a multi-turn conversation in which an expert teaches a novice user how to perform a task step by step, while observing the user's surroundings through a wearable device equipped with a camera and microphone. We establish the baseline benchmark performance on the HowToDIV dataset using the Gemma-3 model for future research on this new task of dialogues for procedural-task assistance.

[84] UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

Jiajin Guan,Haibo Mei,Bonan Zhang,Dan Liu,Yuanshuang Fu,Yue Zhang

Main category: cs.CV

TL;DR: This paper proposes UAV-VL-R1, a lightweight vision-language model that addresses the challenges of UAV aerial imagery tasks.

Details Motivation: UAV aerial imagery tasks involve high resolution, complex spatial semantics, and strict real-time constraints, which limit the applicability of general-purpose vision-language models (VLMs). Method: UAV-VL-R1 is trained with a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL), using the group relative policy optimization (GRPO) algorithm. Result: UAV-VL-R1 achieves 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and outperforms its 72B-scale variant on multiple tasks. Conclusion: UAV-VL-R1 achieves 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant on multiple tasks. Abstract: Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features high resolution, complex spatial semantics, and strict real-time constraints. These challenges limit the applicability of general-purpose VLMs to structured aerial reasoning tasks. To address these challenges, we propose UAV-VL-R1, a lightweight VLM explicitly designed for aerial visual reasoning. It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm to promote structured and interpretable reasoning through rule-guided rewards and intra-group policy alignment. To support model training and evaluation, we introduce a high-resolution visual question answering dataset named HRVQA-VL, which consists of 50,019 annotated samples covering eight UAV-relevant reasoning tasks, including object counting, transportation recognition, and spatial scene inference. Experimental results show that UAV-VL-R1 achieves a 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant, which is 36x larger, on multiple tasks. Ablation studies reveal that while SFT improves semantic alignment, it may reduce reasoning diversity in mathematical tasks. GRPO-based RL compensates for this limitation by enhancing logical flexibility and the robustness of inference. 
Additionally, UAV-VL-R1 requires only 3.9GB of memory under FP16 inference and can be quantized to 2.5GB with INT8, supporting real-time deployment on resource-constrained UAV platforms.
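The group-relative part of GRPO mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly described formulation in which each sampled answer's reward is normalized against the mean and standard deviation of its own group; the reward values, the use of the population standard deviation, and the function name are illustrative assumptions, not details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its own group.

    Assumption: population standard deviation is used; some
    implementations use the sample standard deviation instead.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rule-based rewards for four answers sampled for one question
# (1.0 = answer passed the rule check, 0.0 = it did not).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat their group's average receive a positive advantage and the others a negative one, so no separate value network is needed to provide a baseline.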

[85] A Coarse-to-Fine Human Pose Estimation Method based on Two-stage Distillation and Progressive Graph Neural Network

Zhangjian Ji,Wenjin Zhang,Shaotong Qiao,Kai Feng,Yuhua Qian

Main category: cs.CV

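TL;DR: This paper proposes a novel coarse-to-fine two-stage knowledge distillation framework for accurate, robust, and lightweight human pose estimation; experiments show strong performance on the COCO keypoint and CrowdPose datasets.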

Details Motivation: Existing state-of-the-art pose estimation methods require heavy computational resources to obtain accurate predictions. Transferring pose knowledge from a powerful teacher model to a less-parameterized student model through knowledge distillation is a feasible solution, but the traditional knowledge distillation framework does not fully exploit the contextual information among human joints. Method: A novel coarse-to-fine two-stage knowledge distillation framework is proposed for human pose estimation. The first stage introduces a human joints structure loss to mine structural information among human joints; the second stage uses an Image-Guided Progressive Graph Convolutional Network (IGP-GCN) to refine the initial pose, with training supervised progressively by the final output pose of the teacher model. Result: Extensive experiments on the COCO keypoint and CrowdPose datasets show that the proposed method outperforms many existing state-of-the-art pose estimation methods, with a more significant performance gain on the more complex CrowdPose dataset. Conclusion: Experimental results show that the proposed two-stage knowledge distillation framework outperforms existing state-of-the-art methods on both the COCO keypoint and CrowdPose datasets, with a more significant gain on the more complex CrowdPose dataset. Abstract: Human pose estimation has been widely applied in human-centric understanding and generation, but most existing state-of-the-art human pose estimation methods require heavy computational resources for accurate predictions. In order to obtain an accurate, robust yet lightweight human pose estimator, one feasible way is to transfer pose knowledge from a powerful teacher model to a less-parameterized student model by knowledge distillation. However, the traditional knowledge distillation framework does not fully explore the contextual information among human joints. Thus, in this paper, we propose a novel coarse-to-fine two-stage knowledge distillation framework for human pose estimation. In the first-stage distillation, we introduce the human joints structure loss to mine the structural information among human joints so as to transfer high-level semantic knowledge from the teacher model to the student model. In the second-stage distillation, we utilize an Image-Guided Progressive Graph Convolutional Network (IGP-GCN) to refine the initial human pose obtained from the first-stage distillation and supervise the training of the IGP-GCN in a progressive way by the final output pose of the teacher model. Extensive experiments on the benchmark datasets COCO keypoint and CrowdPose show that our proposed method performs favorably against many existing state-of-the-art human pose estimation methods; the performance improvement of our model is especially significant on the more complex CrowdPose dataset.

[86] A CLIP-based Uncertainty Modal Modeling (UMM) Framework for Pedestrian Re-Identification in Autonomous Driving

Jialin Li,Shuqi Wu,Ning Wang

Main category: cs.CV

TL;DR: The paper proposes a lightweight framework called Uncertainty Modal Modeling (UMM) to address challenges in pedestrian re-identification for autonomous driving, particularly under conditions of uncertain or missing input modalities.

Details Motivation: The motivation stems from the need for real-time pedestrian identification in autonomous driving scenarios, where conventional ReID approaches face challenges due to uncertain or missing input modalities and the computational overhead of large-scale pre-trained models. Method: The proposed lightweight Uncertainty Modal Modeling (UMM) framework incorporates a multimodal token mapper, synthetic modality augmentation strategy, and cross-modal cue interactive learner. It also utilizes CLIP's vision-language alignment ability for efficient multimodal input fusion without extensive fine-tuning. Result: Experimental results demonstrate that UMM performs well in terms of robustness, generalization, and computational efficiency under uncertain modality conditions. Conclusion: The UMM framework offers a scalable and practical solution for pedestrian re-identification in autonomous driving scenarios by achieving strong robustness, generalization, and computational efficiency under uncertain modality conditions. Abstract: Re-Identification (ReID) is a critical technology in intelligent perception systems, especially within autonomous driving, where onboard cameras must identify pedestrians across views and time in real-time to support safe navigation and trajectory prediction. However, the presence of uncertain or missing input modalities--such as RGB, infrared, sketches, or textual descriptions--poses significant challenges to conventional ReID approaches. While large-scale pre-trained models offer strong multimodal semantic modeling capabilities, their computational overhead limits practical deployment in resource-constrained environments. To address these challenges, we propose a lightweight Uncertainty Modal Modeling (UMM) framework, which integrates a multimodal token mapper, synthetic modality augmentation strategy, and cross-modal cue interactive learner. 
Together, these components enable unified feature representation, mitigate the impact of missing modalities, and extract complementary information across different data types. Additionally, UMM leverages CLIP's vision-language alignment ability to fuse multimodal inputs efficiently without extensive finetuning. Experimental results demonstrate that UMM achieves strong robustness, generalization, and computational efficiency under uncertain modality conditions, offering a scalable and practical solution for pedestrian re-identification in autonomous driving scenarios.

[87] FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation

MengChao Wang,Qiang Wang,Fan Jiang,Mu Xu

Main category: cs.CV

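TL;DR: This work proposes the TLPO framework and the Talking-Critic model, which decouple and fuse multidimensional preferences to significantly improve the alignment of audio-driven portrait animation with human preferences in motion naturalness, lip-sync accuracy, and visual quality.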

Details Motivation: Existing audio-driven portrait animation methods struggle to satisfy fine-grained human preferences across multiple dimensions (such as motion naturalness, lip-sync accuracy, and visual quality), mainly because of conflicts among optimization objectives and the lack of large-scale, high-quality datasets with multidimensional preference annotations. Method: The TLPO (Timestep-Layer adaptive multi-expert Preference Optimization) framework is proposed; it decouples preferences into specialized expert modules and fuses them across timesteps and network layers to optimize generated videos along multiple dimensions. Result: Experiments show that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings; TLPO clearly improves over baseline models in lip-sync accuracy, motion naturalness, and visual quality, and performs well in both qualitative and quantitative evaluations. Conclusion: The TLPO framework effectively aligns diffusion-based portrait animation with fine-grained, multidimensional preferences, and the introduction of the Talking-Critic reward model and the Talking-NSQ dataset also significantly improves alignment with human preferences. Abstract: Recent advances in audio-driven portrait animation have demonstrated impressive capabilities. However, existing methods struggle to align with fine-grained human preferences across multiple dimensions, such as motion naturalness, lip-sync accuracy, and visual quality. This is due to the difficulty of optimizing among competing preference objectives, which often conflict with one another, and the scarcity of large-scale, high-quality datasets with multidimensional preference annotations. To address these, we first introduce Talking-Critic, a multimodal reward model that learns human-aligned reward functions to quantify how well generated videos satisfy multidimensional expectations. Leveraging this model, we curate Talking-NSQ, a large-scale multidimensional human preference dataset containing 410K preference pairs. Finally, we propose Timestep-Layer adaptive multi-expert Preference Optimization (TLPO), a novel framework for aligning diffusion-based portrait animation models with fine-grained, multidimensional preferences. TLPO decouples preferences into specialized expert modules, which are then fused across timesteps and network layers, enabling comprehensive, fine-grained enhancement across all dimensions without mutual interference. Experiments demonstrate that Talking-Critic significantly outperforms existing methods in aligning with human preference ratings. 
Meanwhile, TLPO achieves substantial improvements over baseline models in lip-sync accuracy, motion naturalness, and visual quality, exhibiting superior performance in both qualitative and quantitative evaluations. Our project page: https://fantasy-amap.github.io/fantasy-talking2/

[88] Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception

Junjie Wang,Keyu Chen,Yulin Li,Bin Chen,Hengshuang Zhao,Xiaojuan Qi,Zhuotao Tian

Main category: cs.CV

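TL;DR: This paper proposes DeCLIP, a new framework that significantly improves performance on open-vocabulary dense perception tasks by improving the local feature representations of CLIP.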

Details Motivation: Dense visual perception tasks are constrained by their reliance on predefined categories, and vision-language models such as CLIP have limitations in local feature representation that lead to poor results when applied to dense perception. Method: A new framework named DeCLIP is proposed. It enhances the context features by jointly distilling semantic correlations from vision foundation models and object integrity cues from diffusion models; in parallel, it constrains the content features with region correlations to improve local discriminability. Result: DeCLIP consistently achieves state-of-the-art performance across a broad range of vision tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Conclusion: By decoupling the self-attention module and enhancing the context and content features, DeCLIP establishes a solid foundation for open-vocabulary dense perception and achieves state-of-the-art performance on multiple tasks. Abstract: Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP

[89] Controlling Multimodal LLMs via Reward-guided Decoding

Oscar Mañas,Pierluca D'Oro,Koustuv Sinha,Adriana Romero-Soriano,Michal Drozdzal,Aishwarya Agrawal

Main category: cs.CV

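TL;DR: This paper proposes a reward-guided decoding method that uses two reward models to control object precision and recall in the model's output, enabling on-the-fly control of the inference process of multimodal large language models and achieving significant gains in hallucination reduction.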

Details Motivation: As multimodal large language models become widely used, adapting them to diverse user needs is becoming increasingly important. Method: Two separate reward models are built to control object precision and recall in the output, and the MLLM's inference process is controlled on the fly by adjusting the relative importance of the reward functions and the breadth of the search during decoding. Result: Experiments show that the method provides significant controllability on standard object hallucination benchmarks and outperforms existing methods at reducing hallucinations. Conclusion: This paper proposes a reward-guided decoding method for adapting multimodal large language models (MLLMs) and demonstrates its effectiveness in improving visual grounding. Abstract: As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision for recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while consistently outperforming existing hallucination mitigation methods.
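The two control knobs described in this abstract (reward weighting and search breadth) can be sketched on a toy example. The snippet below is only an illustration under stated assumptions: the paper uses learned reward models, whereas the two reward functions here are hypothetical stand-ins computed from a known object set, and `select_candidate`, `w`, and `k` are names invented for this sketch.

```python
# Hypothetical ground-truth objects in the image (a learned reward model
# would not have access to this; it is used here only to fake the rewards).
IMAGE_OBJECTS = {"dog", "ball", "tree"}

def precision_reward(mentioned):
    """Fraction of mentioned objects that are actually in the image."""
    return sum(o in IMAGE_OBJECTS for o in mentioned) / len(mentioned)

def recall_reward(mentioned):
    """Fraction of image objects that the candidate mentions."""
    return sum(o in mentioned for o in IMAGE_OBJECTS) / len(IMAGE_OBJECTS)

def select_candidate(candidates, w, k):
    """Pick the best of the first k candidates under a weighted reward.

    w in [0, 1] trades precision (w near 1) against recall (w near 0);
    k is the search breadth, i.e. how many candidates are scored.
    """
    def score(c):
        return w * precision_reward(c) + (1.0 - w) * recall_reward(c)
    return max(candidates[:k], key=score)

# Each candidate is the tuple of objects mentioned by one sampled caption.
candidates = [
    ("dog",),
    ("dog", "ball", "cat", "tree"),
    ("dog", "ball"),
]
precise = select_candidate(candidates, w=0.9, k=3)
thorough = select_candidate(candidates, w=0.1, k=3)
narrow = select_candidate(candidates, w=0.1, k=1)
```

With a high precision weight the sketch prefers the caption without the hallucinated "cat"; with a high recall weight it accepts that hallucination in exchange for covering every object, and with `k=1` no search happens at all.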

[90] Vision-Language Models display a strong gender bias

Aiswarya Konavoor,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat

Main category: cs.CV

TL;DR: Vision-language models may encode gender-linked associations, and a new framework is developed to evaluate this bias.

Details Motivation: To investigate whether vision-language models encode and amplify social stereotypes, particularly gender-linked associations, which are not evident from standard accuracy metrics. Method: The study assembles a dataset of face photographs and statements related to various categories of labor. It computes unit-norm image and text embeddings, then calculates association scores based on cosine similarity. Result: The study generates a map of gender associations in a contrastive vision-language space, accompanied by uncertainty measures and a robust gender bias evaluation framework. Conclusion: The study concludes that vision-language models can exhibit gender-linked associations, highlighting the need for robust gender bias evaluation frameworks. Abstract: Vision-language models (VLM) align images and text in a shared representation space that is useful for retrieval and zero-shot transfer. Yet, this alignment can encode and amplify social stereotypes in subtle ways that are not obvious from standard accuracy metrics. In this study, we test whether the contrastive vision-language encoder exhibits gender-linked associations when it places embeddings of face images near embeddings of short phrases that describe occupations and activities. We assemble a dataset of 220 face photographs split by perceived binary gender and a set of 150 unique statements distributed across six categories covering emotional labor, cognitive labor, domestic labor, technical labor, professional roles, and physical labor. We compute unit-norm image embeddings for every face and unit-norm text embeddings for every statement, then define a statement-level association score as the difference between the mean cosine similarity to the male set and the mean cosine similarity to the female set, where positive values indicate stronger association with the male set and negative values indicate stronger association with the female set. 
We attach bootstrap confidence intervals by resampling images within each gender group, aggregate by category with a separate bootstrap over statements, and run a label-swap null model that estimates the level of mean absolute association we would expect if no gender structure were present. The outcome is a statement-wise and category-wise map of gender associations in a contrastive vision-language space, accompanied by uncertainty, simple sanity checks, and a robust gender bias evaluation framework.
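The statement-level association score and its bootstrap interval can be sketched in a few lines. This is a minimal illustration of the definition given in the abstract (mean cosine similarity to the male set minus mean cosine similarity to the female set, with images resampled within each group); the 2-D toy embeddings, the percentile interval, and all function names are assumptions made for the example rather than the authors' code.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit norm so dot products equal cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def association_score(text_emb, male_embs, female_embs):
    """Mean cosine to the male set minus mean cosine to the female set."""
    mean_m = sum(dot(text_emb, e) for e in male_embs) / len(male_embs)
    mean_f = sum(dot(text_emb, e) for e in female_embs) / len(female_embs)
    return mean_m - mean_f

def bootstrap_ci(text_emb, male_embs, female_embs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample images with replacement within each group."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        m = rng.choices(male_embs, k=len(male_embs))
        f = rng.choices(female_embs, k=len(female_embs))
        scores.append(association_score(text_emb, m, f))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy unit-norm embeddings standing in for encoder outputs.
text = normalize([1.0, 0.0])
male = [normalize([1.0, 0.1]), normalize([0.9, 0.3])]
female = [normalize([0.2, 1.0]), normalize([0.1, 0.9])]

score = association_score(text, male, female)
lo, hi = bootstrap_ci(text, male, female)
```

A positive score with an interval that excludes zero would indicate a stronger association with the male set for that statement; the label-swap null model from the abstract is not reproduced here.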

[91] Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds

Pei He,Lingling Li,Licheng Jiao,Ronghua Shang,Fang Liu,Shuang Wang,Xu Liu,Wenping Ma

Main category: cs.CV

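TL;DR: This paper proposes a category-level geometry learning framework for domain-generalized 3D semantic segmentation, which uses category-level geometry embedding and geometric consistency learning to improve generalization to unseen environments.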

Details Motivation: Domain generalization in 3D segmentation is a key challenge. Current methods mitigate domain shift by augmenting the data distribution of point clouds, but ignore category-level distribution and alignment. Method: Category-level Geometry Embedding (CGE) is proposed to perceive the fine-grained geometric properties of point cloud features and construct the geometric properties of each class; in addition, Geometric Consistent Learning (GCL) is proposed to simulate the latent 3D distribution and align the category-level geometric embeddings. Result: Experimental results verify the effectiveness of the proposed method, which achieves very competitive segmentation accuracy compared with state-of-the-art domain-generalized point cloud methods. Conclusion: The proposed category-level geometry learning framework effectively improves the generalization of 3D semantic segmentation models to unseen environments. Abstract: Domain generalization in 3D segmentation is a critical challenge in deploying models to unseen environments. Current methods mitigate the domain shift by augmenting the data distribution of point clouds. However, the model learns global geometric patterns in point clouds while ignoring the category-level distribution and alignment. In this paper, a category-level geometry learning framework is proposed to explore the domain-invariant geometric features for domain generalized 3D semantic segmentation. Specifically, Category-level Geometry Embedding (CGE) is proposed to perceive the fine-grained geometric properties of point cloud features, which constructs the geometric properties of each class and couples geometric embedding to semantic learning. Secondly, Geometric Consistent Learning (GCL) is proposed to simulate the latent 3D distribution and align the category-level geometric embeddings, allowing the model to focus on the geometric invariant information to improve generalization. Experimental results verify the effectiveness of the proposed method, which has very competitive segmentation accuracy compared with the state-of-the-art domain generalized point cloud methods.

[92] Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering

Jun Li,Kai Li,Shaoguo Liu,Tingting Gao

Main category: cs.CV

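TL;DR: This paper proposes PMTFR, a new CIR framework that improves retrieval performance without additional training and is suited to supervised CIR tasks.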

Details Motivation: Existing methods struggle to achieve satisfactory results in supervised CIR, and the application of CoT techniques to CIR remains limited; this paper aims to address these problems. Method: A framework comprising a Pyramid Matching Model and Training-Free Refinement is proposed; it strengthens the understanding of visual information through the Pyramid Patcher module and representations extracted from CoT data. Result: PMTFR performs strongly on CIR benchmarks, surpassing existing state-of-the-art methods. Conclusion: PMTFR outperforms existing techniques on supervised CIR tasks without requiring additional training. Abstract: Composed Image Retrieval (CIR) presents a significant challenge as it requires jointly understanding a reference image and a modified textual instruction to find relevant target images. Some existing methods attempt to use a two-stage approach to further refine retrieval results. However, this often requires additional training of a ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application in CIR tasks remains limited -- compressing visual information into text or relying on elaborate prompt designs. Besides, existing works only utilize it for zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we proposed a framework that includes the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a simple but effective module called Pyramid Patcher, we enhanced the Pyramid Matching Model's understanding of visual information at different granularities. Inspired by representation engineering, we extracted representations from CoT data and injected them into the LVLMs. This approach allowed us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.

[93] Probing the Representational Power of Sparse Autoencoders in Vision Models

Matthew Lyle Olson,Musashi Hinck,Neale Ratzlaff,Changbai Li,Phillip Howard,Vasudev Lal,Shao-Yen Tseng

Main category: cs.CV

TL;DR: This paper evaluates Sparse Autoencoders (SAEs) in vision models, showing they produce interpretable features, improve generalization, and enable controllable generation across various architectures like vision embedding models, multi-modal LMMs, and diffusion models.

Details Motivation: Sparse Autoencoders (SAEs) are popular for interpreting the hidden states of large language models, but remain understudied in the visual domain. This work aims to evaluate their effectiveness for vision models. Method: The authors conducted an extensive evaluation of the representational power of SAEs for vision models using a broad range of image-based tasks across three vision model architectures: vision embedding models, multi-modal LMMs, and diffusion models. Result: Experimental results show that SAE features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation. Specifically, SAEs enable OOD detection, recover ontological structures, allow semantic steering via text encoder manipulation, and reveal shared representations across vision and language modalities. Conclusion: This study provides a foundation for evaluating SAEs in vision models, highlighting their potential to improve interpretability, generalization, and steerability in the visual domain. Abstract: Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from the high-dimensional internal representations of LLMs. Despite their popularity with language models, SAEs remain understudied in the visual domain. In this work, we provide an extensive evaluation of the representational power of SAEs for vision models using a broad range of image-based tasks. Our experimental results demonstrate that SAE features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation across three vision model architectures: vision embedding models, multi-modal LMMs and diffusion models. 
In vision embedding models, we find that learned SAE features can be used for OOD detection and provide evidence that they recover the ontological structure of the underlying model. For diffusion models, we demonstrate that SAEs enable semantic steering through text encoder manipulation and develop an automated pipeline for discovering human-interpretable attributes. Finally, we conduct exploratory experiments on multi-modal LLMs, finding evidence that SAE features reveal shared representations across vision and language modalities. Our study provides a foundation for SAE evaluation in vision models, highlighting their strong potential for improving interpretability, generalization, and steerability in the visual domain.

[94] Unifying Scale-Aware Depth Prediction and Perceptual Priors for Monocular Endoscope Pose Estimation and Tissue Reconstruction

Muzammil Khan,Enzo Kerkhof,Matteo Fusaglia,Koert Kuhlmann,Theo Ruers,Françoise J. Siepel

Main category: cs.CV

TL;DR: This paper proposes a unified framework for monocular endoscopic tissue reconstruction to address challenges like depth ambiguity and physiological tissue deformation, achieving robust results and outperforming state-of-the-art methods.

Details Motivation: Accurate endoscope pose estimation and 3D tissue surface reconstruction enhance monocular minimally invasive surgical procedures, but challenges like depth ambiguity, physiological tissue deformation, and limited texture fidelity persist. Method: A unified framework integrating scale-aware depth prediction with temporally-constrained perceptual refinement, using modules named MAPIS-Depth and WEMA-RTDL, followed by volumetric fusion and marching cubes for 3D surface mesh extraction. Result: Evaluations on HEVD and SCARED datasets showed robust performance and superiority over existing state-of-the-art methods in monocular endoscopic tissue reconstruction. Conclusion: The proposed framework demonstrates robustness and superiority over state-of-the-art methods in monocular endoscopic tissue reconstruction. Abstract: Accurate endoscope pose estimation and 3D tissue surface reconstruction significantly enhance monocular minimally invasive surgical procedures by enabling accurate navigation and improved spatial awareness. However, monocular endoscope pose estimation and tissue reconstruction face persistent challenges, including depth ambiguity, physiological tissue deformation, inconsistent endoscope motion, limited texture fidelity, and a restricted field of view. To overcome these limitations, a unified framework for monocular endoscopic tissue reconstruction that integrates scale-aware depth prediction with temporally-constrained perceptual refinement is presented. This framework incorporates a novel MAPIS-Depth module, which leverages Depth Pro for robust initialisation and Depth Anything for efficient per-frame depth prediction, in conjunction with L-BFGS-B optimisation, to generate pseudo-metric depth estimates. 
These estimates are temporally refined by computing pixel correspondences using RAFT and adaptively blending flow-warped frames based on LPIPS perceptual similarity, thereby reducing artefacts arising from physiological tissue deformation and motion. To ensure accurate registration of the synthesised pseudo-RGBD frames from MAPIS-Depth, a novel WEMA-RTDL module is integrated, optimising both rotation and translation. Finally, truncated signed distance function-based volumetric fusion and marching cubes are applied to extract a comprehensive 3D surface mesh. Evaluations on HEVD and SCARED, with ablation and comparative analyses, demonstrate the framework's robustness and superiority over state-of-the-art methods.

[95] TimeMachine: Fine-Grained Facial Age Editing with Identity Preservation

Yilin Mi,Qixin Yan,Zheng-Peng Duan,Chunle Guo,Hubery Yin,Hao Liu,Chen Li,Chongyi Li

Main category: cs.CV

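TL;DR: This paper proposes TimeMachine, which injects high-precision age information into a multi-cross attention module and introduces an ACG module to achieve accurate fine-grained age editing while keeping identity features unchanged.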

Details Motivation: Achieving fine-grained age editing while keeping identity features unchanged remains challenging for existing facial image editing techniques. Method: The TimeMachine framework is proposed; it injects high-precision age information through a multi-cross attention module and includes an Age Classifier Guidance (ACG) module that predicts age directly in the latent space. Result: Experimental results show that TimeMachine achieves state-of-the-art performance in fine-grained age editing and identity consistency. Conclusion: TimeMachine achieves high-quality fine-grained age editing while keeping identity features unchanged. Abstract: With the advancement of generative models, facial image editing has made significant progress. However, achieving fine-grained age editing while preserving personal identity remains a challenging task. In this paper, we propose TimeMachine, a novel diffusion-based framework that achieves accurate age editing while keeping identity features unchanged. To enable fine-grained age editing, we inject high-precision age information into the multi-cross attention module, which explicitly separates age-related and identity-related features. This design facilitates more accurate disentanglement of age attributes, thereby allowing precise and controllable manipulation of facial aging. Furthermore, we propose an Age Classifier Guidance (ACG) module that predicts age directly in the latent space, instead of performing denoising image reconstruction during training. By employing a lightweight module to incorporate age constraints, this design enhances age editing accuracy at the cost of only a modest increase in training cost. Additionally, to address the lack of large-scale, high-quality facial age datasets, we construct a HFFA dataset (High-quality Fine-grained Facial-Age dataset) which contains one million high-resolution images labeled with identity and facial attributes. Experimental results demonstrate that TimeMachine achieves state-of-the-art performance in fine-grained age editing while preserving identity consistency.

[96] Hyperspectral vs. RGB for Pedestrian Segmentation in Urban Driving Scenes: A Comparative Study

Jiarong Li,Imad Ali Shah,Enda Ward,Martin Glavin,Edward Jones,Brian Deegan

Main category: cs.CV

TL;DR: This study shows that using optimal HSI band selection improves pedestrian segmentation accuracy, addressing safety challenges in automotive perception systems.

Details Motivation: Pedestrian segmentation faces safety challenges due to metamerism in RGB imaging, where pedestrians and backgrounds are visually indistinguishable, motivating the investigation of hyperspectral imaging as a solution. Method: The study used the Hyperspectral City v2 (H-City) dataset and compared RGB imaging with two dimensionality-reduction methods: PCA and CSNR-JMIM for converting HSI data into three-channel representations. Three segmentation models (U-Net, DeepLabV3+, SegFormer) were evaluated. Result: CSNR-JMIM outperformed RGB with 1.44% improvement in IoU and 2.18% in F1-score for pedestrian segmentation, and similar gains were observed for rider segmentation. Conclusion: This study demonstrates that optimal HSI band selection enhances pedestrian segmentation performance, showing significant potential for improving safety in automotive applications. Abstract: Pedestrian segmentation in automotive perception systems faces critical safety challenges due to metamerism in RGB imaging, where pedestrians and backgrounds appear visually indistinguishable. This study investigates the potential of hyperspectral imaging (HSI) for enhanced pedestrian segmentation in urban driving scenarios using the Hyperspectral City v2 (H-City) dataset. We compared standard RGB against two dimensionality-reduction approaches by converting 128-channel HSI data into three-channel representations: Principal Component Analysis (PCA) and optimal band selection using Contrast Signal-to-Noise Ratio with Joint Mutual Information Maximization (CSNR-JMIM). Three semantic segmentation models were evaluated: U-Net, DeepLabV3+, and SegFormer. CSNR-JMIM consistently outperformed RGB with average improvements of 1.44% in Intersection over Union (IoU) and 2.18% in F1-score for pedestrian segmentation. Rider segmentation showed similar gains with 1.43% IoU and 2.25% F1-score improvements. 
This improved performance results from the enhanced spectral discrimination of optimally selected HSI bands, which effectively reduces false positives. This study demonstrates robust pedestrian segmentation through optimal HSI band selection, showing significant potential for safety-critical automotive applications.
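The link between fewer false positives and higher IoU and F1-score follows directly from the metric definitions. The sketch below uses the standard per-class formulas on made-up pixel counts; the numbers are purely illustrative and are not results from the study.

```python
def iou(tp, fp, fn):
    """Intersection over Union for one class from pixel counts."""
    return tp / (tp + fp + fn)

def f1_score(tp, fp, fn):
    """F1-score (Dice) for one class from pixel counts."""
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical pedestrian-class counts: same true positives and false
# negatives, but the second input halves the false positives.
baseline = {"tp": 80, "fp": 20, "fn": 10}
fewer_fp = {"tp": 80, "fp": 10, "fn": 10}

iou_base, f1_base = iou(**baseline), f1_score(**baseline)
iou_new, f1_new = iou(**fewer_fp), f1_score(**fewer_fp)
```

Because false positives appear only in the denominators, removing them raises both metrics while leaving the count of correctly segmented pedestrian pixels unchanged.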

[97] Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval

Weijia Liu,Jiuxin Cao,Bo Miao,Zhiheng Fu,Xuelin Zhu,Jiawei Ge,Bo Liu,Mehwish Nasim,Ajmal Mian

Main category: cs.CV

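TL;DR: This paper proposes a denoise-then-retrieve paradigm for video moment retrieval (VMR) that improves retrieval accuracy by removing text-irrelevant video clips.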

Details Motivation: Current text-driven VMR methods encode all video clips, including irrelevant ones, which disrupts multimodal alignment and hinders optimization. Method: A Denoise-then-Retrieve Network (DRNet) is proposed, comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD uses cross-attention and structured state space blocks to dynamically identify noisy clips, while TRF distills a single query embedding and aligns it with the text embedding as auxiliary supervision for denoising during training. Result: Experiments on Charades-STA and QVHighlights show that the method surpasses state-of-the-art methods on all metrics. In addition, the proposed denoise-then-retrieve paradigm can be seamlessly integrated into advanced VMR models to boost performance. Conclusion: By introducing a denoising step that improves multimodal alignment, this work improves video moment retrieval performance and shows broad potential for integration into existing models. Abstract: Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, disrupting multimodal alignment and hindering optimization. To this end, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask to purify multimodal video representations. TRF further distills a single query embedding from purified video representations and aligns it with the text embedding, serving as auxiliary supervision for denoising during training. Finally, we perform conditional retrieval using text embeddings on purified video representations for accurate VMR. Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance.

[98] Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models

Yuchen Zhou,Jiayu Tang,Shuo Yang,Xiaoyan Xiao,Yuqin Dai,Wenhao Yang,Chao Gou,Xiaobo Xia,Tat-Seng Chua

Main category: cs.CV

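TL;DR: This work proposes LogicBench and LogicCLIP to improve the logical understanding of vision-language models and address their shortcomings in logical reasoning.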

Details Motivation: Existing VLMs have significant blind spots in logical understanding, which limits their reliability in practical applications. Method: The authors propose the LogicBench benchmark and the LogicCLIP training framework, which improves the logical sensitivity of VLMs through advances in data generation and optimization objectives. Result: Existing VLMs perform far below humans on LogicBench, particularly on Causality and Conditionality tasks. LogicCLIP achieves substantial improvements in logical understanding. Conclusion: LogicBench and LogicCLIP provide important resources for advancing the logical capabilities of VLMs. Abstract: Vision-Language Models (VLMs), exemplified by CLIP, have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored, resulting in critical "logical blindspots" that limit their reliability in practical applications. To systematically diagnose this, we introduce LogicBench, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. Our evaluation reveals that existing VLMs, even the state-of-the-art ones, fall more than 40 accuracy points below human performance, particularly in challenging tasks like Causality and Conditionality, highlighting their reliance on surface semantics over critical logical structures. To bridge this gap, we propose LogicCLIP, a novel training framework designed to boost VLMs' logical sensitivity through advancements in both data generation and optimization objectives. LogicCLIP utilizes logic-aware data generation and a contrastive learning strategy that combines coarse-grained alignment, a fine-grained multiple-choice objective, and a novel logical structure-aware objective. Extensive experiments demonstrate LogicCLIP's substantial improvements in logical comprehension across all LogicBench domains, significantly outperforming baselines. Moreover, LogicCLIP retains, and often surpasses, competitive performance on general vision-language benchmarks, demonstrating that the enhanced logical understanding does not come at the expense of general alignment. We believe that LogicBench and LogicCLIP will be important resources for advancing VLM logical capabilities.

[99] Delving into Dynamic Scene Cue-Consistency for Robust 3D Multi-Object Tracking

Haonan Zhang,Xinyao Wang,Boxi Wu,Tu Zheng,Wang Yunhua,Zheng Yang

Main category: cs.CV

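TL;DR: This paper proposes DSC-Track, a new 3D multi-object tracking method that combines a cue-consistency principle with a Transformer module to improve tracking performance.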

Details Motivation: Traditional tracking methods based on individual object motion models perform poorly in crowded environments or with inaccurate detections, while existing geometry-aware methods are susceptible to interference from irrelevant objects; a tracking method that exploits stable spatial patterns is therefore needed. Method: The authors design a unified spatiotemporal encoder based on Point Pair Features (PPF) to suppress interference, introduce a cue-consistency Transformer module for feature alignment, and adopt a dynamic update mechanism to maintain stable online tracking. Result: Experiments on the nuScenes and Waymo Open datasets show that DSC-Track is effective and robust, reaching 73.2% and 70.3% AMOTA on the nuScenes validation and test sets, respectively. Conclusion: The paper proposes DSC-Track, a new cue-consistency tracking method that improves 3D multi-object tracking performance through a unified spatiotemporal encoder, a cue-consistency Transformer module, and a dynamic update mechanism. Abstract: 3D multi-object tracking is a critical and challenging task in the field of autonomous driving. A common paradigm relies on modeling individual object motion, e.g., Kalman filters, to predict trajectories. While effective in simple scenarios, this approach often struggles in crowded environments or with inaccurate detections, as it overlooks the rich geometric relationships between objects. This highlights the need to leverage spatial cues. However, existing geometry-aware methods can be susceptible to interference from irrelevant objects, leading to ambiguous features and incorrect associations. To address this, we propose focusing on cue-consistency: identifying and matching stable spatial patterns over time. We introduce the Dynamic Scene Cue-Consistency Tracker (DSC-Track) to implement this principle. Firstly, we design a unified spatiotemporal encoder using Point Pair Features (PPF) to learn discriminative trajectory embeddings while suppressing interference. Secondly, our cue-consistency transformer module explicitly aligns consistent feature representations between historical tracks and current detections. Finally, a dynamic update mechanism preserves salient spatiotemporal information for stable online tracking. Extensive experiments on the nuScenes and Waymo Open Datasets validate the effectiveness and robustness of our approach. On the nuScenes benchmark, for instance, our method achieves state-of-the-art performance, reaching 73.2% and 70.3% AMOTA on the validation and test sets, respectively.

[100] Noise Matters: Optimizing Matching Noise for Diffusion Classifiers

Yanghao Wang,Long Chen

Main category: cs.CV

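TL;DR: This paper proposes NoOp, a noise optimization method that addresses noise instability in diffusion classifiers, improving classification performance and speed.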

Details Motivation: To address the unstable classification performance and slow classification speed of existing diffusion classifiers caused by randomly sampled noise. Method: The authors propose NoOp, a noise optimization method with two parts: optimizing a dataset-specific noise (Frequency Matching) and training a Meta-Network to generate an image-specific noise offset (Spatial Matching). Result: Extensive ablations on various datasets demonstrate the effectiveness of NoOp. Conclusion: NoOp effectively addresses noise instability in diffusion classifiers; by optimizing a dataset-specific noise and training a Meta-Network to generate image-specific noise offsets, it improves classification performance and speed. Abstract: Although today's pretrained discriminative vision-language models (e.g., CLIP) have demonstrated strong perception abilities, such as zero-shot image classification, they also suffer from the bag-of-words problem and spurious bias. To mitigate these problems, some pioneering studies leverage powerful generative models (e.g., pretrained diffusion models) to realize generalizable image classification, dubbed Diffusion Classifier (DC). Specifically, by randomly sampling a Gaussian noise, DC utilizes the differences of denoising effects with different category conditions to classify categories. Unfortunately, an inherent and notorious weakness of existing DCs is noise instability: different randomly sampled noises lead to significant performance changes. To achieve stable classification performance, existing DCs always ensemble the results of hundreds of sampled noises, which significantly reduces the classification speed. To this end, we first explore the role of noise in DC, and conclude that there are some "good noises" that can relieve the instability. Meanwhile, we argue that these good noises should meet two principles: Frequency Matching and Spatial Matching. Regarding both principles, we propose a novel Noise Optimization method to learn matching (i.e., good) noise for DCs: NoOp. For Frequency Matching, NoOp first optimizes a dataset-specific noise: given a dataset and a timestep t, it optimizes one randomly initialized parameterized noise. For Spatial Matching, NoOp trains a Meta-Network that adopts an image as input and outputs an image-specific noise offset. The sum of the optimized noise and the noise offset is used in DC to replace random noise.
Extensive ablations on various datasets demonstrated the effectiveness of NoOp.

[101] GANDiff FR: Hybrid GAN Diffusion Synthesis for Causal Bias Attribution in Face Recognition

Md Asgor Hossain Reaj,Rajan Das Gupta,Md Yeasin Rahat,Nafiz Fahad,Md Jawadul Hasan,Tze Hui Liew

Main category: cs.CV

TL;DR: GANDiff FR is a new framework for generating synthetic faces with controlled demographic and environmental attributes, enabling precise bias measurement and reduction in facial recognition systems.

Details Motivation: The need to precisely control demographic and environmental factors to measure, explain, and reduce bias in facial recognition systems. Method: GANDiff FR combines StyleGAN3-based identity-preserving generation with diffusion-based attribute control to manipulate pose, illumination, and expression under ceteris paribus conditions. It synthesizes 10,000 demographically balanced faces for bias analysis. Result: AdaFace reduced inter-group TPR disparity by 60% under matched operating points; illumination accounted for 42% of residual bias. GANDiff FR achieved strong synthetic-to-real transfer (r = 0.85) and produced three times more attribute-conditioned variants. Conclusion: GANDiff FR establishes a reproducible and regulation-aligned standard for fairness auditing with strong synthetic-to-real transfer and enables precise bias evaluation. Abstract: We introduce GANDiff FR, the first synthetic framework that precisely controls demographic and environmental factors to measure, explain, and reduce bias with reproducible rigor. GANDiff FR unifies StyleGAN3-based identity-preserving generation with diffusion-based attribute control, enabling fine-grained manipulation of pose around 30 degrees, illumination (four directions), and expression (five levels) under ceteris paribus conditions. We synthesize 10,000 demographically balanced faces across five cohorts validated for realism via automated detection (98.2%) and human review (89%) to isolate and quantify bias drivers. Benchmarking ArcFace, CosFace, and AdaFace under matched operating points shows AdaFace reduces inter-group TPR disparity by 60% (2.5% vs. 6.3%), with illumination accounting for 42% of residual bias. Cross-dataset evaluation on RFW, BUPT, and CASIA WebFace confirms strong synthetic-to-real transfer (r 0.85). 
Despite around 20% computational overhead relative to pure GANs, GANDiff FR yields three times more attribute-conditioned variants, establishing a reproducible, regulation-aligned (EU AI Act) standard for fairness auditing. Code and data are released to support transparent, scalable bias evaluation.

[102] Index-Aligned Query Distillation for Transformer-based Incremental Object Detection

Mingxiao Ma,Shunyao Zhu,Guoliang Kang

Main category: cs.CV

TL;DR: This paper introduces IAQD, a novel distillation method for transformer-based incremental object detection that reduces catastrophic knowledge forgetting by aligning queries through index matching.

Details Motivation: The motivation is to address the issue of catastrophic knowledge forgetting in transformer-based models during incremental object detection, which occurs when knowledge about previously learned categories is lost. Method: The authors propose a new distillation approach named Index-Aligned Query Distillation (IAQD), which matches queries of previous and current phase models by index, focusing on partial queries critical for detecting previous categories. Result: Experiments show that IAQD successfully mitigates knowledge forgetting and achieves state-of-the-art performance on benchmarks. Conclusion: The paper concludes that IAQD effectively addresses the problem of catastrophic knowledge forgetting in transformer-based models for incremental object detection. Abstract: Incremental object detection (IOD) aims to continuously expand the capability of a model to detect novel categories while preserving its performance on previously learned ones. When adopting a transformer-based detection model to perform IOD, catastrophic knowledge forgetting may inevitably occur, meaning the detection performance on previously learned categories may severely degenerate. Previous typical methods mainly rely on knowledge distillation (KD) to mitigate the catastrophic knowledge forgetting of transformer-based detection models. Specifically, they utilize Hungarian Matching to build a correspondence between the queries of the last-phase and current-phase detection models and align the classifier and regressor outputs between matched queries to avoid knowledge forgetting. However, we observe that in IOD task, Hungarian Matching is not a good choice. With Hungarian Matching, the query of the current-phase model may match different queries of the last-phase model at different iterations during KD. As a result, the knowledge encoded in each query may be reshaped towards new categories, leading to the forgetting of previously encoded knowledge of old categories. 
Based on our observations, we propose a new distillation approach named Index-Aligned Query Distillation (IAQD) for transformer-based IOD. Beyond using Hungarian Matching, IAQD establishes a correspondence between queries of the previous and current phase models that have the same index. Moreover, we perform index-aligned distillation only on partial queries which are critical for the detection of previous categories. In this way, IAQD largely preserves the previous semantic and spatial encoding capabilities without interfering with the learning of new categories. Extensive experiments on representative benchmarks demonstrate that IAQD effectively mitigates knowledge forgetting, achieving new state-of-the-art performance.
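
The index-aligned pairing described in the abstract above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes each model emits one output vector per query, uses a mean squared error as the distillation loss, and takes the set of "critical" query indices as given. The function name `index_aligned_distill_loss` and the toy values are hypothetical.

```python
from typing import Sequence


def index_aligned_distill_loss(
    old_outputs: Sequence[Sequence[float]],
    new_outputs: Sequence[Sequence[float]],
    critical_indices: Sequence[int],
) -> float:
    """Mean squared error between same-index queries of two models.

    Query i of the current-phase model is always paired with query i of the
    previous-phase model, so the pairing cannot change between iterations.
    Only queries listed in critical_indices contribute (assumed to be given).
    """
    if len(old_outputs) != len(new_outputs):
        raise ValueError("both models must have the same number of queries")
    total = 0.0
    count = 0
    for i in critical_indices:
        for old_v, new_v in zip(old_outputs[i], new_outputs[i]):
            total += (old_v - new_v) ** 2
            count += 1
    # No selected queries (or empty vectors) means nothing to distill.
    return total / count if count else 0.0


# Toy outputs for three queries; query 2 is free to drift toward new classes.
old = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
new = [[1.0, 0.0], [0.0, 0.0], [9.0, 9.0]]
loss = index_aligned_distill_loss(old, new, critical_indices=[0, 1])
print(loss)
```

In this sketch, query 2 is excluded from the loss, so its large change is not penalized, which mirrors the idea of distilling only on a subset of queries while leaving the rest available for new categories.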

[103] Cost-Effective Active Labeling for Data-Efficient Cervical Cell Classification

Yuanlin Liu,Zhihan Zhou,Mingqiang Wei,Youyi Song

Main category: cs.CV

TL;DR: This paper proposes an active labeling approach to reduce human effort in building effective training datasets for cervical cell classification.

Details Motivation: Existing classification methods require a representative training dataset, which is costly and time-consuming to create manually. Method: Active labeling method that leverages the classifier's uncertainty on unlabeled images to select the most beneficial data for labeling. Result: Empirical results show that the method effectively improves the representativeness of the training dataset and reduces human cost. Conclusion: The proposed active labeling method enhances the efficiency of constructing a representative training dataset for cervical cell classification, significantly reducing human cost. Abstract: Information on the number and category of cervical cells is crucial for the diagnosis of cervical cancer. However, existing classification methods capable of automatically measuring this information require the training dataset to be representative, which consumes an expensive or even unaffordable human cost. We herein propose active labeling that enables us to construct a representative training dataset using a much smaller human cost for data-efficient cervical cell classification. This cost-effective method efficiently leverages the classifier's uncertainty on the unlabeled cervical cell images to accurately select images that are most beneficial to label. With a fast estimation of the uncertainty, this new algorithm exhibits its validity and effectiveness in enhancing the representative ability of the constructed training dataset. The extensive empirical results confirm its efficacy again in navigating the usage of human cost, opening the avenue for data-efficient cervical cell classification.
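
As a rough illustration of uncertainty-driven selection, the sketch below ranks unlabeled images by the entropy of the classifier's predicted class probabilities and picks the most uncertain ones for labeling. The abstract does not specify how uncertainty is estimated, so predictive entropy is an assumption here, and the names `predictive_entropy` and `select_for_labeling` are hypothetical.

```python
import math
from typing import List, Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(pool_probs: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return indices of the `budget` most uncertain unlabeled samples."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: predictive_entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


# Toy softmax outputs for four unlabeled images over three classes.
pool = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction, highest entropy
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],
]
chosen = select_for_labeling(pool, budget=2)
print(chosen)
```

The selected images would then be sent to a human annotator and added to the training set before the classifier is retrained.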

[104] Semantically Guided Adversarial Testing of Vision Models Using Language Models

Katarzyna Filus,Jorge M. Cruz-Duarte

Main category: cs.CV

TL;DR: This paper proposes a semantics-guided framework for selecting adversarial targets in vision models using cross-modal knowledge transfer from pretrained models, showing improved performance over static databases and the potential for standardized adversarial benchmarks.

Details Motivation: The motivation is to improve the interpretability, reproducibility, and flexibility of adversarial target selection in vision models by using semantic relationships rather than random or static methods. Method: The paper proposes a semantics-guided framework for adversarial target selection that uses cross-modal knowledge transfer from pretrained language and vision-language models. Several state-of-the-art models are evaluated as similarity sources to select semantically related labels for forming adversarial scenarios. Result: The experiments showed that the proposed method consistently forms practical adversarial targets and outperforms static lexical databases like WordNet, especially for distant class relationships. It also allows for a preliminary assessment of similarity sources through static testing. Conclusion: The paper concludes that pretrained models are suitable for creating adversarial benchmarks that are interpretable, standardized, and scalable across different architectures and datasets. Abstract: In targeted adversarial attacks on vision models, the selection of the target label is a critical yet often overlooked determinant of attack success. This target label corresponds to the class that the attacker aims to force the model to predict. Now, existing strategies typically rely on randomness, model predictions, or static semantic resources, limiting interpretability, reproducibility, or flexibility. This paper then proposes a semantics-guided framework for adversarial target selection using the cross-modal knowledge transfer from pretrained language and vision-language models. We evaluate several state-of-the-art models (BERT, TinyLLAMA, and CLIP) as similarity sources to select the most and least semantically related labels with respect to the ground truth, forming best- and worst-case adversarial scenarios. 
Our experiments on three vision models and five attack methods reveal that these models consistently render practical adversarial targets and surpass static lexical databases, such as WordNet, particularly for distant class relationships. We also observe that static testing of target labels offers a preliminary assessment of the effectiveness of similarity sources, that is, a priori testing. Our results corroborate the suitability of pretrained models for constructing interpretable, standardized, and scalable adversarial benchmarks across architectures and datasets.

[105] HOID-R1: Reinforcement Learning for Open-World Human-Object Interaction Detection Reasoning with Multimodal Large Language Model

Zhenhao Zhang,Hanqing Wang,Xiangyu Zeng,Ziyu Cheng,Jiaxin Liu,Haoyu Yan,Zhirui Liu,Kaiyang Ji,Tianxiang Gui,Ke Hu,Kangyi Chen,Yahao Fan,Mokai Pan

Main category: cs.CV

TL;DR: This paper proposes HOID-R1, a novel framework for human-object interaction detection that integrates CoT-guided SFT and GRPO with reinforcement learning, achieving superior performance and generalization capabilities.

Details Motivation: Recent open-vocabulary HOI detection approaches neglect the inherent 3D spatial understanding capabilities of models by relying solely on large language models for textual prompts. The authors aim to address this shortcoming. Method: The paper introduces HOID-R1, which combines chain-of-thought (CoT) guided supervised fine-tuning (SFT) and group relative policy optimization (GRPO) within a reinforcement learning paradigm. An 'MLLM-as-a-judge' mechanism is also introduced to supervise CoT outputs. Result: Extensive experiments show that HOID-R1 achieves state-of-the-art performance on HOI detection benchmarks and outperforms existing methods in generalization to novel scenarios. Conclusion: The proposed HOID-R1 framework achieves state-of-the-art performance on HOI detection benchmarks and demonstrates superior open-world generalization to novel scenarios. Abstract: Understanding and recognizing human-object interaction (HOI) is a pivotal application in AR/VR and robotics. Recent open-vocabulary HOI detection approaches depend exclusively on large language models for richer textual prompts, neglecting their inherent 3D spatial understanding capabilities. To address this shortcoming, we introduce HOID-R1, the first HOI detection framework that integrates chain-of-thought (CoT) guided supervised fine-tuning (SFT) with group relative policy optimization (GRPO) within a reinforcement learning (RL) paradigm. Specifically, we initially apply SFT to imbue the model with essential reasoning capabilities, forcing the model to articulate its thought process in the output. Subsequently, we integrate GRPO to leverage multi-reward signals for policy optimization, thereby enhancing alignment across diverse modalities. To mitigate hallucinations in the CoT reasoning, we introduce an "MLLM-as-a-judge" mechanism that supervises the CoT outputs, further improving generalization. 
Extensive experiments show that HOID-R1 achieves state-of-the-art performance on HOI detection benchmarks and outperforms existing methods in open-world generalization to novel scenarios.

[106] Leveraging the RETFound foundation model for optic disc segmentation in retinal images

Zhenyi Zhao,Muthu Rama Krishnan Mookiah,Emanuele Trucco

Main category: cs.CV

TL;DR: RETFound, a foundation model for retinal images, is adapted for optic disc segmentation and achieves state-of-the-art results across multiple datasets.

Details Motivation: The motivation is to explore the potential of RETFound, a foundation model developed for retinal images, beyond disease diagnosis and apply it to optic disc segmentation, a fundamental task in retinal image analysis. Method: The study adapts RETFound for optic disc segmentation by training a task-specific head with a modest number of examples and evaluates its performance across four public and one private dataset. Result: The adapted RETFound achieves about 96% Dice score consistently across all datasets, showing superior performance in internal verification, domain generalization, and domain adaptation compared to baseline networks. Conclusion: RETFound, a foundation model, has been successfully adapted for optic disc segmentation, demonstrating excellent performance and outperforming state-of-the-art methods across multiple datasets. Abstract: RETFound is a well-known foundation model (FM) developed for fundus camera and optical coherence tomography images. It has shown promising performance across multiple datasets in diagnosing diseases, both eye-specific and systemic, from retinal images. However, to our best knowledge, it has not been used for other tasks. We present the first adaptation of RETFound for optic disc segmentation, a ubiquitous and foundational task in retinal image analysis. The resulting segmentation system outperforms state-of-the-art, segmentation-specific baseline networks after training a head with only a very modest number of task-specific examples. We report and discuss results with four public datasets, IDRID, Drishti-GS, RIM-ONE-r3, and REFUGE, and a private dataset, GoDARTS, achieving about 96% Dice consistently across all datasets. Overall, our method obtains excellent performance in internal verification, domain generalization and domain adaptation, and exceeds most of the state-of-the-art baseline results. 
We discuss the results in the framework of the debate about FMs as alternatives to task-specific architectures. The code is available at: [link to be added after the paper is accepted]
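
For readers unfamiliar with the Dice metric quoted above, the following sketch computes it for two binary masks as 2|A ∩ B| / (|A| + |B|). The masks are assumed to be flattened into lists of 0/1 values, and returning 1.0 when both masks are empty is a convention chosen for this example; the function name `dice_score` is hypothetical.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        # Both masks are empty: treated as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / size_sum


# Toy example: 3 predicted pixels, 2 ground-truth pixels, 2 in common.
pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 1, 1, 0, 0, 0]
score = dice_score(pred_mask, true_mask)
print(score)
```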

[107] Does the Skeleton-Recall Loss Really Work?

Devansh Arora,Nitin Kumar,Sukrit Gupta

Main category: cs.CV

TL;DR: This paper critically evaluates the effectiveness of topology-based loss functions like SRL for segmenting thin tubular structures, showing they do not consistently outperform traditional methods.

Details Motivation: To assess the effectiveness of topology preservation-based loss functions, particularly SRL, in image segmentation for thin tubular structures. Method: The authors conducted a theoretical analysis of the Skeleton Recall Loss (SRL) gradients and empirically evaluated its performance across multiple datasets. Result: The study found that SRL-based models did not outperform traditional baseline models in segmentation accuracy. Conclusion: The work provides a critical evaluation of topology-based loss functions, highlighting their limitations in achieving superior performance in tubular structure segmentation. Abstract: Image segmentation is an important and widely performed task in computer vision. Accomplishing effective image segmentation in diverse settings often requires custom model architectures and loss functions. One family of approaches that specializes in segmenting thin tubular structures uses topology preservation-based loss functions. These approaches often utilize a pixel skeletonization process claimed to generate more precise segmentation masks of thin tubes and better capture the structures that other models often miss. One such approach, Skeleton Recall Loss (SRL) proposed by Kirchhoff et al., was stated to produce state-of-the-art results on benchmark tubular datasets. In this work, we performed a theoretical analysis of the gradients for the SRL loss. Upon comparing the performance of the proposed method on some of the tubular datasets (used in the original work, along with some additional datasets), we found that the performance of SRL-based segmentation models did not exceed traditional baseline models. By providing both a theoretical explanation and empirical evidence, this work critically evaluates the limitations of topology-based loss functions, offering valuable insights for researchers aiming to develop more effective segmentation models for complex tubular structures.

[108] Unified Knowledge Distillation Framework: Fine-Grained Alignment and Geometric Relationship Preservation for Deep Face Recognition

Durgesh Mishra,Rishabh Uikey

Main category: cs.CV

TL;DR: This paper introduces a new unified knowledge distillation framework for face recognition that improves model performance on edge devices by combining two novel loss functions that capture both instance-level details and relational structures.

Details Motivation: Traditional KD methods like Raw L2 Feature Distillation or Feature Consistency loss struggle to capture both fine-grained instance-level details and complex relational structures, leading to suboptimal performance in face recognition models. This work aims to address this limitation. Method: The method introduces two new loss functions: Instance-Level Embedding Distillation, which aligns individual feature embeddings using a dynamic hard mining strategy, and Relation-Based Pairwise Similarity Distillation, which captures relational information through pairwise similarities with a memory bank and sample mining strategy. Result: The proposed unified framework surpasses state-of-the-art distillation methods on multiple benchmark face recognition datasets. Notably, when using strong teacher networks, the student model achieves higher accuracy than the teacher. Conclusion: The unified KD framework outperforms existing distillation methods in face recognition tasks and, under certain conditions, allows the student model to exceed the teacher's performance. Abstract: Knowledge Distillation is crucial for optimizing face recognition models for deployment in computationally limited settings, such as edge devices. Traditional KD methods, such as Raw L2 Feature Distillation or Feature Consistency loss, often fail to capture both fine-grained instance-level details and complex relational structures, leading to suboptimal performance. We propose a unified approach that integrates two novel loss functions, Instance-Level Embedding Distillation and Relation-Based Pairwise Similarity Distillation. Instance-Level Embedding Distillation focuses on aligning individual feature embeddings by leveraging a dynamic hard mining strategy, thereby enhancing learning from challenging examples. 
Relation-Based Pairwise Similarity Distillation captures relational information through pairwise similarity relationships, employing a memory bank mechanism and a sample mining strategy. This unified framework ensures both effective instance-level alignment and preservation of geometric relationships between samples, leading to a more comprehensive distillation process. Our unified framework outperforms state-of-the-art distillation methods across multiple benchmark face recognition datasets, as demonstrated by extensive experimental evaluations. Interestingly, when using strong teacher networks compared to the student, our unified KD enables the student to even surpass the teacher's accuracy.

[109] G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration

Ramil Khafizov,Artem Komarichev,Ruslan Rakhimov,Peter Wonka,Evgeny Burnaev

Main category: cs.CV

TL;DR: G-CUT3R improves 3D scene reconstruction by integrating auxiliary data through a modified model, achieving better performance on multi-view tasks.

Details Motivation: The motivation is to improve 3D scene reconstruction by leveraging auxiliary data, such as depth and camera calibrations, which are commonly available in real-world scenarios. Method: The method involves a lightweight modification to the CUT3R model, incorporating dedicated encoders for each modality and fusing features with RGB image tokens via zero convolution. Result: Evaluated across multiple benchmarks, the approach demonstrates significant performance improvements in utilizing available prior information while maintaining compatibility with varying input modalities. Conclusion: G-CUT3R effectively enhances 3D scene reconstruction by integrating auxiliary data, showing improved performance across multiple benchmarks. Abstract: We introduce G-CUT3R, a novel feed-forward approach for guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods that rely solely on input images, our method leverages auxiliary data, such as depth, camera calibrations, or camera positions, commonly available in real-world scenarios. We propose a lightweight modification to CUT3R, incorporating a dedicated encoder for each modality to extract features, which are fused with RGB image tokens via zero convolution. This flexible design enables seamless integration of any combination of prior information during inference. Evaluated across multiple benchmarks, including 3D reconstruction and other multi-view tasks, our approach demonstrates significant performance improvements, showing its ability to effectively utilize available priors while maintaining compatibility with varying input modalities.

[110] RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator

Zhiming Liu,Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: RMFAT is a lightweight and efficient method for real-time video restoration under atmospheric turbulence conditions, offering improved clarity and faster inference speed compared to existing methods.

Details Motivation: Atmospheric turbulence degrades video quality, and existing methods based on transformer and 3D architectures have high computational costs, limiting real-time deployment in resource-constrained scenarios. Method: RMFAT uses a recurrent framework with multi-scale feature encoding and decoding to restore video frames using only two inputs at a time, reducing computational burden and improving temporal consistency. Result: RMFAT outperforms existing methods in clarity restoration (nearly 9% improvement in SSIM) and achieves significantly improved inference speed (more than fourfold reduction in runtime). Conclusion: RMFAT is a lightweight and efficient framework suitable for real-time atmospheric turbulence suppression tasks. Abstract: Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods are based on transformer and 3D architectures and require multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose RMFAT: Recurrent Multi-scale Feature Atmospheric Turbulence Mitigator, designed for efficient and temporally consistent video restoration under AT conditions. RMFAT adopts a lightweight recurrent framework that restores each frame using only two inputs at a time, significantly reducing temporal window size and computational burden. It further integrates multi-scale feature encoding and decoding with temporal warping modules at both encoder and decoder stages to enhance spatial detail and temporal coherence. 
Extensive experiments on synthetic and real-world atmospheric turbulence datasets demonstrate that RMFAT not only outperforms existing methods in terms of clarity restoration (with nearly a 9% improvement in SSIM) but also achieves significantly improved inference speed (more than a fourfold reduction in runtime), making it particularly suitable for real-time atmospheric turbulence suppression tasks.

[111] SelfAdapt: Unsupervised Domain Adaptation of Cell Segmentation Models

Fabian H. Reith,Jannik Franzen,Dinesh R. Palli,J. Lorenz Rumberger,Dagmar Kainmueller

Main category: cs.CV

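TL;DR: SelfAdapt is a label-free adaptation method for cell segmentation models that uses student-teacher augmentation consistency training, L2-SP regularization, and label-free stopping criteria, and significantly improves performance on the LiveCell and TissueNet datasets.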

Details Motivation: Although generalist models such as Cellpose perform well across diverse cellular data, their effectiveness often degrades on domains that differ from their training data, and supervised fine-tuning requires annotated data that is hard to obtain. Method: SelfAdapt builds on student-teacher augmentation consistency training and introduces L2-SP regularization and label-free stopping criteria. Result: Evaluation on the LiveCell and TissueNet datasets shows relative improvements in AP0.5 of up to 29.64% over baseline Cellpose, and the method can further improve models previously fine-tuned with supervision. Conclusion: SelfAdapt adapts cell segmentation models without labels, achieves relative AP0.5 improvements of up to 29.64% over baseline Cellpose on LiveCell and TissueNet, and can further improve models that were previously fine-tuned with supervision. Abstract: Deep neural networks have become the go-to method for biomedical instance segmentation. Generalist models like Cellpose demonstrate state-of-the-art performance across diverse cellular data, though their effectiveness often degrades on domains that differ from their training data. While supervised fine-tuning can address this limitation, it requires annotated data that may not be readily available. We propose SelfAdapt, a method that enables the adaptation of pre-trained cell segmentation models without the need for labels. Our approach builds upon student-teacher augmentation consistency training, introducing L2-SP regularization and label-free stopping criteria. We evaluate our method on the LiveCell and TissueNet datasets, demonstrating relative improvements in AP0.5 of up to 29.64% over baseline Cellpose. Additionally, we show that our unsupervised adaptation can further improve models that were previously fine-tuned with supervision. We release SelfAdapt as an easy-to-use extension of the Cellpose framework. The code for our method is publicly available at https://github.com/Kainmueller-Lab/self_adapt.
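
The L2-SP regularization mentioned in this entry can be illustrated with a short sketch. L2-SP penalizes the squared distance between the current weights and the pre-trained starting point, (alpha / 2) * ||w - w0||^2, instead of the distance to zero used by ordinary weight decay. The flat weight lists, the value of `alpha`, and the function name `l2_sp_penalty` are assumptions for illustration; how SelfAdapt weights or applies this term is not stated in the abstract.

```python
from typing import Sequence


def l2_sp_penalty(
    weights: Sequence[float],
    pretrained: Sequence[float],
    alpha: float,
) -> float:
    """L2-SP penalty: (alpha / 2) * squared distance to the pre-trained weights."""
    if len(weights) != len(pretrained):
        raise ValueError("weight vectors must have the same length")
    squared_distance = sum((w - w0) ** 2 for w, w0 in zip(weights, pretrained))
    return 0.5 * alpha * squared_distance


# Toy example: the adapted weights have drifted from the starting point.
w0 = [1.0, 1.0, 1.0]       # pre-trained weights (the "starting point")
w = [1.0, 2.0, 3.0]        # weights after some adaptation steps
penalty = l2_sp_penalty(w, w0, alpha=0.1)
print(penalty)
```

During adaptation, a term like this would be added to the consistency loss so that the model stays close to its pre-trained solution while it adjusts to the new domain.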

[112] Training-free Dimensionality Reduction via Feature Truncation: Enhancing Efficiency in Privacy-preserving Multi-Biometric Systems

Florian Bayer,Maximilian Russo,Christian Rathgeb

Main category: cs.CV

TL;DR: This paper explores the use of multi-modal biometric fusion to reduce template size for more efficient encrypted processing while maintaining accuracy and security.

Details Motivation: Biometric Template Protection schemes, especially those using Homomorphic Encryption, are computationally expensive. This work aims to address this challenge by exploring reduced multi-biometric template sizes while maintaining security and performance. Method: The study uses deep neural networks to extract features from face, fingerprint, and iris modalities. It applies dimensionality reduction techniques to reduce template size and evaluates the performance using an in-house virtual multi-biometric database derived from FRGC, MCYT, and CASIA datasets. Result: The proposed approaches allow for a 67% reduction in template size without any loss in Equal Error Rate (EER) compared to the best-performing single modality. Additionally, the methods are explainable, training-free, and generalizable. Conclusion: Fusing multiple biometric modalities can reduce template size by 67% without compromising accuracy, making it more efficient for encrypted processing while maintaining security. Abstract: Biometric recognition is widely used, making the privacy and security of extracted templates a critical concern. Biometric Template Protection schemes, especially those utilizing Homomorphic Encryption, introduce significant computational challenges due to increased workload. Recent advances in deep neural networks have enabled state-of-the-art feature extraction for face, fingerprint, and iris modalities. The ubiquity and affordability of biometric sensors further facilitate multi-modal fusion, which can enhance security by combining features from different modalities. This work investigates the biometric performance of reduced multi-biometric template sizes. Experiments are conducted on an in-house virtual multi-biometric database, derived from DNN-extracted features for face, fingerprint, and iris, using the FRGC, MCYT, and CASIA databases. 
The evaluated approaches are (i) explainable and straightforward to implement under encryption, (ii) training-free, and (iii) capable of generalization. Dimensionality reduction of feature vectors leads to fewer operations in the Homomorphic Encryption (HE) domain, enabling more efficient encrypted processing while maintaining biometric accuracy and security at a level equivalent to or exceeding single-biometric recognition. Our results demonstrate that, by fusing feature vectors from multiple modalities, template size can be reduced by 67% with no loss in Equal Error Rate (EER) compared to the best-performing single modality.

[113] ImagiDrive: A Unified Imagination-and-Planning Framework for Autonomous Driving

Jingyu Li,Bozhou Zhang,Xin Jin,Jiankang Deng,Xiatian Zhu,Li Zhang

Main category: cs.CV

TL;DR: This paper proposes ImagiDrive, a new autonomous driving framework that integrates a VLM and a DWM to achieve better driving performance.

Details Motivation: Autonomous driving requires rich contextual understanding and precise predictive reasoning. Although VLMs and DWMs each excel in different respects, integrating them remains challenging, particularly when connecting action-level decisions with high-fidelity pixel-level predictions. Method: The paper proposes ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer in a unified imagination-and-planning loop, and introduces an early stopping mechanism and a trajectory selection strategy. Result: Extensive experiments on the nuScenes and NAVSIM datasets validate the robustness and superiority of ImagiDrive under both open-loop and closed-loop conditions. Conclusion: ImagiDrive effectively combines the strengths of VLMs and DWMs, achieving more robust and superior autonomous driving performance. Abstract: Autonomous driving requires rich contextual comprehension and precise predictive reasoning to navigate dynamic and complex environments safely. Vision-Language Models (VLMs) and Driving World Models (DWMs) have independently emerged as powerful recipes addressing different aspects of this challenge. VLMs provide interpretability and robust action prediction through their ability to understand multi-modal context, while DWMs excel in generating detailed and plausible future driving scenarios essential for proactive planning. Integrating VLMs with DWMs is an intuitive, promising, yet understudied strategy to exploit the complementary strengths of accurate behavioral prediction and realistic scene generation. Nevertheless, this integration presents notable challenges, particularly in effectively connecting action-level decisions with high-fidelity pixel-level predictions and maintaining computational efficiency. In this paper, we propose ImagiDrive, a novel end-to-end autonomous driving framework that integrates a VLM-based driving agent with a DWM-based scene imaginer to form a unified imagination-and-planning loop. The driving agent predicts initial driving trajectories based on multi-modal inputs, guiding the scene imaginer to generate corresponding future scenarios. These imagined scenarios are subsequently utilized to iteratively refine the driving agent's planning decisions. To address efficiency and predictive accuracy challenges inherent in this integration, we introduce an early stopping mechanism and a trajectory selection strategy. 
Extensive experimental validation on the nuScenes and NAVSIM datasets demonstrates the robustness and superiority of ImagiDrive over previous alternatives under both open-loop and closed-loop conditions.

[114] Remove360: Benchmarking Residuals After Object Removal in 3D Gaussian Splatting

Simona Kocour,Assia Benbihi,Torsten Sattler

Main category: cs.CV

TL;DR: This paper introduces a new benchmark and dataset, Remove360, to evaluate semantic residuals after object removal in 3D scenes, highlighting the limitations of current techniques and the need for more robust solutions.

Details Motivation: The motivation stems from the need to understand what semantic information persists after object removal, which is crucial for privacy-preserving 3D reconstruction and editable scene representations. Method: The authors introduce a novel benchmark and evaluation framework to measure semantic residuals after object removal in 3D Gaussian Splatting. They conduct experiments across diverse indoor and outdoor scenes and introduce the Remove360 dataset for evaluation. Result: The experiments show that current methods can preserve semantic information despite the absence of visual geometry. However, the study reveals critical limitations in current 3D object removal techniques when dealing with real-world complexity. Conclusion: The paper concludes that current 3D object removal techniques have critical limitations and highlights the need for more robust solutions to handle real-world complexity in preserving semantic information after object removal. Abstract: Understanding what semantic information persists after object removal is critical for privacy-preserving 3D reconstruction and editable scene representations. In this work, we introduce a novel benchmark and evaluation framework to measure semantic residuals, the unintended semantic traces left behind, after object removal in 3D Gaussian Splatting. We conduct experiments across a diverse set of indoor and outdoor scenes, showing that current methods can preserve semantic information despite the absence of visual geometry. We also release Remove360, a dataset of pre/post-removal RGB images and object-level masks captured in real-world environments. While prior datasets have focused on isolated object instances, Remove360 covers a broader and more complex range of indoor and outdoor scenes, enabling evaluation of object removal in the context of full-scene representations. 
Given ground truth images of a scene before and after object removal, we assess whether we can truly eliminate semantic presence, and if downstream models can still infer what was removed. Our findings reveal critical limitations in current 3D object removal techniques and underscore the need for more robust solutions capable of handling real-world complexity. The evaluation framework is available at github.com/spatial-intelligence-ai/Remove360.git. Data are available at huggingface.co/datasets/simkoc/Remove360.

[115] MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation

Qian Liang,Yujia Wu,Kuncheng Li,Jiwei Wei,Shiyuan He,Jinyu Guo,Ning Xie

Main category: cs.CV

TL;DR: This paper proposes MM-R1, a framework that enables unified Multimodal Large Language Models to perform personalized image generation efficiently and without extensive fine-tuning.

Details Motivation: Aligning Multimodal Large Language Models (MLLMs) with personalized image generation is challenging due to the data-intensive fine-tuning required by existing methods for each new subject. Method: The paper introduces MM-R1, which employs a cross-modal Chain-of-Thought (X-CoT) reasoning strategy and Grouped Reward Proximal Policy Optimization (GRPO) to improve personalized image generation. Result: Experiments show that MM-R1 enables unified MLLMs to generate personalized images in a zero-shot manner, achieving high subject fidelity and strong text alignment. Conclusion: MM-R1 effectively enhances the personalization capability of unified MLLMs for generating images with high subject fidelity and strong text alignment without requiring extensive fine-tuning. Abstract: Multimodal Large Language Models (MLLMs) with unified architectures excel across a wide range of vision-language tasks, yet aligning them with personalized image generation remains a significant challenge. Existing methods for MLLMs are frequently subject-specific, demanding a data-intensive fine-tuning process for every new subject, which limits their scalability. In this paper, we introduce MM-R1, a framework that integrates a cross-modal Chain-of-Thought (X-CoT) reasoning strategy to unlock the inherent potential of unified MLLMs for personalized image generation. Specifically, we structure personalization as an integrated visual reasoning and generation process: (1) grounding subject concepts by interpreting and understanding user-provided images and contextual cues, and (2) generating personalized images conditioned on both the extracted subject representations and user prompts. To further enhance the reasoning capability, we adopt Grouped Reward Proximal Policy Optimization (GRPO) to explicitly align the generation. 
Experiments demonstrate that MM-R1 unleashes the personalization capability of unified MLLMs to generate images with high subject fidelity and strong text alignment in a zero-shot manner.

[116] Inside Knowledge: Graph-based Path Generation with Explainable Data Augmentation and Curriculum Learning for Visual Indoor Navigation

Daniel Airinei,Elena Burceanu,Marius Leordeanu

Main category: cs.CV

TL;DR: This paper presents an efficient, real-time, vision-based deep learning solution for indoor navigation, eliminating the need for GPS or additional sensors.

Details Motivation: The motivation stems from the difficulty of indoor navigation due to poor GPS access and the lack of deployable solutions despite progress in the field. Method: The method involves a novel graph-based path generation technique, combined with explainable data augmentation and curriculum learning, allowing automatic data collection, annotation, and training. Result: The result is a novel, large-scale dataset with annotated video footage from a shopping mall and an Android application that demonstrates the effectiveness of their vision-based indoor navigation approach. Conclusion: The paper concludes that their deep learning approach, which relies solely on visual input, is efficient, real-time, and easily deployable for indoor navigation without the need for special sensors or additional infrastructure. Abstract: Indoor navigation is a difficult task, as it generally comes with poor GPS access, forcing solutions to rely on other sources of information. While significant progress continues to be made in this area, deployment to production applications is still lacking, given the complexity and additional requirements of current solutions. Here, we introduce an efficient, real-time and easily deployable deep learning approach, based on visual input only, that can predict the direction towards a target from images captured by a mobile device. Our technical approach, based on a novel graph-based path generation method, combined with explainable data augmentation and curriculum learning, includes contributions that make the process of data collection, annotation and training as automatic as possible, efficient and robust. On the practical side, we introduce a novel large-scale dataset, with video footage inside a relatively large shopping mall, in which each frame is annotated with the correct next direction towards different specific target destinations. 
Different from current methods, ours relies solely on vision, avoiding the need for special sensors, additional markers placed along the path, knowledge of the scene map or internet access. We also created an easy-to-use application for Android, which we plan to make publicly available. We make all our data and code available along with visual demos on our project site

[117] Data-Driven Deepfake Image Detection Method -- The 2024 Global Deepfake Image Detection Challenge

Xiaoya Zhu,Yibing Nan,Shiguo Lian

Main category: cs.CV

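TL;DR: This paper proposes a Deepfake image detection technique based on the Swin Transformer V2-B classification network combined with data augmentation and sample generation methods; it won an award of excellence in the competition, demonstrating its effectiveness.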

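Details Motivation: With the rapid development of AI technology, Deepfake technology has become a double-edged sword: it has produced a large amount of AI-generated content and also posed unprecedented challenges to digital security. Efficient methods for detecting Deepfake images are therefore needed. Method: We adopt a classification network based on Swin Transformer V2-B, combined with online data augmentation and offline sample generation, to increase the diversity of training samples and the generalization ability of the model. Result: Using the Swin Transformer V2-B classification network together with data augmentation and sample generation, our model performed well on the Deepfake image detection task and ultimately won the award of excellence. Conclusion: Our method won the award of excellence in the Deepfake image detection task, demonstrating the effectiveness of combining a Swin Transformer V2-B classification network with data augmentation and sample generation. Abstract: With the rapid development of technology in the field of AI, deepfake technology has emerged as a double-edged sword. It has not only created a large amount of AI-generated content but also posed unprecedented challenges to digital security. The task of the competition is to determine whether a face image is a Deepfake image and output its probability score of being a Deepfake image. In the image track competition, our approach is based on the Swin Transformer V2-B classification network. Online data augmentation and offline sample generation methods are employed to enrich the diversity of training samples and increase the generalization ability of the model. Finally, we received the award of excellence in Deepfake image detection.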

[118] CoFi: A Fast Coarse-to-Fine Few-Shot Pipeline for Glomerular Basement Membrane Segmentation

Hongjin Fang,Daniel Reisenbüchler,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng

Main category: cs.CV

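TL;DR: This paper proposes CoFi, a fast and efficient coarse-to-fine few-shot segmentation pipeline for segmenting the glomerular basement membrane in electron microscopy images, substantially reducing annotation requirements while maintaining high accuracy.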

Details Motivation: Existing supervised deep learning methods rely on extensive pixel-level annotation, making them unsuitable for clinical workflows, while few-shot learning struggles to capture the fine structural details needed for GBM analysis. Method: The proposed method, CoFi, first trains a lightweight neural network on only three annotated images to produce an initial coarse segmentation mask, then generates high-quality point prompts via morphology-aware pruning, and finally uses these prompts to guide SAM in refining the segmentation. Result: The method performs well on GBM segmentation, with a Dice coefficient of 74.54% and an inference speed of 1.9 FPS. Conclusion: CoFi is a fast and efficient coarse-to-fine few-shot segmentation pipeline for GBM segmentation in EM images; it reduces the annotation and computational burdens of conventional methods while achieving accurate and reliable segmentation results. Abstract: Accurate segmentation of the glomerular basement membrane (GBM) in electron microscopy (EM) images is fundamental for quantifying membrane thickness and supporting the diagnosis of various kidney diseases. While supervised deep learning approaches achieve high segmentation accuracy, their reliance on extensive pixel-level annotation renders them impractical for clinical workflows. Few-shot learning can reduce this annotation burden but often struggles to capture the fine structural details necessary for GBM analysis. In this study, we introduce CoFi, a fast and efficient coarse-to-fine few-shot segmentation pipeline designed for GBM delineation in EM images. CoFi first trains a lightweight neural network using only three annotated images to produce an initial coarse segmentation mask. This mask is then automatically processed to generate high-quality point prompts with morphology-aware pruning, which are subsequently used to guide SAM in refining the segmentation. The proposed method achieved exceptional GBM segmentation performance, with a Dice coefficient of 74.54% and an inference speed of 1.9 FPS. We demonstrate that CoFi not only alleviates the annotation and computational burdens associated with conventional methods, but also achieves accurate and reliable segmentation results. The pipeline's speed and annotation efficiency make it well-suited for research and hold strong potential for clinical applications in renal pathology. The pipeline is publicly available at: https://github.com/ddrrnn123/CoFi.
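
For readers unfamiliar with the Dice coefficient quoted above, the following is a minimal sketch of how it is commonly computed for binary masks, as 2|A∩B| / (|A| + |B|). The function name `dice_coefficient` and the nested-list mask format are illustrative assumptions and are not taken from the CoFi code base.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[Sequence[int]],
                     target: Sequence[Sequence[int]]) -> float:
    """Dice overlap between two binary masks given as nested sequences.

    Hypothetical helper for illustration only; any non-zero value is
    treated as foreground.
    """
    intersection = 0
    pred_area = 0
    target_area = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p:
                pred_area += 1
            if t:
                target_area += 1
    if pred_area + target_area == 0:
        # Both masks are empty: treat them as a perfect match by convention.
        return 1.0
    return 2.0 * intersection / (pred_area + target_area)


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    reference = [[1, 0], [0, 0]]
    # One shared foreground pixel, areas 2 and 1: 2 * 1 / 3
    print(round(dice_coefficient(predicted, reference), 4))
```

A reported value such as 74.54% corresponds to this ratio multiplied by 100, typically averaged over the test images.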

[119] TACR-YOLO: A Real-time Detection Framework for Abnormal Human Behaviors Enhanced with Coordinate and Task-Aware Representations

Xinyi Yin,Wenbo Yuan,Xuecheng Wu,Liangyu Fu,Danlei Huang

Main category: cs.CV

TL;DR: The paper introduces TACR-YOLO, an improved real-time framework for detecting abnormal human behaviors, particularly effective in challenging scenarios with small objects and multi-scale fusion issues.

Details Motivation: YOLO-based detection methods face challenges like small object detection, task conflicts, and multi-scale fusion in Abnormal Human Behavior Detection (AHBD), which this research aims to overcome. Method: The authors proposed TACR-YOLO, a new real-time framework incorporating a Coordinate Attention Module, a Task-Aware Attention Module, and a Strengthen Neck Network. They also optimized Anchor Box sizes using K-means clustering and deployed DIoU-Loss for improved bounding box regression. Result: TACR-YOLO achieved 91.92% mAP on the Personnel Anomalous Behavior Detection (PABD) dataset, showing competitive speed and robustness. Conclusion: TACR-YOLO provides new insights and advancements for abnormal behavior detection under special scenarios. Abstract: Abnormal Human Behavior Detection (AHBD) under special scenarios is becoming increasingly crucial. While YOLO-based detection methods excel in real-time tasks, they remain hindered by challenges including small objects, task conflicts, and multi-scale fusion in AHBD. To tackle them, we propose TACR-YOLO, a new real-time framework for AHBD. We introduce a Coordinate Attention Module to enhance small object detection, a Task-Aware Attention Module to deal with classification-regression conflicts, and a Strengthen Neck Network for refined multi-scale fusion, respectively. In addition, we optimize Anchor Box sizes using K-means clustering and deploy DIoU-Loss to improve bounding box regression. The Personnel Anomalous Behavior Detection (PABD) dataset, which includes 8,529 samples across four behavior categories, is also presented. Extensive experimental results indicate that TACR-YOLO achieves 91.92% mAP on PABD, with competitive speed and robustness. Ablation studies highlight the contribution of each improvement. This work provides new insights for abnormal behavior detection under special scenarios, advancing its progress.
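
As background on the DIoU-Loss mentioned above, here is a minimal sketch of the standard Distance-IoU loss for two axis-aligned boxes: 1 − IoU + d²/c², where d is the distance between box centers and c is the diagonal of the smallest enclosing box. The function name and the `(x1, y1, x2, y2)` box format are assumptions for illustration; the authors' implementation may differ.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def diou_loss(box_a: Box, box_b: Box) -> float:
    """Distance-IoU loss between two boxes (illustrative sketch only)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    center_dist_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    diag_sq = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2

    penalty = center_dist_sq / diag_sq if diag_sq > 0 else 0.0
    return 1.0 - iou + penalty


if __name__ == "__main__":
    print(diou_loss((0, 0, 1, 1), (0, 0, 1, 1)))  # identical boxes
    print(diou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes
```

Unlike a plain IoU loss, the center-distance penalty still provides a useful signal when the predicted and target boxes do not overlap at all.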

[120] OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring

Ruoxin Xiong,Yanyu Wang,Jiannan Cai,Kaijian Liu,Yuansheng Zhu,Pingbo Tang,Nora El-Gohary

Main category: cs.CV

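TL;DR: This paper systematically reviews visual datasets used in the construction industry, builds an open-source catalog, and proposes a roadmap for future data infrastructure based on the FAIR principles.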

Details Motivation: The construction industry's use of visual datasets to support AI and machine learning applications is hampered by resources of uneven quality and the lack of a systematic review, which limits the community's understanding of the dataset landscape and its ability to plan future directions. Method: Through an extensive search of academic databases and open-data platforms, the study collects 51 publicly available visual datasets spanning 2005 to 2024 and categorizes them using a structured data schema covering data fundamentals, data modalities, annotation frameworks, and downstream application domains. Result: The study compiles 51 publicly available visual datasets, builds the open-source catalog OpenConstruction to support the development of data-driven methods, identifies the main limitations of existing datasets, and proposes a roadmap for future development. Conclusion: By systematically compiling and analyzing existing visual datasets, the authors provide the open-source catalog OpenConstruction for data-driven method development in the construction industry and propose a roadmap for future data infrastructure based on the FAIR principles. Abstract: The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring. High-quality, domain-specific datasets, comprising images, videos, and point clouds, capture site geometry and spatiotemporal dynamics, including the location and interaction of objects, workers, and materials. However, despite growing interest in leveraging visual datasets, existing resources vary widely in size, data modalities, annotation quality, and representativeness of real-world construction conditions. A systematic review to categorize their data characteristics and application contexts is still lacking, limiting the community's ability to fully understand the dataset landscape, identify critical gaps, and guide future directions toward more effective, reliable, and scalable AI applications in construction. To address this gap, this study conducts an extensive search of academic databases and open-data platforms, yielding 51 publicly available visual datasets that span the 2005-2024 period. These datasets are categorized using a structured data schema covering (i) data fundamentals (e.g., size and license), (ii) data modalities (e.g., RGB and point cloud), (iii) annotation frameworks (e.g., bounding boxes), and (iv) downstream application domains (e.g., progress tracking). This study synthesizes these findings into an open-source catalog, OpenConstruction, supporting data-driven method development. 
Furthermore, the study discusses several critical limitations in the existing construction dataset landscape and presents a roadmap for future data infrastructure anchored in the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles. By reviewing the current landscape and outlining strategic priorities, this study supports the advancement of data-centric solutions in the construction sector.

[121] CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models

Xiaoxue Wu,Bingjie Gao,Yu Qiao,Yaohui Wang,Xinyuan Chen

Main category: cs.CV

TL;DR: CineTrans is a new video generation framework that produces coherent multi-shot videos with cinematic transitions.

Details Motivation: Despite significant progress in video synthesis, research on multi-shot video generation is still at an early stage. Existing models and large-scale datasets still cannot achieve shot transitions effectively, so generated videos are largely confined to single-shot sequences. Method: The authors propose the CineTrans framework and construct Cine250K, a multi-shot video-text dataset with detailed shot annotations. Analyzing existing video diffusion models, they find a correspondence between attention maps in the diffusion model and shot boundaries, and use it to design a mask-based control mechanism. This mechanism enables transitions at arbitrary positions and transfers effectively in a training-free setting. Result: After fine-tuning on Cine250K with the mask mechanism, CineTrans generates multi-shot video sequences that follow the film editing style, avoiding unstable transitions or naive concatenation of shots. In extensive experiments with specialized evaluation metrics, CineTrans significantly outperforms existing baselines on all criteria. Conclusion: CineTrans offers an innovative framework that addresses key problems in multi-shot video generation and provides an effective solution for generating high-quality videos with cinematic transitions. Abstract: Despite significant advances in video synthesis, research into multi-shot video generation remains in its infancy. Even with scaled-up models and massive datasets, the shot transition capabilities remain rudimentary and unstable, largely confining generated videos to single-shot sequences. In this work, we introduce CineTrans, a novel framework for generating coherent multi-shot videos with cinematic, film-style transitions. To facilitate insights into the film editing style, we construct a multi-shot video-text dataset Cine250K with detailed shot annotations. Furthermore, our analysis of existing video diffusion models uncovers a correspondence between attention maps in the diffusion model and shot boundaries, which we leverage to design a mask-based control mechanism that enables transitions at arbitrary positions and transfers effectively in a training-free setting. After fine-tuning on our dataset with the mask mechanism, CineTrans produces cinematic multi-shot sequences while adhering to the film editing style, avoiding unstable transitions or naive concatenations. Finally, we propose specialized evaluation metrics for transition control, temporal consistency and overall quality, and demonstrate through extensive experiments that CineTrans significantly outperforms existing baselines across all criteria.

[122] Automated Building Heritage Assessment Using Street-Level Imagery

Kristina Dabrock,Tim Johansson,Anna Donarelli,Mikael Mangold,Noah Pflugradt,Jann Michael Weinand,Jochen Linßen

Main category: cs.CV

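TL;DR: This study uses GPT and machine learning models to assess the cultural heritage value of buildings, supporting efficient and careful implementation of energy conservation measures in large-scale energy retrofits.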

Details Motivation: Quantifying energy conservation measures in buildings without compromising cultural heritage requires detailed data, and traditional inventories are costly and time-consuming, so more efficient solutions are needed. Method: The study uses the large language model GPT to detect cultural heritage value in building façade images and, combining these outputs with building register data as features, trains machine learning models to classify buildings. Result: Validated against an expert-created inventory, the machine learning model combining register data with GPT-derived features achieves a macro F1-score of 0.71; using only GPT-derived data, the score is 0.60. Conclusion: The proposed methodology can improve database quality, thereby supporting careful implementation of energy efficiency measures and integrated consideration of heritage value in large-scale energy refurbishment scenarios. Abstract: Detailed data is required to quantify energy conservation measures in buildings, such as envelope retrofits, without compromising cultural heritage. Novel artificial intelligence tools may improve efficiency in identifying heritage values in buildings compared to costly and time-consuming traditional inventories. In this study, the large language model GPT was used to detect various aspects of cultural heritage value in façade images. Using this data and building register data as features, machine learning models were trained to classify multi-family and non-residential buildings in Stockholm, Sweden. Validation against an expert-created inventory shows a macro F1-score of 0.71 using a combination of register data and features retrieved from GPT, and a score of 0.60 using only GPT-derived data. The presented methodology can contribute to a higher-quality database and thus support careful energy efficiency measures and integrated consideration of heritage value in large-scale energetic refurbishment scenarios.
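
The macro F1-score reported above averages per-class F1 values with equal weight, so rare heritage classes count as much as common ones. The sketch below shows the usual definition; the function name and the toy labels are hypothetical and do not come from the study's data.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores (illustrative sketch only)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred), key=str)
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels, purely for illustration.
    truth = ["heritage", "heritage", "none", "none"]
    guess = ["heritage", "none", "none", "none"]
    print(round(macro_f1(truth, guess), 4))
```

In this toy case the per-class scores are 2/3 and 4/5, so the macro average is about 0.7333.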

[123] Perception in Plan: Coupled Perception and Planning for End-to-End Autonomous Driving

Bozhou Zhang,Jingyu Li,Nan Song,Li Zhang

Main category: cs.CV

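TL;DR: This paper proposes VeteranAD, a new end-to-end autonomous driving framework that improves driving performance by integrating perception into the planning process.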

Details Motivation: Existing end-to-end autonomous driving methods mainly follow a perception-planning paradigm in which perception and planning are executed sequentially, leaving room to improve planning performance. Method: The paper introduces the VeteranAD framework, which adopts a perception-in-plan design, uses multi-mode anchored trajectories as planning priors, and applies an autoregressive strategy to progressively predict future trajectories. Result: Experiments on the NAVSIM and Bench2Drive datasets show that VeteranAD achieves state-of-the-art performance. Conclusion: VeteranAD achieves state-of-the-art performance; by tightly coupling the perception module with the planning process, it improves the accuracy and reliability of autonomous driving. Abstract: End-to-end autonomous driving has achieved remarkable advancements in recent years. Existing methods primarily follow a perception-planning paradigm, where perception and planning are executed sequentially within a fully differentiable framework for planning-oriented optimization. We further advance this paradigm through a perception-in-plan framework design, which integrates perception into the planning process. This design facilitates targeted perception guided by evolving planning objectives over time, ultimately enhancing planning performance. Building on this insight, we introduce VeteranAD, a coupled perception and planning framework for end-to-end autonomous driving. By incorporating multi-mode anchored trajectories as planning priors, the perception module is specifically designed to gather traffic elements along these trajectories, enabling comprehensive and targeted perception. Planning trajectories are then generated based on both the perception results and the planning priors. To make perception fully serve planning, we adopt an autoregressive strategy that progressively predicts future trajectories while focusing on relevant regions for targeted perception at each step. With this simple yet effective design, VeteranAD fully unleashes the potential of planning-oriented end-to-end methods, leading to more accurate and reliable driving behavior. Extensive experiments on the NAVSIM and Bench2Drive datasets demonstrate that our VeteranAD achieves state-of-the-art performance.

[124] Hierarchical Graph Feature Enhancement with Adaptive Frequency Modulation for Visual Recognition

Feiyue Zhao,Zhichao Zhang

Main category: cs.CV

TL;DR: This paper proposes HGFE, a lightweight module that integrates graph-based reasoning into CNNs to better capture local and global relationships in images, resulting in improved performance across multiple vision tasks.

Details Motivation: CNNs struggle with modeling complex topological relationships and non-local semantics due to their reliance on regular grid structures, which HGFE aims to address. Method: The HGFE framework builds two levels of graph structures (intra-window graph convolution for local spatial dependencies and inter-window supernode interactions for global semantic relationships) and introduces an adaptive frequency modulation module to balance signal propagation. Result: Experiments on CIFAR-100, PASCAL VOC, VisDrone, CrackSeg, and CarParts datasets validated the effectiveness of HGFE in improving structural awareness and feature representation. Conclusion: The proposed HGFE module improves structural representation and enhances overall recognition performance in CNNs by integrating graph-based reasoning. Abstract: Convolutional neural networks (CNNs) have demonstrated strong performance in visual recognition tasks, but their inherent reliance on regular grid structures limits their capacity to model complex topological relationships and non-local semantics within images. To address this limitation, we propose the hierarchical graph feature enhancement (HGFE), a novel framework that integrates graph-based reasoning into CNNs to enhance both structural awareness and feature representation. HGFE builds two complementary levels of graph structures: intra-window graph convolution to capture local spatial dependencies and inter-window supernode interactions to model global semantic relationships. Moreover, we introduce an adaptive frequency modulation module that dynamically balances low-frequency and high-frequency signal propagation, preserving critical edge and texture information while mitigating over-smoothing. The proposed HGFE module is lightweight, end-to-end trainable, and can be seamlessly integrated into standard CNN backbone networks. 
Extensive experiments on CIFAR-100 (classification), PASCAL VOC, and VisDrone (detection), as well as CrackSeg and CarParts (segmentation), validated the effectiveness of the HGFE in improving structural representation and enhancing overall recognition performance.

[125] Handwritten Text Recognition of Historical Manuscripts Using Transformer-Based Models

Erez Meoded

Main category: cs.CV

TL;DR: This study improves the recognition of historical handwritten texts by introducing specialized data augmentation techniques and ensemble strategies, significantly reducing error rates on a 16th-century Latin manuscript dataset.

Details Motivation: The motivation stems from the challenges in digitizing historical documents due to scarce transcriptions, linguistic variation, and diverse handwriting styles, which hinder the digitization process. Method: The researchers applied TrOCR, a transformer-based HTR model, and introduced four novel data augmentation methods tailored for historical handwriting. They also explored ensemble learning approaches. Result: On the Gwalther dataset, the best single-model augmentation (Elastic) achieved a Character Error Rate (CER) of 1.86, while a top-5 voting ensemble achieved a CER of 1.60, showing a significant improvement over previous methods. Conclusion: The study concludes that domain-specific data augmentation techniques and ensemble strategies significantly enhance the performance of HTR models on historical manuscripts. Abstract: Historical handwritten text recognition (HTR) is essential for unlocking the cultural and scholarly value of archival documents, yet digitization is often hindered by scarce transcriptions, linguistic variation, and highly diverse handwriting styles. In this study, we apply TrOCR, a state-of-the-art transformer-based HTR model, to 16th-century Latin manuscripts authored by Rudolf Gwalther. We investigate targeted image preprocessing and a broad suite of data augmentation techniques, introducing four novel augmentation methods designed specifically for historical handwriting characteristics. We also evaluate ensemble learning approaches to leverage the complementary strengths of augmentation-trained models. On the Gwalther dataset, our best single-model augmentation (Elastic) achieves a Character Error Rate (CER) of 1.86, while a top-5 voting ensemble achieves a CER of 1.60 - representing a 50% relative improvement over the best reported TrOCR_BASE result and a 42% improvement over the previous state of the art. 
These results highlight the impact of domain-specific augmentations and ensemble strategies in advancing HTR performance for historical manuscripts.
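
To make the Character Error Rate figures above concrete, here is a minimal sketch of the usual definition: character-level edit distance between the reference and the hypothesis, divided by the reference length. It assumes the reported values (1.86, 1.60) are percentages, which the abstract does not state explicitly, and the function names are illustrative rather than taken from the paper.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER in percent (assumed unit), relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of four.
    print(character_error_rate("abcd", "abxd"))
```

Because the denominator is the reference length, CER can exceed 100% when the hypothesis contains many spurious insertions.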

[126] AIM: Amending Inherent Interpretability via Self-Supervised Masking

Eyad Alshami,Shashank Agnihotri,Bernt Schiele,Margret Keuper

Main category: cs.CV

TL;DR: AIM improves deep neural network interpretability and performance by promoting the use of genuine features through self-supervised feature masking.

Details Motivation: The motivation is to address the issue of deep neural networks utilizing both genuine and spurious features, aiming to enhance model interpretability and generalization without additional annotations. Method: The paper introduces "Amending Inherent Interpretability via Self-Supervised Masking" (AIM), which uses features from multiple encoding stages to guide a self-supervised, sample-specific feature-masking process. Result: AIM demonstrated significant improvements in interpretability, measured by the Energy Pointing Game (EPG) score, and accuracy gains across various datasets and architectures. Conclusion: The paper concludes that AIM effectively enhances the interpretability of deep neural networks while improving their performance by promoting the use of genuine features. Abstract: It has been observed that deep neural networks (DNNs) often use both genuine as well as spurious features. In this work, we propose "Amending Inherent Interpretability via Self-Supervised Masking" (AIM), a simple yet interestingly effective method that promotes the network's utilization of genuine features over spurious alternatives without requiring additional annotations. In particular, AIM uses features at multiple encoding stages to guide a self-supervised, sample-specific feature-masking process. As a result, AIM enables the training of well-performing and inherently interpretable models that faithfully summarize the decision process. We validate AIM across a diverse range of challenging datasets that test both out-of-distribution generalization and fine-grained visual understanding. These include general-purpose classification benchmarks such as ImageNet100, HardImageNet, and ImageWoof, as well as fine-grained classification datasets such as Waterbirds, TravelingBirds, and CUB-200. AIM demonstrates significant dual benefits: interpretability improvements, as measured by the Energy Pointing Game (EPG) score, and accuracy gains over strong baselines. 
These consistent gains across domains and architectures provide compelling evidence that AIM promotes the use of genuine and meaningful features that directly contribute to improved generalization and human-aligned interpretability.
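
The EPG score mentioned above is not defined in this digest. As a rough sketch, assuming the common definition (the share of positive attribution that falls inside the annotated object box), it could be computed as follows. The function name, the list-of-lists attribution map, and the box convention are illustrative assumptions, not the authors' code.

```python
def epg_score(attribution, box):
    """Fraction of positive attribution energy that falls inside `box`.

    attribution: 2D list of floats (rows x cols), one value per pixel.
    box: (top, left, bottom, right), with bottom/right exclusive.
    Assumed definition; negative attributions are clipped to zero.
    """
    top, left, bottom, right = box
    total = 0.0
    inside = 0.0
    for r, row in enumerate(attribution):
        for c, value in enumerate(row):
            positive = max(value, 0.0)
            total += positive
            if top <= r < bottom and left <= c < right:
                inside += positive
    # With no positive attribution anywhere, report 0.0 rather than divide by zero.
    return inside / total if total > 0 else 0.0
```

A score near 1.0 means the model's positive evidence is concentrated on the annotated object, which is the behaviour the abstract associates with using genuine rather than spurious features.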

[127] A Real-time Concrete Crack Detection and Segmentation Model Based on YOLOv11

Shaoze Huang,Qi Liu,Chao Chen,Yuhang Chen

Main category: cs.CV

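TL;DR: This paper proposes YOLOv11-KW-TA-FP, a multi-task concrete crack detection and segmentation model based on the YOLOv11n architecture, to address inefficient crack inspection of transportation infrastructure and the weak performance of existing deep learning models.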

Details Motivation: Transportation infrastructure in the Yangtze River Delta region is aging rapidly, and crack deterioration seriously threatens structural integrity and regional economic growth. Manual inspection is inefficient, and existing deep learning models perform poorly on small-target cracks against complex backgrounds, so a more efficient solution is needed. Method: The model integrates a three-stage optimization framework: (1) dynamic KernelWarehouse convolution (KWConv) is embedded in the backbone network to enhance feature representation through a dynamic kernel sharing mechanism; (2) a triple attention mechanism (TA) is added to the feature pyramid to strengthen channel-spatial interaction modeling; (3) an FP-IoU loss function is designed to facilitate adaptive bounding box regression penalization. Result: Experiments show that the improved model achieves significant gains over the baseline, reaching 91.3% precision, 76.6% recall, and 86.4% mAP@50. Ablation studies confirm the synergy of the proposed modules, and robustness tests show stable performance under data scarcity and noise interference. Conclusion: This research provides an efficient computer vision solution for automated infrastructure inspection with substantial practical engineering value. Abstract: Accelerated aging of transportation infrastructure in the rapidly developing Yangtze River Delta region necessitates efficient concrete crack detection, as crack deterioration critically compromises structural integrity and regional economic growth. To overcome the limitations of inefficient manual inspection and the suboptimal performance of existing deep learning models, particularly for small-target crack detection within complex backgrounds, this paper proposes YOLOv11-KW-TA-FP, a multi-task concrete crack detection and segmentation model based on the YOLOv11n architecture. The proposed model integrates a three-stage optimization framework: (1) Embedding dynamic KernelWarehouse convolution (KWConv) within the backbone network to enhance feature representation through a dynamic kernel sharing mechanism; (2) Incorporating a triple attention mechanism (TA) into the feature pyramid to strengthen channel-spatial interaction modeling; and (3) Designing an FP-IoU loss function to facilitate adaptive bounding box regression penalization. Experimental validation demonstrates that the enhanced model achieves significant performance improvements over the baseline, attaining 91.3% precision, 76.6% recall, and 86.4% mAP@50. Ablation studies confirm the synergistic efficacy of the proposed modules. Furthermore, robustness tests indicate stable performance under conditions of data scarcity and noise interference.
This research delivers an efficient computer vision solution for automated infrastructure inspection, exhibiting substantial practical engineering value.

[128] Multi-State Tracker: Enhancing Efficient Object Tracking via Multi-State Specialization and Interaction

Shilei Wang,Gong Cheng,Pujian Lai,Dong Gao,Junwei Han

Main category: cs.CV

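TL;DR: This paper proposes MST, an efficient tracker built on lightweight SSE and CSI modules that markedly improves tracking accuracy and robustness while keeping computational overhead low.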

Details Motivation: Efficient trackers usually sacrifice feature representation capacity by reducing computational complexity and model parameters, which makes it hard to capture target states accurately. A new method that balances efficiency with representation capacity is therefore needed. Method: The Multi-State Tracker (MST) combines lightweight state-specific enhancement (SSE) and cross-state interaction (CSI) modules, generating multi-state features at multiple stages and aggregating them adaptively. Result: MST outperforms previous efficient trackers on multiple datasets. On GOT-10K, its AO score is 4.5% higher than that of the previous SOTA efficient tracker HCAT, while adding only 0.1 GFLOPs of computation and 0.66 M parameters. Conclusion: By introducing the SSE and CSI modules, MST markedly improves feature representation capacity and tracking robustness while remaining efficient, surpassing existing efficient trackers. Abstract: Efficient trackers achieve faster runtime by reducing computational complexity and model parameters. However, this efficiency often comes at the expense of weakened feature representation capacity, thus limiting their ability to accurately capture target states using single-layer features. To overcome this limitation, we propose Multi-State Tracker (MST), which utilizes highly lightweight state-specific enhancement (SSE) to perform specialized enhancement on multi-state features produced by multi-state generation (MSG) and aggregates them in an interactive and adaptive manner using cross-state interaction (CSI). This design greatly enhances feature representation while incurring minimal computational overhead, leading to improved tracking robustness in complex environments. Specifically, the MSG generates multiple state representations at multiple stages during feature extraction, while SSE refines them to highlight target-specific features. The CSI module facilitates information exchange between these states and ensures the integration of complementary features. Notably, the introduced SSE and CSI modules adopt a highly lightweight hidden state adaptation-based state space duality (HSA-SSD) design, incurring only 0.1 GFLOPs in computation and 0.66 M in parameters. Experimental results demonstrate that MST outperforms all previous efficient trackers across multiple datasets, significantly improving tracking accuracy and robustness. In particular, it shows excellent runtime performance, with an AO score improvement of 4.5% over the previous SOTA efficient tracker HCAT on the GOT-10K dataset.
The code is available at https://github.com/wsumel/MST.

[129] An Efficient Medical Image Classification Method Based on a Lightweight Improved ConvNeXt-Tiny Architecture

Jingsong Xia,Yue Yin,Xiuhan Li

Main category: cs.CV

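TL;DR: This study proposes an improved ConvNeXt-Tiny architecture for efficient and accurate medical image classification in resource-constrained environments.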

Details Motivation: Achieving efficient, high-accuracy medical image classification in resource-constrained computational environments remains challenging. Method: Building on an improved ConvNeXt-Tiny architecture, the method introduces a dual global pooling feature fusion strategy, a lightweight channel attention module (SEVector), and a Feature Smoothing Loss. Result: Using only a CPU (8 threads), the method reaches a maximum classification accuracy of 89.10% on the test set within 10 training epochs, and the loss shows a stable convergence trend. Conclusion: The proposed method effectively improves medical image classification performance in resource-limited settings, providing a feasible and efficient solution for deploying and promoting medical imaging analysis models. Abstract: Intelligent analysis of medical imaging plays a crucial role in assisting clinical diagnosis. However, achieving efficient and high-accuracy image classification in resource-constrained computational environments remains challenging. This study proposes a medical image classification method based on an improved ConvNeXt-Tiny architecture. Through structural optimization and loss function design, the proposed method enhances feature extraction capability and classification performance while reducing computational complexity. Specifically, the method introduces a dual global pooling (Global Average Pooling and Global Max Pooling) feature fusion strategy into the ConvNeXt-Tiny backbone to simultaneously preserve global statistical features and salient response information. A lightweight channel attention module, termed Squeeze-and-Excitation Vector (SEVector), is designed to improve the adaptive allocation of channel weights while minimizing parameter overhead. Additionally, a Feature Smoothing Loss is incorporated into the loss function to enhance intra-class feature consistency and suppress intra-class variance. Under CPU-only conditions (8 threads), the method achieves a maximum classification accuracy of 89.10% on the test set within 10 training epochs, exhibiting a stable convergence trend in loss values. Experimental results demonstrate that the proposed method effectively improves medical image classification performance in resource-limited settings, providing a feasible and efficient solution for the deployment and promotion of medical imaging analysis models.

[130] Reinforcing Video Reasoning Segmentation to Think Before It Segments

Sitong Gong,Lu Zhang,Yunzhi Zhuge,Xu Jia,Pingping Zhang,Huchuan Lu

Main category: cs.CV

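TL;DR: Veason-R1 combines reinforcement learning with a chain-of-thought strategy to achieve a marked performance gain on video reasoning segmentation.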

Details Motivation: Traditional methods are limited by poor interpretability and inadequate spatiotemporal reasoning, so an approach that emphasizes structured reasoning is needed. Method: Veason-R1 is trained with Group Relative Policy Optimization (GRPO) and Chain-of-Thought (CoT) initialization, with a reward mechanism that optimizes the reasoning chains. Result: Veason-R1 significantly outperforms prior work on benchmarks such as ReVOS and ReasonVOS, with gains of +1.3 J&F and +10.0 J&F respectively, and improves robustness to hallucinations by +8.8 R. Conclusion: By combining reinforcement learning with a chain-of-thought strategy, Veason-R1 achieves SOTA performance on video reasoning segmentation while improving robustness to hallucinations. Abstract: Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and frame-level spatial grounding, yielding the supervised fine-tuned model Veason-SFT. Subsequently, GRPO fine-tuning encourages efficient exploration of the reasoning space by optimizing reasoning chains. To this end, we incorporate a holistic reward mechanism that synergistically enhances spatial alignment and temporal consistency, bolstering keyframe localization and fine-grained grounding. Comprehensive empirical evaluations demonstrate that Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins (e.g., +1.3 J&F in ReVOS and +10.0 J&F in ReasonVOS), while exhibiting robustness to hallucinations (+8.8 R). Our code and model weights will be available at Veason-R1.

[131] Training-Free Anomaly Generation via Dual-Attention Enhancement in Diffusion Model

Zuo Zuo,Jiahao Dong,Yanyun Qu,Zongze Wu

Main category: cs.CV

TL;DR: This paper proposes a training-free framework called AAG based on Stable Diffusion for generating realistic industrial anomalies, utilizing Cross-Attention Enhancement and Self-Attention Enhancement to improve fidelity and coherence.

Details Motivation: Industrial anomaly detection faces the challenge of data scarcity, particularly in the lack of sufficient anomaly data. Existing anomaly generation methods suffer from low fidelity or require extra training data. Method: The proposed AAG framework utilizes Stable Diffusion along with Cross-Attention Enhancement (CAE) and Self-Attention Enhancement (SAE) to generate anomalies without requiring additional training. Result: Extensive experiments on MVTec AD and VisA datasets demonstrate the effectiveness of AAG in anomaly generation and its utility for improving downstream tasks. Conclusion: AAG can effectively generate realistic and natural anomalies in specific regions of normal images, enhancing downstream anomaly inspection tasks. Abstract: Industrial anomaly detection (AD) plays a significant role in manufacturing, where a long-standing challenge is data scarcity. A growing body of work has emerged to address insufficient anomaly data via anomaly generation. However, these anomaly generation methods suffer from a lack of fidelity or need to be trained with extra data. To this end, we propose a training-free anomaly generation framework dubbed AAG, which is based on Stable Diffusion (SD)'s strong generation ability for effective anomaly image generation. Given a normal image, a mask, and a simple text prompt, AAG can generate realistic and natural anomalies in the specified regions while keeping the content of other regions unchanged. In particular, we propose Cross-Attention Enhancement (CAE) to re-engineer the cross-attention mechanism within Stable Diffusion based on the given mask. CAE increases the similarity between visual tokens in specific regions and text embeddings, which guides the generated visual tokens to follow the text description. Besides, generated anomalies need to look natural and plausible alongside the object in the given image.
We propose Self-Attention Enhancement (SAE), which improves the similarity between each normal visual token and the anomaly visual tokens. SAE ensures that generated anomalies are coherent with the original pattern. Extensive experiments on the MVTec AD and VisA datasets demonstrate the effectiveness of AAG in anomaly generation and its utility. Furthermore, anomaly images generated by AAG can bolster the performance of various downstream anomaly inspection tasks.

[132] TrajSV: A Trajectory-based Model for Sports Video Representations and Applications

Zheng Wang,Shihao Xu,Wei Shi

Main category: cs.CV

TL;DR: This paper introduces TrajSV, a novel trajectory-based framework for sports analytics that significantly improves performance in video retrieval, action spotting, and video captioning while requiring minimal supervision.

Details Motivation: The study is motivated by unresolved issues in sports analytics, such as data unavailability, the lack of an effective trajectory-based framework, and the need for sufficient supervision labels. The authors aim to address these challenges and improve the analysis of sports broadcast videos. Method: The paper proposes TrajSV, a trajectory-based framework comprising three components: data preprocessing, Clip Representation Network (CRNet), and Video Representation Network (VRNet). CRNet uses a trajectory-enhanced Transformer to learn clip representations, while VRNet aggregates clip representations and visual features using an encoder-decoder structure. A triple contrastive loss optimizes representations in an unsupervised manner. Result: TrajSV achieves state-of-the-art performance in sports video retrieval, with nearly a 70% improvement. It outperforms baselines in action spotting, achieving top results in 9 out of 17 action categories, and shows a nearly 20% improvement in video captioning. The proposed framework is validated on three sports (soccer, basketball, volleyball) across three downstream applications. Conclusion: TrajSV demonstrates significant improvements in sports video retrieval, action spotting, and video captioning, achieving state-of-the-art results in many cases. It addresses existing challenges in sports analytics by offering an effective trajectory-based framework that operates efficiently with minimal supervision labels. Abstract: Sports analytics has received significant attention from both academia and industry in recent years. Despite the growing interest and efforts in this field, several issues remain unresolved, including (1) data unavailability, (2) lack of an effective trajectory-based framework, and (3) requirement for sufficient supervision labels. In this paper, we present TrajSV, a trajectory-based framework that addresses various issues in existing studies. 
TrajSV comprises three components: data preprocessing, Clip Representation Network (CRNet), and Video Representation Network (VRNet). The data preprocessing module extracts player and ball trajectories from sports broadcast videos. CRNet utilizes a trajectory-enhanced Transformer module to learn clip representations based on these trajectories. Additionally, VRNet learns video representations by aggregating clip representations and visual features with an encoder-decoder architecture. Finally, a triple contrastive loss is introduced to optimize both video and clip representations in an unsupervised manner. The experiments are conducted on three broadcast video datasets to verify the effectiveness of TrajSV for three types of sports (i.e., soccer, basketball, and volleyball) with three downstream applications (i.e., sports video retrieval, action spotting, and video captioning). The results demonstrate that TrajSV achieves state-of-the-art performance in sports video retrieval, showcasing a nearly 70% improvement. It outperforms baselines in action spotting, achieving state-of-the-art results in 9 out of 17 action categories, and demonstrates a nearly 20% improvement in video captioning. Additionally, we introduce a deployed system along with the three applications based on TrajSV.

[133] Causality Matters: How Temporal Information Emerges in Video Language Models

Yumeng Shi,Quanyu Long,Yin Wu,Wenya Wang

Main category: cs.CV

TL;DR: This paper investigates how VideoLMs understand temporal information and finds that temporal reasoning arises from inter-frame attention rather than positional encodings. Two efficiency strategies are proposed.

Details Motivation: Temporal understanding in VideoLMs remains a challenge despite progress in multimodal understanding. Previous works focus on positional encodings, but their actual impact was unclear. Method: The researchers conducted extensive analysis experiments to trace how temporal information is processed within VideoLMs. They evaluated the impact of altering frame sequences and positional encodings. Result: Removing or modifying positional encodings had minimal impact on performance, while reversing frame sequences caused a significant drop. This led to the discovery of a causal pathway for temporal reasoning. Conclusion: The study uncovers a causal information pathway where temporal reasoning emerges from inter-visual token interactions constrained by causal attention. Two efficiency strategies are proposed based on these insights. Abstract: Video language models (VideoLMs) have made significant progress in multimodal understanding. However, temporal understanding, which involves identifying event order, duration, and relationships across time, still remains a core challenge. Prior works emphasize positional encodings (PEs) as a key mechanism for encoding temporal structure. Surprisingly, we find that removing or modifying PEs in video inputs yields minimal degradation in the performance of temporal understanding. In contrast, reversing the frame sequence while preserving the original PEs causes a substantial drop. To explain this behavior, we conduct substantial analysis experiments to trace how temporal information is integrated within the model. We uncover a causal information pathway: temporal cues are progressively synthesized through inter-frame attention, aggregated in the final frame, and subsequently integrated into the query tokens. This emergent mechanism shows that temporal reasoning emerges from inter-visual token interactions under the constraints of causal attention, which implicitly encodes temporal structure. 
Based on these insights, we propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation. Experiments on two benchmarks validate the effectiveness of both approaches. To the best of our knowledge, this is the first work to systematically investigate video temporal understanding in VideoLMs, offering insights for future model improvement.

[134] DashCam Video: A complementary low-cost data stream for on-demand forest-infrastructure system monitoring

Durga Joshi,Chandi Witharana,Robert Fahey,Thomas Worthley,Zhe Zhu,Diego Cerrai

Main category: cs.CV

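TL;DR: This study develops a low-cost, reproducible framework that uses dashcam video for real-time, object-level structural assessment and geolocation of urban infrastructure and vegetation.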

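Details Motivation: The work aims to use commonly available but underutilized dashcam video for real-time, object-level structural assessment and geolocation of urban roadside vegetation and infrastructure. Method: The authors developed an end-to-end pipeline that combines monocular depth estimation, depth error correction, and geometric triangulation to produce accurate spatial and structural data from vehicle-mounted dashcam video. Depth maps are first estimated with a state-of-the-art monocular depth model and then refined with a gradient-boosted regression framework to correct underestimation for distant objects. Object locations are estimated with GPS-based triangulation, and object heights are computed with pinhole camera geometry. Result: The depth correction model achieved strong predictive performance on the transformed scale (R² = 0.92, MAE = 0.31), significantly reducing bias beyond 15 m. The method was evaluated under different camera placements and vehicle speeds. The low-speed, inside-camera configuration was the most accurate, with a mean geolocation error of 2.83 m and a mean absolute error (MAE) in height estimation of 2.09 m for trees and 0.88 m for poles. Conclusion: The study presents a novel, low-cost, and reproducible framework that uses dashcam video for real-time, object-level structural assessment and geolocation, offering a fast, real-time, and cost-effective solution for monitoring risks to urban vegetation and infrastructure. Abstract: Our study introduces a novel, low-cost, and reproducible framework for real-time, object-level structural assessment and geolocation of roadside vegetation and infrastructure with commonly available but underutilized dashboard camera (dashcam) video data. We developed an end-to-end pipeline that combines monocular depth estimation, depth error correction, and geometric triangulation to generate accurate spatial and structural data from street-level video streams from vehicle-mounted dashcams. Depth maps were first estimated using a state-of-the-art monocular depth model, then refined via a gradient-boosted regression framework to correct underestimations, particularly for distant objects. The depth correction model achieved strong predictive performance (R² = 0.92, MAE = 0.31 on transformed scale), significantly reducing bias beyond 15 m. Further, object locations were estimated using GPS-based triangulation, while object heights were calculated using pinhole camera geometry. Our method was evaluated under varying conditions of camera placement and vehicle speed. A low-speed vehicle with an inside camera gave the highest accuracy, with a mean geolocation error of 2.83 m and a mean absolute error (MAE) in height estimation of 2.09 m for trees and 0.88 m for poles.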
To the best of our knowledge, it is the first framework to combine monocular depth modeling, triangulated GPS-based geolocation, and real-time structural assessment for urban vegetation and infrastructure using consumer-grade video data. Our approach complements conventional remote sensing (RS) methods, such as LiDAR and imagery, by offering a fast, real-time, and cost-effective solution for object-level monitoring of vegetation risks and infrastructure exposure, making it especially valuable for utility companies and urban planners aiming for scalable and frequent assessments in dynamic urban environments.
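
The abstract says object heights come from pinhole camera geometry but gives no formula. A minimal sketch, assuming the standard similar-triangles relation (real height / depth = pixel height / focal length in pixels) and an object roughly parallel to the image plane, might look like this. The function and parameter names are hypothetical, and the paper's depth correction step is not modelled here.

```python
def object_height_m(pixel_height, depth_m, focal_length_px):
    """Estimate real-world height from apparent height in the image.

    pixel_height: vertical extent of the object in the frame, in pixels.
    depth_m: distance from camera to object in metres (e.g. a corrected depth estimate).
    focal_length_px: camera focal length expressed in pixels.
    Assumes an ideal pinhole camera with no lens distortion.
    """
    if focal_length_px <= 0:
        raise ValueError("focal_length_px must be positive")
    return pixel_height * depth_m / focal_length_px
```

For example, a pole spanning 300 pixels at a depth of 20 m, seen through a camera with a 1000-pixel focal length, comes out at 300 × 20 / 1000 = 6 m. Because the estimate scales linearly with depth, any remaining depth bias carries straight through to the height.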

[135] CoreEditor: Consistent 3D Editing via Correspondence-constrained Diffusion

Zhe Zhu,Honghua Chen,Peng Li,Mingqiang Wei

Main category: cs.CV

TL;DR: CoreEditor is a novel framework for consistent text-to-3D editing that enforces precise interactions between pixels and allows users to choose preferred results.

Details Motivation: Existing approaches often fail to maintain cross-view consistency when adapting pre-trained 2D image editors to multi-view inputs, leading to insufficient edits and blurry details. Method: CoreEditor uses a correspondence-constrained attention mechanism and a selective editing pipeline for consistent text-to-3D editing. Result: CoreEditor produces high-quality, 3D-consistent edits with sharper details. Conclusion: CoreEditor provides high-quality, 3D-consistent edits with sharper details and significantly outperforms prior methods. Abstract: Text-driven 3D editing seeks to modify 3D scenes according to textual descriptions, and most existing approaches tackle this by adapting pre-trained 2D image editors to multi-view inputs. However, without explicit control over multi-view information exchange, they often fail to maintain cross-view consistency, leading to insufficient edits and blurry details. We introduce CoreEditor, a novel framework for consistent text-to-3D editing. The key innovation is a correspondence-constrained attention mechanism that enforces precise interactions between pixels expected to remain consistent throughout the diffusion denoising process. Beyond relying solely on geometric alignment, we further incorporate semantic similarity estimated during denoising, enabling more reliable correspondence modeling and robust multi-view editing. In addition, we design a selective editing pipeline that allows users to choose preferred results from multiple candidates, offering greater flexibility and user control. Extensive experiments show that CoreEditor produces high-quality, 3D-consistent edits with sharper details, significantly outperforming prior methods.

[136] LoRAtorio: An intrinsic approach to LoRA Skill Composition

Niki Foteinopoulou,Ignas Budvytis,Stephan Liwicki

Main category: cs.CV

TL;DR: LoRAtorio is a new framework for combining multiple LoRA adapters in text-to-image models, using latent space analysis and cosine similarity to guide composition, achieving superior results without retraining.

Details Motivation: Current LoRA composition methods struggle with combining multiple adapters, especially in open-ended scenarios. This work addresses this limitation by analyzing model behavior in-distribution vs. out-of-distribution. Method: LoRAtorio operates by analyzing the latent space, computing cosine similarity between predicted noise patches and the base model, and creating a weight matrix for aggregating LoRA outputs. It also modifies classifier-free guidance to address domain drift. Result: LoRAtorio achieves a 1.3% improvement in ClipScore and a 72.43% win rate in GPT-4V pairwise evaluations, showing strong performance and generalization across multiple diffusion models. Conclusion: LoRAtorio provides a novel method to effectively combine multiple LoRA adapters for text-to-image diffusion models, achieving state-of-the-art results and generalizing well across different models. Abstract: Low-Rank Adaptation (LoRA) has become a widely adopted technique in text-to-image diffusion models, enabling the personalisation of visual concepts such as characters, styles, and objects. However, existing approaches struggle to effectively compose multiple LoRA adapters, particularly in open-ended settings where the number and nature of required skills are not known in advance. In this work, we present LoRAtorio, a novel train-free framework for multi-LoRA composition that leverages intrinsic model behaviour. Our method is motivated by two key observations: (1) LoRA adapters trained on narrow domains produce denoised outputs that diverge from the base model, and (2) when operating out-of-distribution, LoRA outputs show behaviour closer to the base model than when conditioned in distribution. The balance between these two observations allows for exceptional performance in the single LoRA scenario, which nevertheless deteriorates when multiple LoRAs are loaded. 
Our method operates in the latent space by dividing it into spatial patches and computing cosine similarity between each patch's predicted noise and that of the base model. These similarities are used to construct a spatially-aware weight matrix, which guides a weighted aggregation of LoRA outputs. To address domain drift, we further propose a modification to classifier-free guidance that incorporates the base model's unconditional score into the composition. We extend this formulation to a dynamic module selection setting, enabling inference-time selection of relevant LoRA adapters from a large pool. LoRAtorio achieves state-of-the-art performance, showing up to a 1.3% improvement in ClipScore and a 72.43% win rate in GPT-4V pairwise evaluations, and generalises effectively to multiple latent diffusion models.
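
The patch-wise weighting described above can be illustrated roughly as follows. The abstract does not say how similarities become weights, so this sketch assumes a softmax over negated cosine similarity, giving more weight to adapters whose predicted noise diverges from the base model in a patch (following observation (1)). All names and the `temperature` parameter are illustrative, and patches are plain flattened lists rather than latent tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def patch_weights(base_patches, lora_patches, temperature=1.0):
    """Return weights[p][k]: the weight of LoRA k in patch p.

    base_patches: list of P flattened noise patches from the base model.
    lora_patches: list of K lists, each holding P patches from one LoRA.
    Assumption: weight = softmax over k of (-cosine similarity / temperature).
    """
    weights = []
    for p, base in enumerate(base_patches):
        scores = [-cosine(lora[p], base) / temperature for lora in lora_patches]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def compose(lora_patches, weights):
    """Weighted sum of LoRA noise predictions, computed patch by patch."""
    composed = []
    for p, patch_w in enumerate(weights):
        dim = len(lora_patches[0][p])
        composed.append([
            sum(patch_w[k] * lora_patches[k][p][d] for k in range(len(patch_w)))
            for d in range(dim)
        ])
    return composed
```

With two adapters, one matching the base model's noise in a patch and one orthogonal to it, the orthogonal adapter receives the larger weight there. The classifier-free guidance modification and the dynamic module selection mentioned in the abstract are not covered by this sketch.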

[137] Is ChatGPT-5 Ready for Mammogram VQA?

Qiang Li,Shansong Wang,Mingzhe Hu,Mojtaba Safari,Zachary Eidex,Xiaofeng Yang

Main category: cs.CV

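TL;DR: The study shows that GPT-5 performs reasonably well on mammogram VQA tasks but still falls short of the requirements for high-stakes clinical use.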

Details Motivation: Mammogram visual question answering (VQA) combines image interpretation with clinical reasoning and has the potential to support breast cancer screening. Method: The authors systematically evaluated the GPT-5 family and GPT-4o on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification. Result: GPT-5 was consistently the best-performing model but lagged behind human experts and domain-specific fine-tuned models. On EMBED, GPT-5 scored highest in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it reached 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it reached 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Compared with human expert estimations, GPT-5 had lower sensitivity (63.5%) and specificity (52.3%). Conclusion: Although GPT-5 shows some capability on mammogram VQA tasks, without targeted domain adaptation and optimization its performance remains insufficient for high-stakes clinical imaging applications. The marked improvement from GPT-4o to GPT-5 points to a promising trend for general large language models (LLMs) on mammography VQA tasks. Abstract: Mammogram visual question answering (VQA) integrates image interpretation with clinical reasoning and has potential to support breast cancer screening. We systematically evaluated the GPT-5 family and GPT-4o model on four public mammography datasets (EMBED, InBreast, CMMD, CBIS-DDSM) for BI-RADS assessment, abnormality detection, and malignancy classification tasks. GPT-5 consistently was the best performing model but lagged behind both human experts and domain-specific fine-tuned models. On EMBED, GPT-5 achieved the highest scores among GPT variants in density (56.8%), distortion (52.5%), mass (64.5%), calcification (63.5%), and malignancy (52.8%) classification. On InBreast, it attained 36.9% BI-RADS accuracy, 45.9% abnormality detection, and 35.0% malignancy classification. On CMMD, GPT-5 reached 32.3% abnormality detection and 55.0% malignancy accuracy. On CBIS-DDSM, it achieved 69.3% BI-RADS accuracy, 66.0% abnormality detection, and 58.2% malignancy accuracy. Compared with human expert estimations, GPT-5 exhibited lower sensitivity (63.5%) and specificity (52.3%). While GPT-5 exhibits promising capabilities for screening tasks, its performance remains insufficient for high-stakes clinical imaging applications without targeted domain adaptation and optimization.
However, the tremendous improvements in performance from GPT-4o to GPT-5 show a promising trend in the potential for general large language models (LLMs) to assist with mammography VQA tasks.

[138] Thyme: Think Beyond Images

Yi-Fan Zhang,Xingyu Lu,Shukang Yin,Chaoyou Fu,Wei Chen,Xiao Hu,Bin Wen,Kaiyu Jiang,Changyi Liu,Tianke Zhang,Haonan Fan,Kaibing Chen,Jiankang Chen,Haojie Ding,Kaiyu Tang,Zhang Zhang,Liang Wang,Fan Yang,Tingting Gao,Guorui Zhou

Main category: cs.CV

TL;DR: Thyme introduces a novel approach for MLLMs to autonomously generate and execute image processing and computational operations through code, significantly enhancing performance in perception and reasoning tasks.

Details Motivation: The motivation is to bridge the gap between open-source and proprietary models by enabling MLLMs to perform diverse image manipulations and logical reasoning through code, surpassing current "think with images" approaches. Method: Thyme uses a two-stage training strategy: initial SFT on a dataset of 500K samples to teach code generation, followed by RL with GRPO-ATS algorithm to refine decision-making, using high-resolution question-answer pairs. Result: Extensive experiments and ablation studies show that Thyme achieves consistent and significant performance improvements, particularly in high-resolution perception and complex reasoning tasks across nearly 20 benchmarks. Conclusion: Thyme offers a new paradigm for MLLMs to enhance perception and reasoning tasks through autonomous image manipulation and code execution, achieving significant performance gains across multiple benchmarks. Abstract: Following OpenAI's introduction of the "thinking with images" concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations.
We activate this capability through a two-stage training strategy: an initial SFT on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.