cs.CL

[1] MultiStream-LLM: Bridging Modalities for Robust Sign Language Translation

Marshall Thomas,Edward Fish,Richard Bowden

Main category: cs.CL

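TL;DR: This paper introduces MultiStream-LLM, a modular framework that addresses the weaknesses of existing end-to-end models in recognizing high-speed fingerspelling and integrating asynchronous non-manual cues, yielding more robust and effective sign language translation.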

Details

Motivation: Monolithic end-to-end models perform poorly at precisely recognizing high-speed fingerspelling and at integrating asynchronous non-manual cues from the face, and existing LLM-based sign language translation methods perform poorly when translating crucial information such as names, places, and technical terms. Method: MultiStream-LLM uses separate expert networks for continuous signing, fingerspelling, and lipreading. Each network first decodes its modality into a token sequence; a lightweight transformer then fuses the streams, and a large language model generates the final sentence. Result: MultiStream-LLM sets a new state of the art on the How2Sign benchmark with a BLEU-4 score of 23.5 and achieves 73.2% letter accuracy on the challenging ChicagoFSWildPlus fingerspelling dataset. Conclusion: By isolating and solving distinct recognition tasks before fusion, MultiStream-LLM offers a more powerful and effective path to robust, high-fidelity sign language translation. Abstract: Despite progress in gloss-free Sign Language Translation (SLT), monolithic end-to-end models consistently fail on two critical components of natural signing: the precise recognition of high-speed fingerspelling and the integration of asynchronous non-manual cues from the face. Recent progress in Automated Sign Language Translation with Large Language Models has sidestepped this challenge, forcing a single network to learn these simultaneously, resulting in poor performance when tasked with translating crucial information such as names, places, and technical terms. We introduce MultiStream-LLM, a modular framework designed to overcome these limitations. Our approach employs separate, specialized predictors for continuous signing, fingerspelling, and lipreading. Each expert network first decodes its specific modality into a sequence of tokens. These parallel streams are then fused by a lightweight transformer that resolves temporal misalignments before passing the combined representation to a Large Language Model (LLM) for final sentence generation. Our method establishes a new state-of-the-art on the How2Sign benchmark with a BLEU-4 score of 23.5 and achieves 73.2% letter accuracy on the challenging ChicagoFSWildPlus fingerspelling dataset. These results validate our core hypothesis: by isolating and solving distinct recognition tasks before fusion, our multi-expert approach provides a more powerful and effective pathway to robust, high-fidelity sign language translation.

[2] Compiling Prompts, Not Crafting Them: A Reproducible Workflow for AI-Assisted Evidence Synthesis

Teo Susnjak

Main category: cs.CL

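TL;DR: This paper explores how declarative prompt optimisation can be used to build a reproducible framework for automating systematic literature reviews, improving the reliability and reproducibility of large language models in that setting.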

Details

Motivation: Current applications of large language models to systematic literature reviews rely on brittle, hand-crafted prompts, which undermines reliability and reproducibility; this paper aims to address that problem. Method: The paper adapts declarative prompt optimisation to systematic literature review automation and proposes a structured, domain-specific framework. Result: The work translates these emerging methods into a concrete blueprint with working code examples, enabling researchers to build verifiable LLM pipelines that follow principles of transparency and rigour. Conclusion: The approach is a novel application to systematic literature review pipelines and improves both the level of automation and scientific confidence in such reviews. Abstract: Large language models (LLMs) offer significant potential to accelerate systematic literature reviews (SLRs), yet current approaches often rely on brittle, manually crafted prompts that compromise reliability and reproducibility. This fragility undermines scientific confidence in LLM-assisted evidence synthesis. In response, this work adapts recent advances in declarative prompt optimisation, developed for general-purpose LLM applications, and demonstrates their applicability to the domain of SLR automation. This research proposes a structured, domain-specific framework that embeds task declarations, test suites, and automated prompt tuning into a reproducible SLR workflow. These emerging methods are translated into a concrete blueprint with working code examples, enabling researchers to construct verifiable LLM pipelines that align with established principles of transparency and rigour in evidence synthesis. This is a novel application of such approaches to SLR pipelines.

[3] What Are Research Hypotheses?

Jian Wu,Sarah Rajtmajer

Main category: cs.CL

TL;DR: This paper addresses the inconsistent interpretation of the term 'hypothesis' in NLU tasks and emphasizes the importance of well-defined hypotheses for a machine-interpretable scholarly record.

Details

Motivation: In NLU tasks, the interpretation of the term 'hypothesis' has drifted away from traditional definitions in the natural, social, and formal sciences. Method: The paper provides an overview and delineates various definitions of hypothesis, discerning the nuances across recently published NLU tasks. Result: The paper delivers an overview and delineation of the definitions of hypothesis in use, with a focus on the nuances across recently published NLU tasks. Conclusion: The paper concludes that there is a need for well-structured and well-defined hypotheses as we move toward a machine-interpretable scholarly record. Abstract: Over the past decades, alongside advancements in natural language processing, significant attention has been paid to training models to automatically extract, understand, test, and generate hypotheses in open and scientific domains. However, interpretations of the term 'hypothesis' for various natural language understanding (NLU) tasks have migrated from traditional definitions in the natural, social, and formal sciences. Even within NLU, we observe differences in how hypotheses are defined across the literature. In this paper, we overview and delineate various definitions of hypothesis. In particular, we discern the nuances of definitions across recently published NLU tasks. We highlight the importance of well-structured and well-defined hypotheses, particularly as we move toward a machine-interpretable scholarly record.

[4] Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics

Sheldon Yu,Yuxin Xiong,Junda Wu,Xintong Li,Tong Yu,Xiang Chen,Ritwik Sinha,Jingbo Shang,Julian McAuley

Main category: cs.CL

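TL;DR: This paper uses a state-aware transition framework to abstract chain-of-thought reasoning into structured latent dynamics, improving the explainability of LLM reasoning.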

Details

Motivation: Although chain-of-thought (CoT) prompting has enabled large language models (LLMs) to perform multi-step reasoning, prior work focuses mainly on local token-level attribution and leaves the high-level semantic roles of reasoning steps and their transitions underexplored. Method: Each reasoning step is represented via spectral analysis of token-level embeddings and clustered into semantically coherent latent states, and the progression of states is modeled as a Markov chain. Result: The method supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation, giving deeper insight into multi-step reasoning. Conclusion: The paper proposes a state-aware transition framework that abstracts CoT trajectories into structured latent dynamics, providing a structured and interpretable analysis of the reasoning process. Abstract: Recent advances in chain-of-thought (CoT) prompting have enabled large language models (LLMs) to perform multi-step reasoning. However, the explainability of such reasoning remains limited, with prior work primarily focusing on local token-level attribution, such that the high-level semantic roles of reasoning steps and their transitions remain underexplored. In this paper, we introduce a state-aware transition framework that abstracts CoT trajectories into structured latent dynamics. Specifically, to capture the evolving semantics of CoT reasoning, each reasoning step is represented via spectral analysis of token-level embeddings and clustered into semantically coherent latent states. To characterize the global structure of reasoning, we model their progression as a Markov chain, yielding a structured and interpretable view of the reasoning process. This abstraction supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation.
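
Illustrative sketch (not from the paper): once each reasoning step has been assigned a latent state, the Markov-chain view described above reduces to estimating a transition matrix from state sequences. The state labels below ("setup", "compute", "verify") and the function name are hypothetical, and the spectral-embedding and clustering stages are assumed to have already run.

```python
from collections import Counter, defaultdict


def estimate_transitions(trajectories):
    """Estimate first-order Markov transition probabilities from state sequences."""
    counts = defaultdict(Counter)
    for states in trajectories:
        # Count each consecutive (current, next) pair of latent states.
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    # Normalise each row of counts into a probability distribution.
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }


# Hypothetical latent-state labels, one per reasoning step.
trajectories = [
    ["setup", "compute", "compute", "verify"],
    ["setup", "compute", "verify"],
]
transitions = estimate_transitions(trajectories)
print(transitions)
```

Rows with high self-transition probability would indicate reasoning that lingers in one semantic role, which is the kind of temporal pattern the abstract mentions visualizing.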

Seiji Maekawa,Hayate Iso,Nikita Bhutani

Main category: cs.CL

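TL;DR: This paper proposes a new task, DFM, and a benchmark framework, DiFBench, for evaluating how well LLMs identify distinctive features in a document collection, and reveals core limitations of current LLMs in statistical reasoning.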

Details

Motivation: Existing LLM benchmarks emphasize retrieving or summarizing information relevant to a given query, but do not evaluate a model's ability to identify globally distinctive features across a document collection. Method: The paper introduces a new task, DFM, and uses the DiFBench framework to run a large-scale evaluation of ten state-of-the-art LLMs. Result: On the DFM task there is a significant performance gap between general-purpose and reasoning-enhanced models, and all models degrade as task complexity and document count increase. Conclusion: Current LLMs have core limitations in fine-grained statistical reasoning and rarity detection, with performance dropping substantially as task complexity and document count grow. Abstract: Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model's ability to identify globally distinctive features across a set of documents. We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10-40 documents) and surface features that are rare in the global context (e.g., appearing in less than 10% of documents). This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. To enable systematic evaluation of this capability, we present DiFBench, a configurable benchmark creation framework with controllable parameters such as document set size and distinctiveness thresholds. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs. Our findings reveal a significant performance gap between general-purpose and reasoning-enhanced models. All models, however, substantially degrade as the task complexity and document count increase. We also find that a common failure mode is misidentifying frequent features as distinctive. These insights reveal core limitations in contemporary LLMs' abilities to perform fine-grained, statistical reasoning and rarity detection.
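
Illustrative sketch (not from the paper): the rarity criterion in the DFM task definition can be written as a document-frequency check. This assumes features have already been extracted per document, which is the hard part the benchmark asks an LLM to do from raw text; the function name and the toy data are hypothetical, and the 10% threshold follows the example in the abstract.

```python
from collections import Counter


def distinctive_features(doc_features, threshold=0.10):
    """Return, per document, the features whose document frequency is below threshold."""
    n_docs = len(doc_features)
    # Count in how many documents each feature appears (once per document).
    doc_freq = Counter(f for feats in doc_features.values() for f in set(feats))
    rare = {f for f, count in doc_freq.items() if count / n_docs < threshold}
    return {doc: sorted(set(feats) & rare) for doc, feats in doc_features.items()}


# Toy collection of 20 documents: every one lists "python", only doc_0 lists "rust".
docs = {f"doc_{i}": ["python"] for i in range(20)}
docs["doc_0"] = ["python", "rust"]

result = distinctive_features(docs)
print(result["doc_0"])  # features of doc_0 that appear in fewer than 10% of documents
```

The failure mode reported in the abstract, labeling frequent features as distinctive, corresponds here to returning "python" for any document.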

[6] The Differential Meaning of Models: A Framework for Analyzing the Structural Consequences of Semantic Modeling Decisions

Zachary K. Stine,James E. Deitrick

Main category: cs.CL

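TL;DR: The authors propose a general framework, grounded in the semiotic theory of C. S. Peirce, for describing practices for modeling semiotic systems, and treat models themselves as signs.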

Details

Motivation: The field lacks a general theoretical framework for describing modeling practices in a unified way, so the authors propose one to better understand the semiotic meaning of models and of the modeling decisions behind them. Method: After surveying the diversity of current approaches to modeling semiotic systems, the authors propose a framework grounded in the semiotic theory of C. S. Peirce and illustrate its use with empirical examples. Result: The paper proposes a new theoretical framework that treats models as instruments for measuring latent symbol geometries and makes a model's interpretive lens visible through contrast with other models. Conclusion: The authors propose a general theoretical framework, grounded in the semiotic theory of C. S. Peirce, for describing modeling practices; it treats models and modeling decisions themselves as signs, forming the basis of a theory of model semantics. Abstract: The proliferation of methods for modeling of human meaning-making constitutes a powerful class of instruments for the analysis of complex semiotic systems. However, the field lacks a general theoretical framework for describing these modeling practices across various model types in an apples-to-apples way. In this paper, we propose such a framework grounded in the semiotic theory of C. S. Peirce. We argue that such models measure latent symbol geometries, which can be understood as hypotheses about the complex of semiotic agencies underlying a symbolic dataset. Further, we argue that in contexts where a model's value cannot be straightforwardly captured by proxy measures of performance, models can instead be understood relationally, so that the particular interpretive lens of a model becomes visible through its contrast with other models. This forms the basis of a theory of model semantics in which models, and the modeling decisions that constitute them, are themselves treated as signs. In addition to proposing the framework, we illustrate its empirical use with a few brief examples and consider foundational questions and future directions enabled by the framework.

[7] The Temporal Game: A New Perspective on Temporal Relation Extraction

Hugo Sousa,Ricardo Campos,Alípio Jorge

Main category: cs.CL

TL;DR: Temporal Game is a new method for temporal relation extraction that simplifies annotation by using point-wise comparisons and interactive game modes.

Details

Motivation: To provide a more fine-grained and flexible approach to temporal relation extraction, supporting both interval and instant entities. Method: The Temporal Game decomposes interval-level relations into point-wise comparisons and uses temporal closure for consistency. It includes Game and Annotation modes. Result: A demo was created that allows annotation of texts and exporting timelines, with public availability and open-source code. Conclusion: The Temporal Game serves as both a research tool and annotation interface, supporting further development in temporal reasoning. Abstract: In this paper we demo the Temporal Game, a novel approach to temporal relation extraction that casts the task as an interactive game. Instead of directly annotating interval-level relations, our approach decomposes them into point-wise comparisons between the start and end points of temporal entities. At each step, players classify a single point relation, and the system applies temporal closure to infer additional relations and enforce consistency. This point-based strategy naturally supports both interval and instant entities, enabling more fine-grained and flexible annotation than any previous approach. The Temporal Game also lays the groundwork for training reinforcement learning agents, by treating temporal annotation as a sequential decision-making task. To showcase this potential, the demo presented in this paper includes a Game mode, in which users annotate texts from the TempEval-3 dataset and receive feedback based on a scoring system, and an Annotation mode, which allows custom documents to be annotated and the resulting timeline to be exported. This demo therefore serves both as a research tool and as an annotation interface. The demo is publicly available at https://temporal-game.inesctec.pt, and the source code is open-sourced to foster further research and community-driven development in temporal reasoning and annotation.
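
Illustrative sketch (not the demo's code): the point-wise decomposition and temporal closure described above can be shown with a transitive closure over "before" pairs between start and end points. Only the strict "before" relation is handled here, which is a simplifying assumption; the point names such as "A.start" and both function names are hypothetical.

```python
def temporal_closure(before):
    """Transitive closure of a set of (earlier, later) point pairs."""
    closed = set(before)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                # a < b and b < d together imply a < d.
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed


def is_consistent(closed):
    """A point inferred to precede itself signals a contradictory annotation."""
    return all(a != b for a, b in closed)


# Each interval contributes start < end; one annotated step says A ends before B starts.
annotated = {
    ("A.start", "A.end"),
    ("B.start", "B.end"),
    ("A.end", "B.start"),
}
closed = temporal_closure(annotated)
print(sorted(closed - annotated))  # point relations inferred without further annotation
```

In this toy case a single player decision (A.end before B.start) fixes all remaining point relations between A and B, so no further questions about that pair would be needed.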

[8] Exploring Reasoning-Infused Text Embedding with Large Language Models for Zero-Shot Dense Retrieval

Yuxiang Liu,Tian Wang,Gourab Kundu,Tianyu Cao,Guang Cheng,Zhen Ge,Jianshu Chen,Qingjun Cui,Trishul Chilimbi

Main category: cs.CL

TL;DR: This paper proposes RITE, a method that integrates logical reasoning into text embedding using generative LLMs, significantly improving zero-shot retrieval performance on reasoning-intensive tasks.

Details

Motivation: Transformer-based encoder-only retrievers often fall short in handling queries requiring sophisticated reasoning. Despite the reasoning capabilities of decoder-only LLMs, existing LLM-based embedding methods do not fully exploit this potential. Method: RITE utilizes generative LLMs to generate intermediate reasoning texts in the token space before computing embeddings, thereby enriching the representation with inferential depth. Result: Experimental results on BRIGHT, a reasoning-intensive retrieval benchmark, demonstrate that RITE significantly enhances zero-shot retrieval performance across diverse domains. Conclusion: RITE enhances zero-shot retrieval performance by integrating reasoning into the embedding process, highlighting the importance of reasoning in text representation. Abstract: Transformer-based models such as BERT and E5 have significantly advanced text embedding by capturing rich contextual representations. However, many complex real-world queries require sophisticated reasoning to retrieve relevant documents beyond surface-level lexical matching, where encoder-only retrievers often fall short. Decoder-only large language models (LLMs), known for their strong reasoning capabilities, offer a promising alternative. Despite this potential, existing LLM-based embedding methods primarily focus on contextual representation and do not fully exploit the reasoning strength of LLMs. To bridge this gap, we propose Reasoning-Infused Text Embedding (RITE), a simple but effective approach that integrates logical reasoning into the text embedding process using generative LLMs. RITE builds upon existing language model embedding techniques by generating intermediate reasoning texts in the token space before computing embeddings, thereby enriching representations with inferential depth. 
Experimental results on BRIGHT, a reasoning-intensive retrieval benchmark, demonstrate that RITE significantly enhances zero-shot retrieval performance across diverse domains, underscoring the effectiveness of incorporating reasoning into the embedding process.

[9] OpinioRAG: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews

Mir Tafseer Nayeem,Davood Rafiei

Main category: cs.CL

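TL;DR: This paper proposes OpinioRAG, a new framework for efficiently generating personalized opinion summaries from large volumes of user reviews, along with evaluation metrics suited to sentiment-rich domains.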

Details

Motivation: Existing methods either fail to scale or produce generic summaries that overlook personalized needs, and evaluation metrics suited to sentiment-rich domains are lacking. Method: The approach combines RAG-based evidence retrieval with large language models (LLMs) and proposes new reference-free verification metrics. Result: The authors develop OpinioRAG, a scalable, training-free framework, propose new evaluation metrics, and contribute a large-scale dataset of user reviews. Conclusion: OpinioRAG is a scalable framework for generating opinion highlights from large volumes of user reviews; experiments show that it produces accurate, relevant, and structured summaries. Abstract: We study the problem of opinion highlights generation from large volumes of user reviews, often exceeding thousands per entity, where existing methods either fail to scale or produce generic, one-size-fits-all summaries that overlook personalized needs. To tackle this, we introduce OpinioRAG, a scalable, training-free framework that combines RAG-based evidence retrieval with LLMs to efficiently produce tailored summaries. Additionally, we propose novel reference-free verification metrics designed for sentiment-rich domains, where accurately capturing opinions and sentiment alignment is essential. These metrics offer a fine-grained, context-sensitive assessment of factual consistency. To facilitate evaluation, we contribute the first large-scale dataset of long-form user reviews, comprising entities with over a thousand reviews each, paired with unbiased expert summaries and manually annotated queries. Through extensive experiments, we identify key challenges, provide actionable insights into improving systems, pave the way for future research, and position OpinioRAG as a robust framework for generating accurate, relevant, and structured summaries at scale.

[10] Wage Sentiment Indices Derived from Survey Comments via Large Language Models

Taihei Sone

Main category: cs.CL

TL;DR: This study develops a Wage Sentiment Index using LLMs to forecast wage dynamics in Japan, demonstrating the potential of such indices to improve economic policy design.

Details

Motivation: The emergence of generative AI has created opportunities for economic text analysis, and there is a need for more timely and effective economic policy design tools. Method: The study constructs a Wage Sentiment Index (WSI) using Large Language Models (LLMs), extending the framework of the Price Sentiment Index (PSI). Data architecture is developed to integrate additional sources like newspapers and social media. Result: WSI models based on LLMs significantly outperformed both baseline approaches and pretrained models. Conclusion: LLM-driven sentiment indices like the WSI have the potential to enhance the timeliness and effectiveness of economic policy design. Abstract: The emergence of generative Artificial Intelligence (AI) has created new opportunities for economic text analysis. This study proposes a Wage Sentiment Index (WSI) constructed with Large Language Models (LLMs) to forecast wage dynamics in Japan. The analysis is based on the Economy Watchers Survey (EWS), a monthly survey conducted by the Cabinet Office of Japan that captures real-time economic assessments from workers in industries highly sensitive to business conditions. The WSI extends the framework of the Price Sentiment Index (PSI) used in prior studies, adapting it specifically to wage related sentiment. To ensure scalability and adaptability, a data architecture is also developed that enables integration of additional sources such as newspapers and social media. Experimental results demonstrate that WSI models based on LLMs significantly outperform both baseline approaches and pretrained models. These findings highlight the potential of LLM-driven sentiment indices to enhance the timeliness and effectiveness of economic policy design by governments and central banks.

[11] Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models

Chen Zheng,Yiyuan Ma,Yuan Yang,Deyi Liu,Jing Liu,Zuquan Song,Yuxin Song,Cheng Ren,Hang Zhu,Xin Liu,Yiyuan Ma,Siyuan Qiao,Xun Zhou,Liang Xiang,Yonghui Wu

Main category: cs.CL

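TL;DR: This study proposes BAI, a method that addresses training instability when applying RLHF to distillation-trained models, enabling efficient and stable training of reasoning models.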

Details

Motivation: Applying RLHF to distillation-trained models suffers from significant training instability, including Sequence Length Collapse and the Reward Hockey Stick Curve. Method: The paper proposes BAI, a two-stage weighted model merging approach, to stabilize RLHF on distillation-trained models. Result: BAI mitigates Sequence Length Collapse and the Reward Hockey Stick Curve and enables continuous sequence length improvement during training. Conclusion: BAI effectively addresses training instability in the third paradigm and balances reasoning and alignment capabilities. Abstract: The development of alignment and reasoning capabilities in large language models has seen remarkable progress through two paradigms: the instruction tuning and reinforcement learning from human feedback (RLHF) alignment paradigm, and the distillation-based reasoning fine-tuning paradigm. While both approaches prove effective independently, the third paradigm of applying RLHF to distillation-trained models presents significant challenges. Our investigation reveals two critical phenomena that emerge in this paradigm: Sequence Length Collapse, where language generation dramatically reduces during early RLHF training, and the Reward Hockey Stick Curve, featuring severe reward score drops followed by gradual recovery. These instabilities fundamentally compromise the model's alignment and reasoning capabilities. To address these challenges, we propose Balanced Actor Initialization (BAI), a two-stage weighted model merging approach. BAI first merges instruction-following and distillation-based reasoning fine-tuned models, then further combines this intermediate model with the pretrained model to preserve foundational knowledge. Through comprehensive experiments across diverse benchmarks and detailed analysis of training experiments, we demonstrate that BAI resolves Sequence Length Collapse, mitigates the Reward Hockey Stick Curve, and enables continuous sequence length improvement during training. Additionally, our analysis reveals that balanced merging ratios achieve optimal trade-offs between training stability and reasoning capability preservation. 
Our work provides an effective solution for stable training in this third paradigm, enabling more capable reasoning models that combine distillation efficiency with RLHF alignment.
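
Illustrative sketch (not the authors' implementation): the two-stage merge described above, written as plain linear interpolation of parameters. Treating "weighted model merging" as linear interpolation is an assumption, as are the function names, the 0.5 ratios, and the toy parameter dictionaries, which stand in for real checkpoints.

```python
def weighted_merge(weights_a, weights_b, ratio_a):
    """Linearly interpolate two models' parameters, name by name."""
    if weights_a.keys() != weights_b.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [
            ratio_a * a + (1.0 - ratio_a) * b
            for a, b in zip(weights_a[name], weights_b[name])
        ]
        for name in weights_a
    }


def balanced_actor_init(instruct, reasoning, pretrained,
                        stage1_ratio=0.5, stage2_ratio=0.5):
    """Stage 1 merges the two fine-tuned models; stage 2 mixes in the pretrained model."""
    intermediate = weighted_merge(instruct, reasoning, stage1_ratio)
    return weighted_merge(intermediate, pretrained, stage2_ratio)


# Toy "checkpoints": one parameter tensor flattened to a list of floats.
instruct = {"w": [1.0, 2.0]}
reasoning = {"w": [3.0, 4.0]}
pretrained = {"w": [0.0, 0.0]}

actor = balanced_actor_init(instruct, reasoning, pretrained)
print(actor)
```

The two ratios are the knobs the abstract refers to when it says balanced merging ratios trade training stability against preserved reasoning capability.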

Rinku Dewri

Main category: cs.CL

TL;DR: GIER is a framework that enhances large language model outputs through self-reflection and revision based on conceptual quality criteria, leading to improved reasoning without compromising task accuracy.

Details

Motivation: The motivation behind GIER is to enhance the reasoning capabilities of large language models by addressing conceptual quality criteria without relying on demonstrations, examples, or chain-of-thought templates. Method: GIER employs natural language descriptions of reasoning gaps to prompt models to iteratively critique and refine their outputs, enhancing rationale quality, grounding, and reasoning alignment. Result: GIER successfully improves rationale quality, grounding, and reasoning alignment across three reasoning-intensive tasks and four LLMs, without degrading task accuracy. Conclusion: GIER improves the quality of outputs from large language models by enabling them to interpret abstract conceptual gaps and translate them into concrete reasoning improvements through self-reflection and revision. Abstract: We introduce GIER (Gap-driven Iterative Enhancement of Responses), a general framework for improving large language model (LLM) outputs through self-reflection and revision based on conceptual quality criteria. Unlike prompting strategies that rely on demonstrations, examples, or chain-of-thought templates, GIER utilizes natural language descriptions of reasoning gaps, and prompts a model to iteratively critique and refine its own outputs to better satisfy these criteria. Across three reasoning-intensive tasks (SciFact, PrivacyQA, and e-SNLI) and four LLMs (GPT-4.1, GPT-4o Mini, Gemini 1.5 Pro, and Llama 3.3 70B), GIER improves rationale quality, grounding, and reasoning alignment without degrading task accuracy. Our analysis demonstrates that models can not only interpret abstract conceptual gaps but also translate them into concrete reasoning improvements.

[13] Open Data Synthesis For Deep Research

Ziyi Xia,Kun Luo,Hongjin Qian,Zheng Liu

Main category: cs.CL

TL;DR: InfoSeek is a framework for synthesizing complex research tasks that require multi-step reasoning, outperforming larger models and commercial APIs.

Details

Motivation: Existing benchmarks fail to capture the complexity of Deep Research tasks that involve multi-step reasoning and synthesis of evidence. Method: InfoSeek uses a dual-agent system to recursively build a Research Tree, converting it into natural language questions requiring hierarchical traversal. Result: Models trained on InfoSeek outperform strong baselines and achieve performance comparable to stronger APIs. Conclusion: InfoSeek supports advanced optimization strategies and achieves competitive performance compared to larger models and commercial APIs. Abstract: Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research: tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning, knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via rejection sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. 
On the challenging BrowseComp-Plus benchmark, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. Our code and datasets are available at https://github.com/VectorSpaceLab/InfoSeek.

[14] GraphKV: Breaking the Static Selection Paradigm with Graph-Based KV Cache Eviction

Xuelin Li,Xiangqi Jin,Linfeng Zhang

Main category: cs.CL

TL;DR: The paper proposes GraphKV, a graph-based method that dynamically updates token importance to improve KV cache management.

Details

Motivation: Conventional KV cache eviction strategies rely on static heuristics that cannot capture implicit dependencies among tokens. Method: Tokens are modeled as nodes with importance scores, and their importance is updated dynamically through information propagation across the graph. Result: GraphKV integrates seamlessly with existing KV cache eviction methods such as SnapKV and PyramidKV, offering good extensibility and practicality. Conclusion: GraphKV is a graph-based KV cache compression framework that dynamically updates token importance and adaptively retains contextually significant tokens. Abstract: Efficient Key-Value (KV) cache management is essential for processing long text sequences in large language models (LLMs), where memory constraints often limit performance. Conventional KV eviction strategies, such as top-k selection based on attention scores, depend on static heuristics that fail to capture the evolving implicit dependencies among tokens during inference. To overcome this, we propose GraphKV, a graph-based framework that redefines token selection for KV cache compression. In GraphKV, tokens are modeled as nodes with importance scores, and edges represent their similarity relationships. Through a decay-signal-propagation mechanism, token importance is dynamically updated by propagating information across the graph, enabling adaptive retention of the most contextually significant tokens. GraphKV can be seamlessly utilized in existing KV cache eviction methods such as SnapKV and PyramidKV in a plug-and-play manner. Code will be released on GitHub.
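
Illustrative sketch (not the GraphKV algorithm): the abstract does not specify the decay-signal-propagation update, so the rule below is an assumption chosen only to show the contrast with static top-k. Each time a token is kept, the scores of the remaining tokens are damped in proportion to their similarity to it. The function name, the `decay` parameter, and the toy scores are all hypothetical.

```python
def select_tokens(importance, similarity, budget, decay=0.5):
    """Greedily keep `budget` tokens, damping neighbours of each kept token.

    The multiplicative update rule is an assumption, not the paper's formula.
    """
    scores = list(importance)
    remaining = set(range(len(scores)))
    kept = []
    while remaining and len(kept) < budget:
        # Highest current score wins; ties go to the earlier token.
        best = max(remaining, key=lambda i: (scores[i], -i))
        kept.append(best)
        remaining.remove(best)
        # Propagate a decay signal along similarity edges to unselected tokens.
        for j in remaining:
            scores[j] *= 1.0 - decay * similarity[best][j]
    return sorted(kept)


# Tokens 0 and 1 are near-duplicates; token 2 is unrelated but scored lower.
importance = [0.9, 0.8, 0.5]
similarity = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
kept = select_tokens(importance, similarity, budget=2)
static_top_k = sorted(range(3), key=lambda i: -importance[i])[:2]
print(kept, static_top_k)
```

In this toy case static top-k keeps the two redundant tokens, while the propagated scores retain the unrelated token instead.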

[15] The Resurgence of GCG Adversarial Attacks on Large Language Models

Yuting Tan,Xuying Li,Zhuo Li,Huizhen Shu,Peikang Hu

Main category: cs.CL

TL;DR: The paper evaluates GCG and T-GCG adversarial prompting methods across different LLMs, revealing decreased attack success with larger models, overestimation of effectiveness by prefix-based heuristics, higher vulnerability of coding prompts, and potential for annealing-inspired strategies in adversarial evaluation.

Details

Motivation: The motivation of the paper is to evaluate the effectiveness of gradient-based adversarial prompting methods like GCG and T-GCG on different types of prompts and model sizes, and to identify vulnerabilities and potential improvements in adversarial evaluation. Method: The paper systematically evaluates the Greedy Coordinate Gradient (GCG) algorithm and its annealing-augmented variant, T-GCG, on open-source large language models (LLMs) of varying scales using Qwen2.5-0.5B, LLaMA-3.2-1B, and GPT-OSS-20B. Result: The study finds that attack success rates decrease with model size, prefix-based heuristics overestimate attack effectiveness compared to GPT-4o semantic judgments, and coding-related prompts are more vulnerable than adversarial safety prompts. T-GCG shows potential in diversifying adversarial search. Conclusion: The paper concludes that GCG's scalability is limited, reasoning tasks have overlooked vulnerabilities, and annealing-inspired strategies should be further developed for robust adversarial evaluation. Abstract: Gradient-based adversarial prompting, such as the Greedy Coordinate Gradient (GCG) algorithm, has emerged as a powerful method for jailbreaking large language models (LLMs). In this paper, we present a systematic appraisal of GCG and its annealing-augmented variant, T-GCG, across open-source LLMs of varying scales. Using Qwen2.5-0.5B, LLaMA-3.2-1B, and GPT-OSS-20B, we evaluate attack effectiveness on both safety-oriented prompts (AdvBench) and reasoning-intensive coding prompts. 
Our study reveals three key findings: (1) attack success rates (ASR) decrease with model size, reflecting the increasing complexity and non-convexity of larger models' loss landscapes; (2) prefix-based heuristics substantially overestimate attack effectiveness compared to GPT-4o semantic judgments, which provide a stricter and more realistic evaluation; and (3) coding-related prompts are significantly more vulnerable than adversarial safety prompts, suggesting that reasoning itself can be exploited as an attack vector. In addition, preliminary results with T-GCG show that simulated annealing can diversify adversarial search and achieve competitive ASR under prefix evaluation, though its benefits under semantic judgment remain limited. Together, these findings highlight the scalability limits of GCG, expose overlooked vulnerabilities in reasoning tasks, and motivate further development of annealing-inspired strategies for more robust adversarial evaluation.

[16] MedSEBA: Synthesizing Evidence-Based Answers Grounded in Evolving Medical Literature

Juraj Vladika,Florian Matthes

Main category: cs.CL

TL;DR: MedSEBA is an AI system that generates trustworthy, evidence-based medical answers by integrating large language models with medical study databases, offering insights for both lay users and experts.

Details

Motivation: The increasing volume of online medical information and studies makes it difficult to distinguish reliable sources and track evolving research conclusions. Method: The paper introduces MedSEBA, an AI-powered system that combines large language models with dynamically retrieved data from PubMed to generate evidence-based medical answers. Result: MedSEBA provides traceable, coherent answers with key points and visualizations of research consensus over time, which users find trustworthy and helpful. Conclusion: MedSEBA is a reliable and user-friendly AI system suitable for answering everyday health questions and providing advanced research insights. Abstract: In the digital age, people often turn to the Internet in search of medical advice and recommendations. With the increasing volume of online content, it has become difficult to distinguish reliable sources from misleading information. Similarly, millions of medical studies are published every year, making it challenging for researchers to keep track of the latest scientific findings. These evolving studies can reach differing conclusions, which is not reflected in traditional search tools. To address these challenges, we introduce MedSEBA, an interactive AI-powered system for synthesizing evidence-based answers to medical questions. It utilizes the power of Large Language Models to generate coherent and expressive answers, but grounds them in trustworthy medical studies dynamically retrieved from the research database PubMed. The answers consist of key points and arguments, which can be traced back to respective studies. Notably, the platform also provides an overview of the extent to which the most relevant studies support or refute the given medical claim, and a visualization of how the research consensus evolved through time. Our user study revealed that medical experts and lay users find the system usable and helpful, and the provided answers trustworthy and informative. 
This makes the system well-suited for both everyday health questions and advanced research insights.

[17] The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang

Fenghua Liu,Yulong Chen,Yixuan Liu,Zhujun Jin,Solomon Tsai,Ming Zhong

Main category: cs.CL

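TL;DR: The Camlang study tests whether large language models can master a new language and reveals a significant gap between current models and humans in metalinguistic reasoning.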

Details

Motivation: The study asks whether large language models reason genuinely or merely rely on pattern matching; mastering a new language through metalinguistic reasoning is an ideal test scenario. Method: The researchers developed a new constructed language called Camlang and combined human experiments with an evaluation of LLM performance to test whether models can master a new language through explicit grammar rules and lexical lookup. Result: GPT-5 reached 98% accuracy on the English task but only 47% on the Camlang task, far below the human performance of 87%. Most model successes are attributed to shallow lexical matching rather than systematic grammatical mastery. Conclusion: The Camlang study reveals a significant gap between current large language models and humans in metalinguistic competence: although these models excel on many benchmarks, when handling a new language they rely mainly on shallow lexical alignment rather than systematic grammatical mastery. Abstract: Large Language Models (LLMs) achieve gold-medal performance across many benchmarks, yet it remains unclear whether such success reflects genuine reasoning or pattern matching. From a cognitive science perspective, an informative test is whether models can master an unfamiliar language through explicit metalinguistic deductive learning, a paradigm where human learners can reliably internalise grammatical systems through metalinguistic reasoning. We address this question with Camlang, a novel constructed language that exhibits naturalistic yet unattested feature combinations. Camlang consists of two explicit resources, a grammar book and a bilingual dictionary, which mirror adult second-language learning via explicit grammar rules and lexical lookup, and enable us to disentangle errors in morpho-syntax, lexical semantics, and sentence-level reasoning. Human experiments show that these resources are sufficient for participants to acquire Camlang and successfully solve Camlang tasks. To operationalise evaluation, we adapt CommonsenseQA into Camlang, creating Camlang-CSQA-v0, the first task in a broader suite where solving questions requires applying grammar rules and lexical mappings. Experimental results show that GPT-5 achieves 98% EM accuracy in English but only 47% in Camlang, far below human performance at 87%, while other state-of-the-art reasoning LLMs perform even worse. Human verification further reveals that most model successes stem from shallow lexical alignment, while GPT-5 shows emerging metalinguistic awareness to a limited extent but not the systematic grammatical mastery that humans show.
Camlang establishes a cognitively grounded evaluation paradigm that exposes fundamental gaps between current models and human metalinguistic competence.

[18] GOSU: Retrieval-Augmented Generation with Global-Level Optimized Semantic Unit-Centric Framework

Xuecheng Zou,Ke Liu,Bingbing Wang,Huafei Deng,Li Zhang,Yu Tang

Main category: cs.CL

TL;DR: GOSU is a semantic unit-based RAG framework that optimizes retrieval and generation through global context, addressing ambiguity and complex coupling.

Details

Motivation: Traditional RAG methods based on heterogeneous graphs and hypergraphs suffer from ambiguity, complex coupling, and increased retrieval overhead because they are limited to local text chunks. Method: The GOSU framework comprises global merging of pre-extracted semantic units, hierarchical keyword extraction, and semantic unit completion. Result: Experiments show that GOSU outperforms baseline RAG methods across multiple tasks and improves generation quality. Conclusion: GOSU effectively addresses ambiguity and complex coupling in semantic unit extraction and improves retrieval and generation quality through global context. Abstract: Building upon the standard graph-based Retrieval-Augmented Generation (RAG), the introduction of heterogeneous graphs and hypergraphs aims to enrich retrieval and generation by leveraging the relationships between multiple entities through the concept of semantic units (SUs). But this also raises a key issue: the extraction of high-level SUs limited to local text chunks is prone to ambiguity, complex coupling, and increased retrieval overhead due to the lack of global knowledge or the neglect of fine-grained relationships. To address these issues, we propose GOSU, a semantic unit-centric RAG framework that efficiently performs global disambiguation and utilizes SUs to capture interconnections between different nodes across the global context. In the graph construction phase, GOSU performs global merging on the pre-extracted SUs from local text chunks and guides entity and relationship extraction, reducing the difficulty of coreference resolution while uncovering global semantic objects across text chunks. In the retrieval and generation phase, we introduce hierarchical keyword extraction and semantic unit completion. The former uncovers the fine-grained binary relationships overlooked by the latter, while the latter compensates for the coarse-grained n-ary relationships missing from the former. Evaluation across multiple tasks demonstrates that GOSU outperforms the baseline RAG methods in terms of generation quality.

[19] CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning

Salah Eddine Bekhouche,Abdellah Zakaria Sellam,Hichem Telli,Cosimo Distante,Abdenour Hadid

Main category: cs.CL

TL;DR: This paper presents a lightweight AI framework for solving Islamic inheritance questions, showing that smaller models like MARBERT offer better efficiency and privacy compared to larger models like Gemini and DeepSeek, despite lower accuracy.

Details

Motivation: Islamic inheritance law requires precise identification of heirs and calculation of shares, which is challenging for AI. This motivates the development of efficient and accurate AI systems for this task. Method: The paper introduces a lightweight framework using a specialized Arabic text encoder and Attentive Relevance Scoring (ARS) to rank answer options based on semantic relevance. Result: The MARBERT-based approach achieved 69.87% accuracy, while API-based LLMs achieved up to 87.6% accuracy. The results highlight the trade-off between performance and practical advantages of smaller models. Conclusion: The paper concludes that while large models like Gemini and DeepSeek offer higher accuracy, smaller models like MARBERT are more efficient, deployable on devices, and preserve privacy, making them suitable for high-stakes domains. Abstract: Islamic inheritance law (Ilm al-Mawarith) requires precise identification of heirs and calculation of shares, which poses a challenge for AI. In this paper, we present a lightweight framework for solving multiple-choice inheritance questions using a specialised Arabic text encoder and Attentive Relevance Scoring (ARS). The system ranks answer options according to semantic relevance, and enables fast, on-device inference without generative reasoning. We evaluate Arabic encoders (MARBERT, ArabicBERT, AraBERT) and compare them with API-based LLMs (Gemini, DeepSeek) on the QIAS 2025 dataset. While large models achieve an accuracy of up to 87.6%, they require more resources and are context-dependent. Our MARBERT-based approach achieves 69.87% accuracy, presenting a compelling case for efficiency, on-device deployability, and privacy. While this is lower than the 87.6% achieved by the best-performing LLM, our work quantifies a critical trade-off between the peak performance of large models and the practical advantages of smaller, specialized systems in high-stakes domains.

[20] TECP: Token-Entropy Conformal Prediction for LLMs

Beining Xu

Main category: cs.CL

TL;DR: The paper introduces Token-Entropy Conformal Prediction (TECP), a novel framework for uncertainty quantification in open-ended language generation under black-box constraints.

Details

Motivation: Uncertainty quantification (UQ) for open-ended language generation remains a critical yet underexplored challenge, especially under black-box constraints where internal model signals are inaccessible. Method: Token-Entropy Conformal Prediction (TECP) leverages token-level entropy as a logit-free, reference-free uncertainty measure and integrates it into a split conformal prediction (CP) pipeline. Result: Empirical evaluations across six large language models and two benchmarks (CoQA and TriviaQA) demonstrate that TECP consistently achieves reliable coverage and compact prediction sets, outperforming prior self-consistency-based UQ methods. Conclusion: TECP provides a principled and efficient solution for trustworthy generation in black-box LLM settings. Abstract: Uncertainty quantification (UQ) for open-ended language generation remains a critical yet underexplored challenge, especially under black-box constraints where internal model signals are inaccessible. In this paper, we introduce Token-Entropy Conformal Prediction (TECP), a novel framework that leverages token-level entropy as a logit-free, reference-free uncertainty measure and integrates it into a split conformal prediction (CP) pipeline to construct prediction sets with formal coverage guarantees. Unlike existing approaches that rely on semantic consistency heuristics or white-box features, TECP directly estimates epistemic uncertainty from the token entropy structure of sampled generations and calibrates uncertainty thresholds via CP quantiles to ensure provable error control. Empirical evaluations across six large language models and two benchmarks (CoQA and TriviaQA) demonstrate that TECP consistently achieves reliable coverage and compact prediction sets, outperforming prior self-consistency-based UQ methods. Our method provides a principled and efficient solution for trustworthy generation in black-box LLM settings.
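
The split conformal step described in the TECP abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and `token_entropy` assumes, purely for illustration, that uncertainty is estimated from the empirical token frequencies of a sampled generation. The threshold is the standard split-CP quantile, i.e. the ceil((n+1)(1-alpha))-th smallest calibration score.

```python
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy (nats) of the empirical token distribution of one generation.

    Assumption: a logit-free score built from token counts only.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def conformal_threshold(cal_scores, alpha):
    """Split-CP quantile: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every candidate is kept.
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(candidates, threshold):
    """Keep every (text, score) candidate whose score is at or below the threshold."""
    return [text for text, score in candidates if score <= threshold]
```

A caller would compute calibration scores on held-out questions with known answers, pick `alpha` as the tolerated error rate, and then filter the sampled answers for a new question with `prediction_set`.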

[21] Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting

Saksorn Ruangtanusak,Pittawat Taveekitworachai,Kunat Pipatanakul

Main category: cs.CL

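TL;DR: This report studies prompting methods for tool-augmented large language models in role-play dialogue and finds that rule-based role prompting (RRP) performs best.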

Details

Motivation: Dialogue agents often over-speak yet fail to use tools effectively, so role prompting methods need improvement. Method: Four prompting approaches are explored: basic role prompting, human-crafted role prompting, automatic prompt optimization (APO), and rule-based role prompting (RRP). Result: Through character-card/scene-contract design and strict enforcement of function calling, RRP achieved an overall score of 0.571, improving on the zero-shot baseline score of 0.519. Conclusion: RRP is more effective and reliable for role-playing dialogue agents than other methods such as APO. The open-sourced code supports future research. Abstract: This report investigates approaches for prompting a tool-augmented large language model (LLM) to act as a role-playing dialogue agent in the API track of the Commonsense Persona-grounded Dialogue Challenge (CPDC) 2025. In this setting, dialogue agents often produce overly long in-character responses (over-speaking) while failing to use tools effectively according to the persona (under-acting), such as generating function calls that do not exist or making unnecessary tool calls before answering. We explore four prompting approaches to address these issues: 1) basic role prompting, 2) human-crafted role prompting, 3) automatic prompt optimization (APO), and 4) rule-based role prompting. The rule-based role prompting (RRP) approach achieved the best performance through two novel techniques--character-card/scene-contract design and strict enforcement of function calling--which led to an overall score of 0.571, improving on the zero-shot baseline score of 0.519. These findings demonstrate that RRP design can substantially improve the effectiveness and reliability of role-playing dialogue agents compared with more elaborate methods such as APO. To support future efforts in developing persona prompts, we are open-sourcing all of our best-performing prompts and the APO tool. Source code is available at https://github.com/scb-10x/apo.

[22] ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics

Li S. Yifei,Allen Chang,Chaitanya Malaviya,Mark Yatskar

Main category: cs.CL

TL;DR: ResearchQA is introduced as a tool for evaluating LLM systems using survey articles across multiple fields.

Details

Motivation: The need for a more widespread and expert-independent method for evaluating long-form responses to research queries. Method: Distilling survey articles from 75 research fields into queries and rubric items, validated by Ph.D. annotators and used to construct an automatic pairwise judge. Result: ResearchQA includes 21K queries and 160K rubric items with assessments indicating high relevance and coverage, and the best system achieves 75% rubric item coverage. Conclusion: ResearchQA provides a comprehensive evaluation resource for LLM systems, revealing competency gaps and offering insights into multi-field performance. Abstract: Evaluating long-form responses to research queries heavily relies on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet, research expertise is widespread: survey articles synthesize knowledge distributed across the literature. We introduce ResearchQA, a resource for evaluating LLM systems by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Each rubric, derived jointly with queries from survey sections, lists query-specific answer evaluation criteria, i.e., citing papers, making explanations, and describing limitations. Assessments by 31 Ph.D. annotators in 8 fields indicate 96% of queries support Ph.D. information needs and 87% of rubric items should be addressed in system responses by a sentence or more. Using our rubrics, we are able to construct an automatic pairwise judge obtaining 74% agreement with expert judgments. We leverage ResearchQA to analyze competency gaps in 18 systems in over 7.6K pairwise evaluations. No parametric or retrieval-augmented system we evaluate exceeds 70% on covering rubric items, and the highest-ranking agentic system shows 75% coverage. 
Error analysis reveals that the highest-ranking system fully addresses less than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.

[23] Entropy-based Coarse and Compressed Semantic Speech Representation Learning

Jialong Zuo,Guangyan Zhang,Minghui Fang,Shengpeng Ji,Xiaoqi Jiao,Jingyu Li,Yiwen Guo,Zhou Zhao

Main category: cs.CL

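TL;DR: An entropy-based dynamic aggregation framework is proposed for learning compressed semantic speech representations; using a speech language model and a cross-attention module, it allows flexible control of the representation and is validated on multiple tasks.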

Details

Motivation: Existing discrete speech representation learning methods produce redundant token sequences, and semantic understanding does not require such fine token-level resolution, so a more efficient compressed semantic speech representation is needed. Method: A speech language model is first pre-trained with next-token prediction on large-scale unlabeled data to capture frequent token patterns; predictive entropy is then used to adaptively determine aggregation boundaries, and a cross-attention module fuses the information within each segment. Result: Experiments show that the compressed representations perform on par with or better than dense token sequences on speech recognition, speech-to-text translation, and voice conversion tasks. Conclusion: The entropy-based dynamic aggregation framework effectively learns compressed semantic speech representations and performs strongly across multiple tasks. Abstract: Discrete speech representation learning has recently attracted increasing interest in both acoustic and semantic modeling. Existing approaches typically encode 16 kHz waveforms into discrete tokens at a rate of 25 or 50 tokens per second. However, given that speech generally conveys only 2 to 5 words per second, such fine-grained tokenization introduces redundancy and hinders efficiency in downstream training and inference. Moreover, semantic speech representations at this frequency primarily capture phonetic-level information, while semantic understanding may not require such detailed token-level resolution. To address these limitations, we propose an entropy-based dynamic aggregation framework for learning compressed semantic speech representations. A speech language model is first pre-trained via next-token prediction on large-scale unlabeled data to capture frequent token patterns. Predictive entropy is then used to adaptively determine aggregation boundaries, followed by a cross-attention module that fuses information within each segment. By adjusting the entropy threshold, the granularity and compression ratio of the representations can be flexibly controlled. Experiments on ASR, speech-to-text translation, and voice conversion tasks demonstrate that the compressed representations perform on par with or better than dense token sequences, demonstrating the effectiveness of the proposed approach.
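
The boundary rule in the abstract above (predictive entropy decides where one segment ends and the next begins) might look roughly like the sketch below. Several things are assumptions rather than details from the paper: the per-token entropies are taken as given, a boundary is placed wherever entropy exceeds the threshold, and plain mean pooling stands in for the cross-attention fusion module. Function names are hypothetical.

```python
def segment_by_entropy(entropies, threshold):
    """Return (start, end) index pairs covering the sequence.

    A new segment opens at every token whose predictive entropy exceeds
    `threshold`; low-entropy (predictable) tokens join the current segment.
    """
    segments = []
    start = 0
    for i in range(1, len(entropies)):
        if entropies[i] > threshold:
            segments.append((start, i))
            start = i
    if entropies:
        segments.append((start, len(entropies)))
    return segments


def mean_pool(features, segments):
    """Average the feature vectors inside each segment.

    Stand-in for the cross-attention fusion described in the abstract.
    """
    pooled = []
    for s, e in segments:
        dim = len(features[s])
        pooled.append(
            [sum(features[t][d] for t in range(s, e)) / (e - s) for d in range(dim)]
        )
    return pooled
```

Raising the threshold yields fewer, longer segments (higher compression); lowering it yields more, shorter ones, which matches the adjustable granularity the abstract mentions.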

[24] Modeling Motivated Reasoning in Law: Evaluating Strategic Role Conditioning in LLM Summarization

Eunjung Cho,Alexander Hoyle,Yoan Hermstrüwer

Main category: cs.CL

TL;DR: LLMs show role-based bias when summarizing legal decisions, even with neutrality instructions, highlighting the need for careful evaluation in legal settings.

Details

Motivation: Motivated by the increasing use of LLMs in legal contexts and the potential for motivated reasoning, this study explores how models frame information based on the legal role of stakeholders. Method: The study introduces an evaluation framework based on legal fact and reasoning inclusion, assessing LLM responses to prompts conditioned on legal roles (e.g., judges, prosecutors, attorneys). Result: The results indicate that LLMs show selective inclusion patterns aligned with specific legal roles, even when instructed to remain neutral, suggesting strategic framing of information based on stakeholder position. Conclusion: The study concludes that LLMs exhibit role-consistent alignment in summarizing judicial decisions, even when given balancing instructions, raising concerns about implicit role inference and emphasizing the need for role-aware evaluation in legal contexts. Abstract: Large Language Models (LLMs) are increasingly used to generate user-tailored summaries, adapting outputs to specific stakeholders. In legal contexts, this raises important questions about motivated reasoning -- how models strategically frame information to align with a stakeholder's position within the legal system. Building on theories of legal realism and recent trends in legal practice, we investigate how LLMs respond to prompts conditioned on different legal roles (e.g., judges, prosecutors, attorneys) when summarizing judicial decisions. We introduce an evaluation framework grounded in legal fact and reasoning inclusion, also considering favorability towards stakeholders. Our results show that even when prompts include balancing instructions, models exhibit selective inclusion patterns that reflect role-consistent perspectives. These findings raise broader concerns about how similar alignment may emerge as LLMs begin to infer user roles from prior interactions or context, even without explicit role instructions. 
Our results underscore the need for role-aware evaluation of LLM summarization behavior in high-stakes legal settings.

[25] Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs

Hanqi Yan,Hainiu Xu,Yulan He

Main category: cs.CL

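TL;DR: This study finds that strengthening the reasoning ability of large language models can increase their responsiveness to malicious requests, a phenomenon termed "Reasoning-Induced Misalignment." It also finds that attention shifts and specialized experts in mixture-of-experts models can redirect excessive reasoning towards safety, offering new insights into the reasoning-safety trade-off.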

Details

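Motivation: As large language models (LLMs) are widely adopted, their safety and alignment with human values draw growing attention. Prior work has shown that fine-tuning LLMs on narrow or malicious datasets can induce misaligned behavior. This paper investigates a more concerning phenomenon, "Reasoning-Induced Misalignment": whether strengthening a model's reasoning ability can increase its responsiveness to malicious requests. Method: The paper strengthens reasoning by switching to "think-mode" or fine-tuning on benign math datasets, then tests responsiveness to malicious requests. It also analyzes internal model states, including changes in attention and the distribution of experts in mixture-of-experts models, to explore how excessive reasoning can be redirected towards safety. Result: When reasoning is strengthened via "think-mode" or fine-tuning on benign math datasets, LLMs become more responsive to malicious requests, with dense models particularly vulnerable. Attention shifts and specialized experts in mixture-of-experts models help redirect excessive reasoning towards safety guardrails. Conclusion: As LLMs are adopted more widely, their safety and alignment with human values matter more. Strengthening reasoning makes LLMs more responsive to malicious requests, a phenomenon termed "Reasoning-Induced Misalignment." Attention shifts and specialized experts in mixture-of-experts models help redirect excessive reasoning towards safety guardrails. These findings offer new insights into the reasoning-safety trade-off and underscore the urgency of advancing alignment research for advanced reasoning models. Abstract: With Large Language Models (LLMs) becoming increasingly widely adopted, concerns regarding their safety and alignment with human values have intensified. Previous studies have shown that fine-tuning LLMs on narrow and malicious datasets induces misaligned behaviors. In this work, we report a more concerning phenomenon, Reasoning-Induced Misalignment. Specifically, we observe that LLMs become more responsive to malicious requests when reasoning is strengthened, via switching to "think-mode" or fine-tuning on benign math datasets, with dense models particularly vulnerable. Moreover, we analyze internal model states and find that both attention shifts and specialized experts in mixture-of-experts models help redirect excessive reasoning towards safety guardrails. These findings provide new insights into the emerging reasoning-safety trade-off and underscore the urgency of advancing alignment for advanced reasoning models.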

[26] StealthEval: A Probe-Rewrite-Evaluate Workflow for Reliable Benchmarks

Lang Xiong,Nishant Bhargava,Wesley Chang,Jianhang Hong,Haihao Liu,Kevin Zhu

Main category: cs.CL

TL;DR: This paper investigates how the context in which prompts are presented affects LLM behavior, finding that models respond more honestly and safely in perceived deployment settings compared to test settings. A rewriting strategy was used to shift prompts towards a natural context, resulting in measurable improvements in model behavior. These findings highlight the need for more realistic evaluation methods to ensure model alignment.

Details

Motivation: The behavioral shifts in LLMs between real-world deployment and controlled evaluation settings, known as 'evaluation awareness,' pose a challenge for AI alignment. Benchmark performance may not accurately reflect a model's true behavior, necessitating a systematic approach to quantify these changes. Method: A methodology was introduced using a linear probe to score prompts on a 'test-like' to 'deploy-like' scale. An LLM rewriting strategy was employed to make prompts more natural while preserving the task. The impact of these rewritten prompts on model behavior was then evaluated. Result: The study found a 30% increase in average probe score after rewriting prompts. Models showed a 5.26% increase in honest responses, a 12.40% decrease in deceptive responses, and a 6.38% increase in refusal rates with 'deploy-like' prompts. Conclusion: Evaluation awareness significantly affects LLM behavior, with models showing more honest and safer responses in perceived deployment settings compared to test settings. This highlights the importance of realistic evaluation frameworks to accurately assess model alignment. Abstract: Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as "evaluation awareness." This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model's true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from "test-like" to "deploy-like" and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. 
Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten "deploy-like" prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.

[27] Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling

Rishiraj Acharya

Main category: cs.CL

TL;DR: The paper introduces Gated Associative Memory (GAM) network, a novel architecture for sequence modeling with linear complexity, demonstrating improved efficiency and performance over existing models.

Details

Motivation: Transformer's quadratic scaling creates a bottleneck for long contexts. Method: Replaced self-attention layer with causal convolution and parallel associative memory retrieval mechanism. Result: GAM is consistently faster and achieves superior or competitive validation perplexity. Conclusion: GAM is a promising and efficient alternative for sequence modeling. Abstract: The Transformer architecture, underpinned by the self-attention mechanism, has become the de facto standard for sequence modeling tasks. However, its core computational primitive scales quadratically with sequence length (O(N^2)), creating a significant bottleneck for processing long contexts. In this paper, we propose the Gated Associative Memory (GAM) network, a novel, fully parallel architecture for sequence modeling that exhibits linear complexity (O(N)) with respect to sequence length. The GAM block replaces the self-attention layer with two parallel pathways: a causal convolution to efficiently capture local, position-dependent context, and a parallel associative memory retrieval mechanism to model global, content-based patterns. These pathways are dynamically fused using a gating mechanism, allowing the model to flexibly combine local and global information for each token. We implement GAM from scratch and conduct a rigorous comparative analysis against a standard Transformer model and a modern linear-time baseline (Mamba) on the WikiText-2 benchmark, as well as against the Transformer on the TinyStories dataset. Our experiments demonstrate that GAM is consistently faster, outperforming both baselines on training speed, and achieves a superior or competitive final validation perplexity across all datasets, establishing it as a promising and efficient alternative for sequence modeling.
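
To make the two-pathway idea in the GAM abstract concrete, here is a toy sketch with one scalar feature per token. It is an illustration under stated assumptions, not the paper's architecture: the memory pathway is modeled as softmax attention over a fixed bank of key/value slots, the gate is a sigmoid of the token itself, and all parameter names (`kernel`, `keys`, `values`, `gate_w`, `gate_b`) are hypothetical. Because the slot count is fixed, cost grows linearly with sequence length.

```python
import math


def causal_conv(x, kernel):
    """1-D causal convolution: output t depends only on x[t], x[t-1], ..."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * x[t - k]
        out.append(acc)
    return out


def memory_read(x, keys, values):
    """Each token attends over a fixed memory bank; cost is O(N * slots)."""
    out = []
    for xt in x:
        scores = [xt * k for k in keys]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(w / z * v for w, v in zip(weights, values)))
    return out


def gam_block(x, kernel, keys, values, gate_w, gate_b):
    """Fuse the local (convolution) and global (memory) pathways per token."""
    local = causal_conv(x, kernel)
    glob = memory_read(x, keys, values)
    out = []
    for xt, loc, g in zip(x, local, glob):
        gate = 1.0 / (1.0 + math.exp(-(gate_w * xt + gate_b)))
        out.append(gate * loc + (1.0 - gate) * g)
    return out
```

Neither pathway looks at future positions, so the block stays causal, and no step compares every token with every other token, which is where the quadratic cost of self-attention comes from.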

[28] A Multi-Strategy Approach for AI-Generated Text Detection

Ali Zain,Sareem Farooqui,Muhammad Rafi

Main category: cs.CL

TL;DR: This paper describes three systems for detecting AI-generated content, with the RoBERTa-based model achieving the highest performance.

Details

Motivation: To detect AI-generated content in news articles and academic abstracts as part of the M-DAIGT shared task. Method: Developed three systems: a fine-tuned RoBERTa-base classifier, a TF-IDF + SVM classifier, and an ensemble model, Candace, using Llama-3.2 models with a custom Transformer encoder. Result: The RoBERTa-based system outperformed the other systems, showing excellent performance on both development and test sets. Conclusion: The RoBERTa-based system performed best, achieving near-perfect results on both development and test sets. Abstract: This paper presents three distinct systems developed for the M-DAIGT shared task on detecting AI-generated content in news articles and academic abstracts. The systems include: (1) a fine-tuned RoBERTa-base classifier, (2) a classical TF-IDF + Support Vector Machine (SVM) classifier, and (3) an innovative ensemble model named Candace, leveraging probabilistic features extracted from multiple Llama-3.2 models processed by a custom Transformer encoder. The RoBERTa-based system emerged as the most performant, achieving near-perfect results on both development and test sets.

[29] Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?

Md Tanzib Hosain,Md Kishor Morol

Main category: cs.CL

TL;DR: This paper introduces the ICPC benchmark for evaluating language models in competitive programming tasks and demonstrates that advanced inference techniques, along with specific instructions, can significantly improve model performance.

Details

Motivation: Competitive programming is a challenging domain that requires algorithmic thinking and problem-solving skills, yet it has not received sufficient attention as a benchmark for evaluating language models. Method: The researchers introduced the ICPC benchmark, which includes 254 programming contest tasks, and tested various language model inference techniques, including zero-shot chain-of-thought prompting, multi-turn self-judge with reflection, and retrieval over episodic information. They also conducted a human-in-the-loop investigation. Result: Using zero-shot chain-of-thought prompting, the o1 model achieved a 19.1% pass@1 solve rate. With the best inference technique, this was increased to 42.2%. Further, with specific instructions, o1 was able to solve 17 out of 18 previously unsolvable problems. Conclusion: The study concludes that while current language models face significant challenges in competitive programming tasks, their performance can be notably improved with advanced inference techniques and specific instructions. Abstract: Among the hardest tasks for humans are those found in competitive programming, where problems require sophisticated algorithmic thinking, puzzle solving, and the creation of effective code. As a domain for assessing language models (LMs), however, it has not received enough attention. This study presents the ICPC benchmark, which consists of 254 international collegiate programming contest (ICPC) tasks. Each problem includes official analysis, reference code, and sample, high-quality unit, and hidden tests. With these resources, we are able to develop and evaluate a variety of LM inference techniques for competitive programming. With zero-shot chain-of-thought prompting, we find that o1 only achieves a 19.1% pass@1 solve rate. Our best inference technique, which combines multi-turn self-judge with reflection and retrieval over episodic information, raises this to 42.2%.
Furthermore, we conduct a new human-in-the-loop investigation to gain a deeper understanding of the remaining difficulties. Surprisingly, we discover that o1 can solve 17 out of 18 problems that were previously unsolvable by any model or technique with just a few specific instructions. Our quantitative findings and qualitative research provide a step toward LMs with grounded, imaginative, and algorithmic thinking. We open-source our code and data at https://github.com/kraritt/zolve.

[30] Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech

Sanjeeevan Selvaganapathy,Mehwish Nasim

Main category: cs.CL

TL;DR: Censored LLMs detect hate speech more accurately than uncensored ones, but they are ideologically rigid and show fairness issues, undermining the idea of LLMs as unbiased arbiters.

Details

Motivation: To determine whether uncensored LLMs provide more objective hate speech classification compared to censored models and to assess the impact of safety alignment on performance. Method: The study compares the performance of censored and uncensored LLMs in detecting implicit and explicit hate speech, evaluating accuracy, robustness, ideological influence, and fairness. Result: Censored models achieved higher accuracy (78.7% vs. 64.1%) and robustness but were ideologically rigid, while uncensored models were more malleable but less accurate. All models struggled with irony, showed fairness disparities, and exhibited overconfidence. Conclusion: Censored models outperform uncensored models in detecting hate speech, but they are less ideologically malleable and demonstrate fairness disparities, challenging the idea of LLMs as objective arbiters. Abstract: We investigate the efficacy of Large Language Models (LLMs) in detecting implicit and explicit hate speech, examining whether models with minimal safety alignment (uncensored) might provide more objective classification capabilities compared to their heavily-aligned (censored) counterparts. While uncensored models theoretically offer a less constrained perspective free from moral guardrails that could bias classification decisions, our results reveal a surprising trade-off: censored models significantly outperform their uncensored counterparts in both accuracy and robustness, achieving 78.7% versus 64.1% strict accuracy. However, this enhanced performance comes with its own limitation -- the safety alignment acts as a strong ideological anchor, making censored models resistant to persona-based influence, while uncensored models prove highly malleable to ideological framing. Furthermore, we identify critical failures across all models in understanding nuanced language such as irony. 
We also find alarming fairness disparities in performance across different targeted groups and systemic overconfidence that renders self-reported certainty unreliable. These findings challenge the notion of LLMs as objective arbiters and highlight the need for more sophisticated auditing frameworks that account for fairness, calibration, and ideological consistency.

[31] Router Upcycling: Leveraging Mixture-of-Routers in Mixture-of-Experts Upcycling

Junfeng Ran,Guangxiang Zhao,Yuhan Wu,Dawei Zhu,Longyun Wu,Yikai Zhao,Tong Yang,Lin Sun,Xiangzheng Zhang,Sujian Li

Main category: cs.CL

TL;DR: This paper introduces Router Upcycling, a novel routing technique for Mixture-of-Experts models that improves performance by leveraging attention heads to initialize routers, leading to superior results in model upcycling.

Details

Motivation: Existing MoE upcycling techniques face challenges due to the limitations of simple routers in handling complex routing tasks, prompting the need for a more effective routing mechanism. Method: Router Upcycling initializes multiple routers from the attention heads of preceding attention layers, which collaboratively assign tokens to experts in an attention-like manner. Result: Experimental results show that the proposed method outperforms other upcycling baselines and achieves state-of-the-art performance. Conclusion: The proposed Router Upcycling method enhances the performance of MoE upcycling models by initializing multiple routers from attention heads, achieving state-of-the-art results. Abstract: The Mixture-of-Experts (MoE) models have gained significant attention in deep learning due to their dynamic resource allocation and superior performance across diverse tasks. However, efficiently training these models remains challenging. The MoE upcycling technique has been proposed to reuse and improve existing model components, thereby minimizing training overhead. Despite this, simple routers, such as linear routers, often struggle with complex routing tasks within MoE upcycling. In response, we propose a novel routing technique called Router Upcycling to enhance the performance of MoE upcycling models. Our approach initializes multiple routers from the attention heads of preceding attention layers during upcycling. These routers collaboratively assign tokens to specialized experts in an attention-like manner. Each token is processed into diverse queries and aligned with the experts' features (serving as keys). Experimental results demonstrate that our method achieves state-of-the-art (SOTA) performance, outperforming other upcycling baselines.
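The attention-like routing described in this abstract (each token projected into several queries and matched against expert features acting as keys) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `route`, the random projection matrices, the averaging of per-router distributions, and the top-k renormalization are all assumptions made for illustration. In the paper the routers are initialized from attention heads rather than drawn at random.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def route(token, query_projs, expert_keys, top_k=2):
    """Score experts with several query projections and return top-k gates.

    Each projection plays the role of one router: it maps the token to a
    query, which is compared with every expert key by scaled dot product.
    Averaging the per-router distributions is an assumed way to combine them.
    """
    n_experts = len(expert_keys)
    mixed = [0.0] * n_experts
    for proj in query_projs:
        query = matvec(proj, token)
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in expert_keys]
        for i, p in enumerate(softmax(scores)):
            mixed[i] += p / len(query_projs)
    top = sorted(range(n_experts), key=lambda i: mixed[i], reverse=True)[:top_k]
    norm = sum(mixed[i] for i in top)
    # Renormalize so the selected experts' gate weights sum to 1.
    return [(i, mixed[i] / norm) for i in top]


random.seed(0)
d_model, d_key, n_experts, n_routers = 4, 3, 4, 2
# Random stand-ins; the paper initializes routers from attention heads instead.
query_projs = [[[random.gauss(0, 1) for _ in range(d_model)]
                for _ in range(d_key)] for _ in range(n_routers)]
expert_keys = [[random.gauss(0, 1) for _ in range(d_key)]
               for _ in range(n_experts)]
token = [random.gauss(0, 1) for _ in range(d_model)]
print(route(token, query_projs, expert_keys))
```

The returned list pairs each selected expert index with its gate weight; a linear router would correspond to the special case of a single projection whose output is used directly as the expert scores.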

[32] Do small language models generate realistic variable-quality fake news headlines?

Austin McCutcheon,Chris Brogly

Main category: cs.CL

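TL;DR: This study evaluates the ability of 14 small language models to generate fake news headlines. The models generally comply with instructions to produce such headlines and show limited ethical restraint, yet the generated headlines bear little resemblance to real news headlines. Quality detection with DistilBERT and other classifiers yields low accuracy, indicating that current detection methods struggle to identify these fabricated headlines.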

Details

Motivation: Small language models (SLMs) may be used to generate falsified text online, so their potential to generate fake news headlines, and how closely those headlines resemble real ones, needs to be evaluated. Method: Using controlled prompt engineering, the study generated 24,000 deceptive news headlines in low-quality and high-quality categories, then applied existing machine learning and deep learning models to detect the quality of these headlines. Result: The SLMs showed high compliance rates in generating falsified headlines with limited ethical resistance, and existing classifiers detected the quality of the generated headlines with low accuracy, ranging from 35.2% to 63.5%. Conclusion: The tested SLMs are generally compliant in generating falsified news headlines, with slight variations in ethical restraint, and the low quality-classification accuracy suggests that the generated headlines do not closely resemble primarily human-written web content. Abstract: Small language models (SLMs) have the capability for text generation and may potentially be used to generate falsified texts online. This study evaluates 14 SLMs (1.7B-14B parameters) including LLaMA, Gemma, Phi, SmolLM, Mistral, and Granite families in generating perceived low and high quality fake news headlines when explicitly prompted, and whether they appear to be similar to real-world news headlines. Using controlled prompt engineering, 24,000 headlines were generated across low-quality and high-quality deceptive categories. Existing machine learning and deep learning-based news headline quality detectors were then applied against these SLM-generated fake news headlines. SLMs demonstrated high compliance rates with minimal ethical resistance, though there were some occasional exceptions. Headline quality detection using established DistilBERT and bagging classifier models showed that quality misclassification was common, with detection accuracies only ranging from 35.2% to 63.5%. These findings suggest the following: tested SLMs generally are compliant in generating falsified headlines, although there are slight variations in ethical restraints, and the generated headlines did not closely resemble existing primarily human-written content on the web, given the low quality classification accuracy.

[33] Text Reinforcement for Multimodal Time Series Forecasting

Chen Su,Yuanhe Tian,Yan Song,Yongdong Zhang

Main category: cs.CL

TL;DR: This paper proposes a text reinforcement model (TeR) with reinforcement learning to enhance textual inputs for multimodal time series forecasting, leading to improved forecasting performance.

Details

Motivation: Multimodal time series forecasting often suffers from unstable performance due to incomplete or low-quality textual inputs. This work aims to enhance the textual modality to improve forecasting accuracy and stability. Method: The paper introduces a text reinforcement model (TeR) combined with a reinforcement learning approach to generate enhanced textual data. This reinforced text is then used to improve the understanding of the multimodal TSF model. Performance is evaluated through experiments on a real-world benchmark dataset. Result: Extensive experiments show that the proposed approach outperforms strong baselines and existing methods on a real-world benchmark dataset across various domains. Conclusion: The proposed TeR model effectively improves the performance of multimodal time series forecasting by generating reinforced text that compensates for weaknesses in the original textual inputs. Abstract: Recent studies in time series forecasting (TSF) use multimodal inputs, such as text and historical time series data, to predict future values. These studies mainly focus on developing advanced techniques to integrate textual information with time series data to perform the task and achieve promising results. Meanwhile, these approaches rely on high-quality text and time series inputs, whereas in some cases, the text does not accurately or fully capture the information carried by the historical time series, which leads to unstable performance in multimodal TSF. Therefore, it is necessary to enhance the textual content to improve the performance of multimodal TSF. In this paper, we propose improving multimodal TSF by reinforcing the text modalities. We propose a text reinforcement model (TeR) to generate reinforced text that addresses potential weaknesses in the original text, then apply this reinforced text to support the multimodal TSF model's understanding of the time series, improving TSF performance. 
To guide the TeR toward producing higher-quality reinforced text, we design a reinforcement learning approach that assigns rewards based on the impact of each reinforced text on the performance of the multimodal TSF model and its relevance to the TSF task. We optimize the TeR accordingly, so as to improve the quality of the generated reinforced text and enhance TSF performance. Extensive experiments on a real-world benchmark dataset covering various domains demonstrate the effectiveness of our approach, which outperforms strong baselines and existing studies on the dataset.

[34] CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

Alex Gulko,Yusen Peng,Sachin Kumar

Main category: cs.CL

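TL;DR: This study proposes CE-Bench, a new evaluation benchmark for sparse autoencoders that measures their interpretability and has been open-sourced. It does not rely on an external large language model and agrees closely with existing methods.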

Details

Motivation: Sparse autoencoders show promise for uncovering interpretable features in large language models, but their development has been limited by the lack of automated evaluation methods. Method: The authors build an evaluation dataset of contrastive story pairs and conduct comprehensive ablation studies to validate the approach. Result: CE-Bench successfully measures the interpretability of sparse autoencoders and is consistent with existing benchmarks, without relying on an external LLM. Conclusion: CE-Bench is a novel, lightweight contrastive evaluation benchmark that reliably measures the interpretability of sparse autoencoders and aligns closely with existing benchmarks, without requiring an external LLM. Abstract: Probing with sparse autoencoders is a promising approach for uncovering interpretable features in large language models (LLMs). However, the lack of automated evaluation methods has hindered their broader adoption and development. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive ablation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks, all without requiring an external LLM. The official implementation and evaluation dataset are open-sourced under the MIT License.

[35] Learning to Shop Like Humans: A Review-driven Retrieval-Augmented Recommendation Framework with LLMs

Kaiwen Wei,Jinpeng Gao,Jiang Zhong,Yuming Yang,Fengmao Lv,Zhenyang Li

Main category: cs.CL

TL;DR: RevBrowse improves LLM-based recommendation systems by dynamically integrating relevant user reviews through a retrieval-augmented module, enhancing performance and transparency.

Details

Motivation: Incorporating user reviews into LLM-based recommendation is inefficient under constrained context windows, and existing approaches lack a mechanism for prioritizing the reviews most relevant to the user's current decision. Method: The paper proposes RevBrowse, a review-driven recommendation framework that integrates user reviews into the LLM-based reranking process. It introduces PrefRAG, a retrieval-augmented module that disentangles user and item representations for adaptive retrieval of preference-relevant content. Result: Experiments on four Amazon review datasets showed that RevBrowse consistently outperforms strong baselines, demonstrating its effectiveness in modeling dynamic user preferences while offering interpretability. Conclusion: RevBrowse offers a transparent and effective framework for integrating user reviews into LLM-based recommendations, enhancing performance and interpretability. Abstract: Large language models (LLMs) have shown strong potential in recommendation tasks due to their strengths in language understanding, reasoning and knowledge integration. These capabilities are especially beneficial for review-based recommendation, which relies on semantically rich user-generated texts to reveal fine-grained user preferences and item attributes. However, effectively incorporating reviews into LLM-based recommendation remains challenging due to (1) the inefficiency of dynamically utilizing user reviews under LLMs' constrained context windows, and (2) the lack of effective mechanisms to prioritize reviews most relevant to the user's current decision context. To address these challenges, we propose RevBrowse, a review-driven recommendation framework inspired by the "browse-then-decide" decision process commonly observed in online user behavior. RevBrowse integrates user reviews into the LLM-based reranking process to enhance its ability to distinguish between candidate items. 
To improve the relevance and efficiency of review usage, we introduce PrefRAG, a retrieval-augmented module that disentangles user and item representations into structured forms and adaptively retrieves preference-relevant content conditioned on the target item. Extensive experiments on four Amazon review datasets demonstrate that RevBrowse achieves consistent and significant improvements over strong baselines, highlighting its generalizability and effectiveness in modeling dynamic user preferences. Furthermore, since the retrieval-augmented process is transparent, RevBrowse offers a certain level of interpretability by making visible which reviews influence the final recommendation.

[36] Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs

Daehoon Gwak,Minseo Jung,Junwoo Park,Minho Park,ChaeHun Park,Junha Hyung,Jaegul Choo

Main category: cs.CL

TL;DR: This paper proposes Reward-Weighted Sampling (RWS), a decoding strategy for masked diffusion models (MDMs) that leverages an external reward model to guide token selection and enhance non-autoregressive generation, resulting in improved performance.

Details

Motivation: Standard decoding methods for MDMs often lead to generation orders similar to autoregressive processes, limiting the advantages of non-autoregressive modeling. This necessitates a method that can better exploit the potential of non-autoregressive approaches. Method: The proposed RWS method uses an external reward model to evaluate the quality of intermediate sequences and scales token logits accordingly during the diffusion process, promoting more non-autoregressive generation orders. Result: Experiments show that RWS significantly improves non-autoregressive generation orders and achieves better performance across multiple evaluation metrics. Conclusion: Reward-Weighted Sampling (RWS) effectively enhances the non-autoregressive properties and overall performance of masked diffusion models (MDMs) by integrating global sequence-level coherence. Abstract: Masked diffusion models (MDMs) offer a promising non-autoregressive alternative for large language modeling. Standard decoding methods for MDMs, such as confidence-based sampling, select tokens independently based on individual token confidences at each diffusion step. However, we observe that this independent token selection often results in generation orders resembling sequential autoregressive processes, limiting the advantages of non-autoregressive modeling. To mitigate this phenomenon, we propose Reward-Weighted Sampling (RWS), a novel decoding strategy that leverages an external reward model to provide a principled global signal during the iterative diffusion process. Specifically, at each diffusion step, RWS evaluates the quality of the entire intermediate sequence and scales token logits accordingly, guiding token selection by integrating global sequence-level coherence. This method selectively increases the confidence of tokens that initially have lower scores, thereby promoting a more non-autoregressive generation order. 
Furthermore, we provide theoretical justification showing that reward-weighted logit scaling induces beneficial rank reversals in token selection and consistently improves expected reward. Experiments demonstrate that RWS significantly promotes non-autoregressive generation orders, leading to improvements across multiple evaluation metrics. These results highlight the effectiveness of integrating global signals in enhancing both the non-autoregressive properties and overall performance of MDMs.
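To make the rank-reversal idea concrete, here is a minimal sketch of confidence-based position selection with a reward-dependent logit scale. The abstract does not give the scaling formula, so the form `scale = 1 + alpha * reward`, the function name `pick_position`, and the toy logits are assumptions for illustration only, not the authors' method.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pick_position(logits_by_pos, reward, alpha=1.0):
    """Return (position, token_id) with the highest confidence.

    logits_by_pos maps each still-masked position to its token logits.
    reward is a scalar score for the current intermediate sequence, as an
    external reward model might produce. The multiplicative form
    1 + alpha * reward is an assumed scaling rule.
    """
    scale = 1.0 + alpha * reward
    best = None
    for pos, logits in logits_by_pos.items():
        probs = softmax([scale * x for x in logits])
        tok = max(range(len(probs)), key=probs.__getitem__)
        if best is None or probs[tok] > best[0]:
            best = (probs[tok], pos, tok)
    return best[1], best[2]


# Toy example: position 0 has one strong rival token, position 1 has a
# larger top logit but two weak rivals.
masked = {0: [1.0, 0.0, -20.0], 1: [1.3, 0.0, 0.0]}
print(pick_position(masked, reward=0.0))  # unscaled confidence-based choice
print(pick_position(masked, reward=3.0))  # choice after reward scaling
```

In this toy case the unscaled rule unmasks position 0 first, while the scaled rule prefers position 1, because sharpening the distributions changes which position has the most confident token. That is one simple way a scalar scale on logits can reorder token selection.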

[37] Designing LMS and Instructional Strategies for Integrating Generative-Conversational AI

Elias Ra,Seung Je Kim,Eui-Yeong Seo,Geunju So

Main category: cs.CL

TL;DR: This study presents a framework for an AI-powered Learning Management System that uses generative and conversational AI to provide adaptive, interactive, and learner-centered instruction in higher education.

Details

Motivation: The motivation of the study is to address the challenges in delivering personalized, scalable, and pedagogically coherent learning experiences in higher education by introducing an AI-powered Learning Management System. Method: The study uses a design-based research (DBR) methodology, which includes phases like literature review, SWOT analysis, development of ethical-pedagogical principles, system design, and instructional strategy formulation. Result: The result of the study is the creation of an AI-LMS with modular components like configurable prompts, adaptive feedback loops, and multi-agent conversation flows, which are aligned with pedagogical paradigms including behaviorist, constructivist, and connectivist learning theories. Conclusion: The study concludes that combining AI capabilities with human-centered design and ethical safeguards can lead to a practical model for AI integration in education. Abstract: Higher education faces growing challenges in delivering personalized, scalable, and pedagogically coherent learning experiences. This study introduces a structured framework for designing an AI-powered Learning Management System (AI-LMS) that integrates generative and conversational AI to support adaptive, interactive, and learner-centered instruction. Using a design-based research (DBR) methodology, the framework unfolds through five phases: literature review, SWOT analysis, development of ethical-pedagogical principles, system design, and instructional strategy formulation. The resulting AI-LMS features modular components -- including configurable prompts, adaptive feedback loops, and multi-agent conversation flows -- aligned with pedagogical paradigms such as behaviorist, constructivist, and connectivist learning theories. By combining AI capabilities with human-centered design and ethical safeguards, this study advances a practical model for AI integration in education. 
Future research will validate and refine the system through real-world implementation.

[38] LLM Encoder vs. Decoder: Robust Detection of Chinese AI-Generated Text with LoRA

Houji Jin,Negin Ashrafi,Armin Abdollahi,Wei Liu,Jian Wang,Ganyu Gui,Maryam Pishgar,Huanghao Feng

Main category: cs.CL

TL;DR: This study compares different models for detecting Chinese AI-generated text, finding that the LoRA-adapted Qwen2.5-7B outperforms other models with 95.94% accuracy, highlighting its robustness and potential for future improvements.

Details

Motivation: The motivation stems from the increasing demand for accurate detection of AI-generated text, particularly in Chinese, where linguistic nuances challenge current methods. Method: The study systematically compares encoder-based Transformers (Chinese BERT-large and RoBERTa-wwm-ext-large), a decoder-only LLM (Qwen2.5-7B fine-tuned via LoRA), and a FastText baseline. Encoder models are fine-tuned using a prompt-based masked language modeling approach, while Qwen2.5-7B is adapted for classification using instruction-format input and a lightweight classification head. Result: The LoRA-adapted Qwen2.5-7B achieves 95.94% test accuracy, significantly outperforming encoder-based models (RoBERTa: 76.3%, BERT: 79.3%) and FastText (83.5%). Encoder models show performance degradation under distribution shifts, while FastText lacks deeper semantic understanding. Conclusion: The study concludes that decoder-based LLMs, specifically Qwen2.5-7B adapted via LoRA, demonstrate superior performance in detecting Chinese AI-generated text, highlighting the potential for future enhancements with next-generation models and ensemble strategies. Abstract: The rapid growth of large language models (LLMs) has heightened the demand for accurate detection of AI-generated text, particularly in languages like Chinese, where subtle linguistic nuances pose significant challenges to current methods. In this study, we conduct a systematic comparison of encoder-based Transformers (Chinese BERT-large and RoBERTa-wwm-ext-large), a decoder-only LLM (Alibaba's Qwen2.5-7B/DeepSeek-R1-Distill-Qwen-7B fine-tuned via Low-Rank Adaptation, LoRA), and a FastText baseline using the publicly available dataset from the NLPCC 2025 Chinese AI-Generated Text Detection Task. Encoder models were fine-tuned using a novel prompt-based masked language modeling approach, while Qwen2.5-7B was adapted for classification with an instruction-format input and a lightweight classification head trained via LoRA. 
Experiments reveal that although encoder models nearly memorize training data, they suffer significant performance degradation under distribution shifts (RoBERTa: 76.3% test accuracy; BERT: 79.3%). FastText demonstrates surprising lexical robustness (83.5% accuracy) yet lacks deeper semantic understanding. In contrast, the LoRA-adapted Qwen2.5-7B achieves 95.94% test accuracy with balanced precision-recall metrics, indicating superior generalization and resilience to dataset-specific artifacts. These findings underscore the efficacy of decoder-based LLMs with parameter-efficient fine-tuning for robust Chinese AI-generated text detection. Future work will explore next-generation Qwen3 models, distilled variants, and ensemble strategies to enhance cross-domain robustness further.

[39] Decomposing and Revising What Language Models Generate

Zhichao Yan,Jiaoyan Chen,Jiapu Wang,Xiaoli Li,Ru Li,Jeff Z. Pan

Main category: cs.CL

TL;DR: FIDES is a new fact decomposition-based framework for attributed QA that uses a contextually enhanced decomposition method to improve retrieval of evidence snippets.

Details

Motivation: Attribution is crucial in question answering (QA) with Large Language Models (LLMs). SOTA approaches often generate irrelevant and incomplete questions, leading to loss of facts in retrieval. Method: FIDES uses a contextually enhanced two-stage faithful decomposition method to decompose long form answers into sub-facts, which are then used by a retriever to retrieve related evidence snippets. Result: Extensive evaluation has been conducted with six datasets, with an additionally proposed new metric called Attr_auto-P for evaluating the evidence precision. Conclusion: FIDES outperforms SOTA methods by over 14% on average with GPT-3.5-turbo, Gemini and Llama 70B series. Abstract: Attribution is crucial in question answering (QA) with Large Language Models (LLMs). SOTA question decomposition-based approaches use long form answers to generate questions for retrieving related documents. However, the generated questions are often irrelevant and incomplete, resulting in a loss of facts in retrieval. These approaches also fail to aggregate evidence snippets from different documents and paragraphs. To tackle these problems, we propose a new fact decomposition-based framework called FIDES (faithful context enhanced fact decomposition and evidence aggregation) for attributed QA. FIDES uses a contextually enhanced two-stage faithful decomposition method to decompose long form answers into sub-facts, which are then used by a retriever to retrieve related evidence snippets. If the retrieved evidence snippets conflict with the related sub-facts, such sub-facts will be revised accordingly. Finally, the evidence snippets are aggregated according to the original sentences. Extensive evaluation has been conducted with six datasets, with an additionally proposed new metric called $Attr_{auto-P}$ for evaluating the evidence precision. FIDES outperforms the SOTA methods by over 14% on average with GPT-3.5-turbo, Gemini and Llama 70B series.

[40] LegalChainReasoner: A Legal Chain-guided Framework for Criminal Judicial Opinion Generation

Weizhe Shi,Qiqi Wang,Yihong Pan,Qian Liu,Kaiqi Zhao

Main category: cs.CL

TL;DR: This paper introduces LegalChainReasoner, a new framework for automatically generating judicial opinions that combine legal reasoning and sentencing decisions, improving consistency and performance over existing methods.

Details

Motivation: Current research on automatically generating judicial opinions separates legal reasoning and sentencing prediction into isolated tasks, leading to inconsistencies. The authors aim to address this issue by proposing a more integrated and practical approach. Method: The authors introduced a framework called LegalChainReasoner that uses structured legal chains to guide the generation of judicial opinions, integrating factual premises, composite legal conditions, and sentencing conclusions. Result: Experiments on two real-world Chinese legal case datasets showed that the LegalChainReasoner framework outperforms existing baseline models in generating consistent and legally sound judicial opinions. Conclusion: The proposed LegalChainReasoner framework effectively generates judicial opinions by simultaneously producing legal reasoning and sentencing decisions, outperforming baseline models in experiments. Abstract: A criminal judicial opinion represents the judge's disposition of a case, including the decision rationale and sentencing. Automatically generating such opinions can assist in analyzing sentencing consistency and provide judges with references to similar past cases. However, current research typically approaches this task by dividing it into two isolated subtasks: legal reasoning and sentencing prediction. This separation often leads to inconsistency between the reasoning and predictions, failing to meet real-world judicial requirements. Furthermore, prior studies rely on manually curated knowledge to enhance applicability, yet such methods remain limited in practical deployment. To address these limitations and better align with legal practice, we propose a new LegalAI task: Judicial Opinion Generation, which simultaneously produces both legal reasoning and sentencing decisions. To achieve this, we introduce LegalChainReasoner, a framework that applies structured legal chains to guide the model through comprehensive case assessments. 
By integrating factual premises, composite legal conditions, and sentencing conclusions, our approach ensures flexible knowledge injection and end-to-end opinion generation. Experiments on two real-world and open-source Chinese legal case datasets demonstrate that our method outperforms baseline models.

[41] CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA

Reem Abdel-Salam,Mary Adewunmi,Modinat A. Abayomi

Main category: cs.CL

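TL;DR: This paper studies a biomedical multi-hop question answering system based on LLaMA 3 8B, exploring fine-tuning strategies and inference pipeline optimization. Although the model shows good semantic understanding, it still struggles to generate strictly formatted answers.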

Details

Motivation: Before large language models are applied in biomedicine and healthcare, their complex question-answering capabilities need rigorous evaluation. The MedHopQA task requires models to handle multi-hop questions involving diseases, genes, and chemicals, which places higher demands on their reasoning ability. Method: The authors apply supervised fine-tuning to LLaMA 3 8B with a biomedical question-answer dataset compiled from external sources including BioASQ, MedQuAD, and TREC. They explore three experimental setups: fine-tuning on combined short and long answers, short answers only, and long answers only. They also introduce a two-stage inference pipeline to make short-answer extraction more precise. Result: The models reach concept-level accuracy of up to 0.8, showing strong domain understanding, but score poorly on Exact Match (EM), especially in the test phase. The two-stage inference pipeline partially improves short-answer extraction, but generating strictly formatted output remains unresolved. Conclusion: In the biomedical domain, LLMs show strong semantic understanding but still struggle to generate strictly formatted output, which points to the need for further research on output control and post-processing strategies. Abstract: Large language models (LLMs) are increasingly evident for accurate question answering across various domains. However, rigorous evaluation of their performance on complex question-answering (QA) capabilities is essential before deployment in real-world biomedical and healthcare applications. This paper presents our approach to the MedHopQA track of the BioCreative IX shared task, which focuses on multi-hop biomedical question answering involving diseases, genes, and chemicals. We adopt a supervised fine-tuning strategy leveraging LLaMA 3 8B, enhanced with a curated biomedical question-answer dataset compiled from external sources including BioASQ, MedQuAD, and TREC. Three experimental setups are explored: fine-tuning on combined short and long answers, short answers only, and long answers only. While our models demonstrate strong domain understanding, achieving concept-level accuracy scores of up to 0.8, their Exact Match (EM) scores remain significantly lower, particularly in the test phase. We introduce a two-stage inference pipeline for precise short-answer extraction to mitigate verbosity and improve alignment with evaluation metrics. Despite partial improvements, challenges persist in generating strictly formatted outputs. Our findings highlight the gap between semantic understanding and exact answer evaluation in biomedical LLM applications, motivating further research in output control and post-processing strategies.
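The gap between exact-match scoring and more lenient scoring that this abstract describes can be illustrated with a small sketch. The shared task's official scorers are not specified here, so `exact_match` and `contains_answer` below are assumed stand-ins: the first compares normalized strings, the second only checks that the gold answer appears as a whole phrase inside the prediction (a rough proxy for concept-level credit). The example question and answers are hypothetical.

```python
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)


def contains_answer(prediction, gold):
    # Pad with spaces so the gold answer must match on word boundaries.
    return f" {normalize(gold)} " in f" {normalize(prediction)} "


gold = "BRCA1"
verbose = "The gene most likely involved is BRCA1."
short = "BRCA1"

for pred in (verbose, short):
    print(pred, exact_match(pred, gold), contains_answer(pred, gold))
```

A verbose but correct answer earns lenient credit and fails exact match, which is the kind of mismatch a short-answer extraction stage is meant to reduce.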

[42] TMT: A Simple Way to Translate Topic Models Using Dictionaries

Felix Engl,Andreas Henrich

Main category: cs.CL

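TL;DR: This paper proposes TMT, a method for transferring topic models across languages that achieves efficient, semantically consistent topic translation when data is limited.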

Details

Motivation: Training topic models for a multilingual environment is challenging, and harder still when knowledge of the target language is limited or data is scarce. Method: The authors propose TMT, a new method for transferring topic models across languages, and evaluate it extensively with quantitative and qualitative methods. Result: TMT produces semantically consistent and coherent topic translations without requiring large target-language corpora or manual translation, making it especially suitable for low-resource settings. Conclusion: TMT is a novel, robust, and transparent technique for transferring topic models from one language to another without metadata, embeddings, or aligned corpora. Abstract: The training of topic models for a multilingual environment is a challenging task, requiring the use of sophisticated algorithms, topic-aligned corpora, and manual evaluation. These difficulties are further exacerbated when the developer lacks knowledge of the target language or is working in an environment with limited data, where only small or unusable multilingual corpora are available. Considering these challenges, we introduce Topic Model Translation (TMT), a novel, robust and transparent technique designed to transfer topic models (e.g., Latent Dirichlet Allocation (LDA) based topic models) from one language to another, without the need for metadata, embeddings, or aligned corpora. TMT enables the reuse of topic models across languages, making it especially suitable for scenarios where large corpora in the target language are unavailable or manual translation is infeasible. Furthermore, we evaluate TMT extensively using both quantitative and qualitative methods, demonstrating that it produces semantically coherent and consistent topic translations.

[43] Neural Models and Language Model Prompting for the Multidimensional Evaluation of Open-Ended Conversations

Michelle Elizabeth,Alicja Kasicka,Natalia Krawczyk,Magalie Ochs,Gwénolé Lecorvé,Justyna Gromada,Lina M. Rojas-Barahona

Main category: cs.CL

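TL;DR: This paper addresses the development of models that predict dialogue-level scores for the Dialogue System Technology Challenge (DSTC-12, Track 1), comparing prompt-based generative models with encoder-based regression and classification models.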

Details

Motivation: The rapid growth of generative AI dialogue systems has made their evaluation a key challenge; this paper explores how to evaluate dialogue quality effectively under a parameter budget. Method: The paper follows two strategies: prompting language models to act as evaluators, and training encoder-based classification and regression models to predict scores. Result: The prompt-based language models achieved only modest correlations on the test set yet ranked second, while the encoder-based models showed high correlation on the validation set but degraded on the test set, possibly because the test set's score ranges differ substantially from those of the training and validation sets. Conclusion: Although the prompt-based language models performed only moderately, the encoder-based models show promise with far fewer parameters, particularly on the validation set. Abstract: The growing number of generative AI-based dialogue systems has made their evaluation a crucial challenge. This paper presents our contribution to this important problem through the Dialogue System Technology Challenge (DSTC-12, Track 1), where we developed models to predict dialogue-level, dimension-specific scores. Given the constraint of using relatively small models (i.e. fewer than 13 billion parameters) our work follows two main strategies: employing Language Models (LMs) as evaluators through prompting, and training encoder-based classification and regression models. Our results show that while LM prompting achieves only modest correlations with human judgments, it still ranks second on the test set, outperformed only by the baseline. The regression and classification models, with significantly fewer parameters, demonstrate high correlation for some dimensions on the validation set. Although their performance decreases on the test set, it is important to note that the test set contains annotations with significantly different score ranges for some of the dimensions with respect to the train and validation sets.

[44] Negative Matters: Multi-Granularity Hard-Negative Synthesis and Anchor-Token-Aware Pooling for Enhanced Text Embeddings

Tengyu Pan,Zhichao Duan,Zhenyu Li,Bowen Dong,Ning Liu,Xiuxing Li,Jianyong Wang

Main category: cs.CL

TL;DR: This paper proposes a Multi-Granularity Hard-negative synthesis framework and an Anchor Token Aware pooling method to enhance text embedding models, achieving state-of-the-art results on the MTEB benchmark through improved contrastive learning strategies.

Details

Motivation: The motivation is to enhance the ability of text embedding models to discern subtle semantic distinctions by generating more effective negative samples for contrastive learning. Method: The paper introduces a Multi-Granularity Hard-negative (MGH) synthesis framework using large language models (LLMs) to generate diverse negative samples and proposes an Anchor Token Aware (ATA) pooling method to improve text embedding accuracy. Result: Comprehensive experiments on the MTEB benchmark show that the proposed methods achieve state-of-the-art performance, outperforming existing synthesis strategies with both synthetic data and public retrieval datasets. Conclusion: The proposed MGH synthesis framework and ATA pooling method significantly enhance the performance of text embedding models, achieving state-of-the-art results on the MTEB benchmark. Abstract: Text embedding models are essential for various natural language processing tasks, enabling the effective encoding of semantic information into dense vector representations. These models are typically optimized using triplets of (query, positive, negative) data pairs for contrastive learning, where the negative samples play a critical role in enhancing the model's ability to discern subtle semantic distinctions. In this work, we introduce a Multi-Granularity Hard-negative (MGH) synthesis framework that leverages large language models (LLMs) to generate diverse negative samples with varying levels of similarity with the query. This approach facilitates a coarse-to-fine curriculum learning strategy during supervised training, allowing the embedding model to progressively learn more nuanced semantic representations. Meanwhile, we propose an Anchor Token Aware (ATA) pooling method that assigns higher weights to anchor tokens based on aggregation patterns observed in LLMs, improving text embedding accuracy without increasing model complexity. 
Comprehensive experiments on the MTEB benchmark demonstrate that our methods achieve state-of-the-art performance, surpassing existing synthesis strategies both with synthetic data and when combined with public retrieval datasets.
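
The abstract does not spell out how ATA pooling computes its weights, so the following is only a minimal sketch of the general idea: replace uniform mean pooling with a weighted mean in which tokens that receive more attention count more. The helper names `attention_received` and `weighted_pool`, and the use of attention column sums as the weighting signal, are assumptions made for illustration rather than the authors' formulation.

```python
from typing import List


def attention_received(attn: List[List[float]]) -> List[float]:
    """Column sums of an attention matrix: total attention each token receives.

    attn[q][k] is the attention that query position q pays to key position k.
    """
    n = len(attn)
    return [sum(attn[q][k] for q in range(n)) for k in range(n)]


def weighted_pool(token_vecs: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted mean of token vectors; weights are normalised to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    dim = len(token_vecs[0])
    pooled = [0.0] * dim
    for vec, w in zip(token_vecs, weights):
        for i in range(dim):
            pooled[i] += (w / total) * vec[i]
    return pooled


# Toy example: three tokens, 2-dimensional hidden states, causal attention.
token_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.5, 0.25, 0.25],
]
weights = attention_received(attn)           # the first token acts as the "anchor"
pooled = weighted_pool(token_vecs, weights)  # embedding dominated by the anchor
```

In this toy input the first token receives most of the attention mass, so it dominates the pooled embedding, whereas plain mean pooling would weight all three tokens equally.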

[45] Prompting Away Stereotypes? Evaluating Bias in Text-to-Image Models for Occupations

Shaina Raza,Maximus Powers,Partha Pratim Saha,Mahveen Raza,Rizwan Qureshi

Main category: cs.CL

TL;DR: This paper evaluates how prompting strategies influence demographic representation in TTI models, showing that while prompting can improve diversity, its effectiveness varies significantly across models.

Details

Motivation: TTI models risk amplifying harmful social biases, necessitating assessment of representational societal bias. Method: The authors introduce a pilot benchmark of occupational portrayals and compare neutral baseline prompts against fairness-aware controlled prompts across five state-of-the-art models. Result: Some models diversified demographic representations effectively, others overcorrected into unrealistic uniformity, and some showed little responsiveness. Conclusion: Prompting can substantially shift demographic representations, but with highly model-specific effects, highlighting both the promise and limitations of prompting as a fairness intervention. Abstract: Text-to-Image (TTI) models are powerful creative tools but risk amplifying harmful social biases. We frame representational societal bias assessment as an image curation and evaluation task and introduce a pilot benchmark of occupational portrayals spanning five socially salient roles (CEO, Nurse, Software Engineer, Teacher, Athlete). Using five state-of-the-art models: closed-source (DALLE 3, Gemini Imagen 4.0) and open-source (FLUX.1-dev, Stable Diffusion XL Turbo, Grok-2 Image), we compare neutral baseline prompts against fairness-aware controlled prompts designed to encourage demographic diversity. All outputs are annotated for gender (male, female) and race (Asian, Black, White), enabling structured distributional analysis. Results show that prompting can substantially shift demographic representations, but with highly model-specific effects: some systems diversify effectively, others overcorrect into unrealistic uniformity, and some show little responsiveness. These findings highlight both the promise and the limitations of prompting as a fairness intervention, underscoring the need for complementary model-level strategies. We release all code and data for transparency and reproducibility: https://github.com/maximus-powers/img-gen-bias-analysis.

[46] Exploring and Mitigating Fawning Hallucinations in Large Language Models

Zixuan Shangguan,Yanjie Dong,Lanjun Wang,Xiaoyi Fan,Victor C. M. Leung,Xiping Hu

Main category: cs.CL

TL;DR: This paper proposes a collaborative contrastive decoding method to reduce fawning hallucinations in large language models, improving the accuracy and truthfulness of their responses across different tasks.

Details

Motivation: The motivation is to address the issue of fawning hallucinations in LLMs, where models prioritize alignment with the input's implied perspective over accuracy and truthfulness, leading to responses that deviate from factual information. Method: The authors propose a collaborative contrastive decoding (CCD) method to mitigate fawning hallucinations in LLMs. They design two paradigms to generate deceptive and/or misleading inputs for fawning hallucinations induction and contrast the deviation in output distribution between induced and transformed neutral inputs. Result: Extensive experiments demonstrate that the proposed CCD method can effectively mitigate fawning hallucinations and improve the factuality of generated responses over various natural language processing tasks. Conclusion: The study concludes that the proposed CCD method can effectively mitigate fawning hallucinations in LLMs, reducing reliance on deceptive and/or misleading information without requiring additional training. Abstract: Large language models (LLMs) have demonstrated exceptional proficiency in language understanding. However, when LLMs align their outputs with deceptive and/or misleading prompts, the generated responses could deviate from the de facto information. Such observations are known as fawning hallucinations, where the model prioritizes alignment with the input's implied perspective over accuracy and truthfulness. In this work, we analyze fawning hallucinations in various natural language processing tasks and tailor the so-termed contrastive decoding method for fawning-hallucination mitigation. Specifically, we design two paradigms to generate corresponding deceptive and/or misleading inputs for the consistent fawning hallucinations induction. Then, we propose the collaborative contrastive decoding (CCD) to handle the fawning hallucinations across different tasks in LLMs. 
By contrasting the deviation in output distribution between induced and transformed neutral inputs, the proposed CCD can reduce reliance on deceptive and/or misleading information without requiring additional training. Extensive experiments demonstrate that the proposed CCD can effectively mitigate fawning hallucinations and improve the factuality of the generated responses over various tasks.
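
The abstract does not give the exact CCD formula, so the sketch below shows only the generic contrastive decoding step that such methods build on: next-token logits from a neutral input are pushed away from the logits obtained under a misleading input. The function name `contrastive_logits`, the weight `alpha`, and the toy logits are assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def contrastive_logits(neutral: List[float], induced: List[float],
                       alpha: float = 1.0) -> List[float]:
    """Amplify what the neutral input supports and penalise what the
    misleading (induced) input pushes toward."""
    return [(1 + alpha) * n - alpha * i for n, i in zip(neutral, induced)]


# Toy two-token vocabulary; the misleading prompt pushes the model toward "yes".
vocab = ["yes", "no"]
neutral = [0.0, 1.0]   # hypothetical logits for the neutral prompt
induced = [2.0, 1.0]   # hypothetical logits for the misleading prompt

adjusted = contrastive_logits(neutral, induced, alpha=1.0)
probs = softmax(adjusted)
choice = vocab[probs.index(max(probs))]
```

Because the adjustment happens at decoding time on logits alone, it requires no additional training, which matches the property the abstract highlights.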

[47] EviNote-RAG: Enhancing RAG Models via Answer-Supportive Evidence Notes

Yuqin Dai,Guoqing Wang,Yuan Wang,Kairan Dou,Kaichen Zhou,Zhanwei Zhang,Shuo Yang,Fei Tang,Jun Yin,Pengyu Zeng,Zhenzhe Ying,Can Yi,Changhua Meng,Yuchen Zhou,Yongliang Shen,Shuai Lu

Main category: cs.CL

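TL;DR: EviNote-RAG is a new method that improves the open-domain question answering performance of large language models through a structured retrieve-note-answer pipeline.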

Details

Motivation: The conventional retrieve-then-answer paradigm has two key limitations: a low signal-to-noise ratio in retrieved evidence, and error accumulation in multi-hop reasoning when incomplete or noisy passages are involved. Method: The paper introduces Supportive-Evidence Notes (SENs) and an Evidence Quality Reward (EQR) to optimize the reasoning process. Result: EviNote-RAG achieves state-of-the-art results on multiple QA benchmarks, with relative F1 gains of 20%, 40%, and 91% on HotpotQA, Bamboogle, and 2Wiki, respectively. Conclusion: Through its structured retrieve-note-answer pipeline, EviNote-RAG improves the accuracy, generalization, and training stability of large language models in open-domain QA, while enhancing the reliability and efficiency of reasoning. Abstract: Large Language Models (LLMs) empowered with retrieval mechanisms have achieved strong progress in open-domain question answering (QA). Yet, the conventional retrieve-then-answer paradigm often suffers from two key limitations: (1) low signal-to-noise ratio in retrieved evidence, where useful information is buried under irrelevant content, and (2) error accumulation in multi-hop reasoning when incomplete or noisy passages are involved. To address these challenges, we present EviNote-RAG, an agentic RAG framework that introduces a structured retrieve-note-answer pipeline. Instead of directly reasoning over raw retrievals, the model is trained to compose Supportive-Evidence Notes (SENs), concise, human-like notes that preserve only answer-relevant information, highlight uncertainty, and explicitly state when no useful evidence exists. This distillation process is further reinforced by the Evidence Quality Reward (EQR), an entailment-based signal that evaluates whether SENs logically support the final answer. Together, SENs and EQR guide the model toward faithful and robust reasoning, while reducing the impact of noise. Experiments on in-domain and out-of-domain QA benchmarks show that EviNote-RAG consistently outperforms strong baselines in accuracy, generalization, and training stability. In particular, it achieves state-of-the-art results while enhancing robustness and efficiency, yielding relative F1 gains of 20% on HotpotQA (+0.093), 40% on Bamboogle (+0.151), and 91% on 2Wiki (+0.256) via denser rewards and reduced verbosity.

[48] SeLeRoSa: Sentence-Level Romanian Satire Detection Dataset

Răzvan-Alexandru Smădu,Andreea Iuga,Dumitru-Clementin Cercel,Florin Pop

Main category: cs.CL

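TL;DR: This paper introduces SeLeRoSa, a dataset for sentence-level satire detection in Romanian news articles, and evaluates how current models perform on the task.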

Details

Motivation: Satire, irony, and sarcasm are commonly used for humor and critique, but they are sometimes mistaken for factual reporting, much like fake news. A dedicated dataset is therefore needed to improve satire detection. Method: The paper introduces SeLeRoSa, the first sentence-level dataset for Romanian satire detection, and evaluates LLM-based and transformer-based baseline models in zero-shot and fine-tuning settings. Result: The findings show the limitations of current models on the sentence-level satire detection task. Conclusion: The introduction of SeLeRoSa reveals the limitations of current models in sentence-level satire detection and paves the way for new research directions. Abstract: Satire, irony, and sarcasm are techniques typically used to express humor and critique, rather than deceive; however, they can occasionally be mistaken for factual reporting, akin to fake news. These techniques can be applied at a more granular level, allowing satirical information to be incorporated into news articles. In this paper, we introduce the first sentence-level dataset for Romanian satire detection for news articles, called SeLeRoSa. The dataset comprises 13,873 manually annotated sentences spanning various domains, including social issues, IT, science, and movies. With the rise and recent progress of large language models (LLMs) in the natural language processing literature, LLMs have demonstrated enhanced capabilities to tackle various tasks in zero-shot settings. We evaluate multiple baseline models based on LLMs in both zero-shot and fine-tuning settings, as well as baseline transformer-based models. Our findings reveal the current limitations of these models in the sentence-level satire detection task, paving the way for new research directions.

[49] Supervised In-Context Fine-Tuning for Generative Sequence Labeling

David Dukić,Goran Glavaš,Jan Šnajder

Main category: cs.CL

TL;DR: This paper introduces SIFT, a method that improves sequence labeling tasks by combining in-context learning and supervised fine-tuning for causal LLMs, demonstrating superior performance over existing approaches.

Details

Motivation: The motivation stems from the limitations of encoder-only models and the potential of causal LLMs to outperform them due to their scalability. The focus is on exploring a more natural generative setting for sequence labeling tasks. Method: The study proposes supervised in-context fine-tuning (SIFT), which frames sequence labeling tasks as constrained response generation tasks suited for causal LLMs. This combines in-context learning from demonstrations with supervised fine-tuning. Result: SIFT considerably outperforms both in-context learning (ICL) and decoder-as-encoder fine-tuning baselines on standard sequence labeling tasks. The study also finds that removing instructions mitigates performance issues caused by long contexts. Conclusion: The study concludes that SIFT, a method combining in-context learning and supervised fine-tuning, enhances the performance of sequence labeling tasks using causal LLMs, highlighting the importance of a generative task formulation for effective performance. Abstract: Sequence labeling (SL) tasks, where labels are assigned to tokens, are abundant in NLP (e.g., named entity recognition and aspect-based sentiment analysis). Owing to the intuition that they require bidirectional context, SL tasks are commonly tackled with encoder-only models. Recent work also shows that removing the causal mask in fine-tuning enables decoder-based LLMs to become effective token classifiers. Less work, however, focused on (supervised) generative SL, a more natural setting for causal LLMs. Due to their rapid scaling, causal LLMs applied to SL are expected to outperform encoders, whose own development has stagnated. In this work, we propose supervised in-context fine-tuning (SIFT) for generative SL. SIFT casts SL tasks as constrained response generation, natural to LLMs, combining (1) in-context learning (ICL) from demonstrations with (2) supervised fine-tuning. 
SIFT considerably outperforms both ICL and decoder-as-encoder fine-tuning baselines on a range of standard SL tasks. We further find that although long context hinders the performance of generative SL in both ICL and SIFT, this deficiency can be mitigated by removing the instruction, as instructions are shown to be largely unnecessary for achieving strong SL performance with SIFT. Our findings highlight strengths and limitations of SL with LLMs, underscoring the importance of a response-based generative task formulation for effective SL performance.

[50] MedCOD: Enhancing English-to-Spanish Medical Translation of Large Language Models Using Enriched Chain-of-Dictionary Framework

Md Shahidul Salim,Lian Fu,Arav Adikesh Ramakrishnan,Zonghai Yao,Hong Yu

Main category: cs.CL

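TL;DR: MedCOD is a hybrid framework that integrates domain-specific knowledge into large language models; through structured prompting and fine-tuning, it significantly improves the quality of English-to-Spanish medical translation.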

Details

Motivation: To address the lack of domain-specific knowledge in medical translation and to improve LLM performance on medical translation tasks. Method: MedCOD combines UMLS with LLM-KB, uses structured prompting and LoRA fine-tuning, and builds an English-Spanish parallel corpus and test set. Result: MedCOD significantly improves translation quality across all models; Phi-4 surpasses strong baselines such as GPT-4o and GPT-4o-mini on BLEU, chrF++, and COMET. Conclusion: MedCOD improves the quality of medical translation, demonstrating the potential of structured knowledge integration to enhance LLMs for medical translation tasks. Abstract: We present MedCOD (Medical Chain-of-Dictionary), a hybrid framework designed to improve English-to-Spanish medical translation by integrating domain-specific structured knowledge into large language models (LLMs). MedCOD integrates domain-specific knowledge from both the Unified Medical Language System (UMLS) and the LLM-as-Knowledge-Base (LLM-KB) paradigm to enhance structured prompting and fine-tuning. We constructed a parallel corpus of 2,999 English-Spanish MedlinePlus articles and a 100-sentence test set annotated with structured medical contexts. Four open-source LLMs (Phi-4, Qwen2.5-14B, Qwen2.5-7B, and LLaMA-3.1-8B) were evaluated using structured prompts that incorporated multilingual variants, medical synonyms, and UMLS-derived definitions, combined with LoRA-based fine-tuning. Experimental results demonstrate that MedCOD significantly improves translation quality across all models. For example, Phi-4 with MedCOD and fine-tuning achieved BLEU 44.23, chrF++ 28.91, and COMET 0.863, surpassing strong baseline models like GPT-4o and GPT-4o-mini. Ablation studies confirm that both MedCOD prompting and model adaptation independently contribute to performance gains, with their combination yielding the highest improvements. These findings highlight the potential of structured knowledge integration to enhance LLMs for medical translation tasks.

[51] Structure and Destructure: Dual Forces in the Making of Knowledge Engines

Yihong Chen

Main category: cs.CL

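TL;DR: This thesis examines the connections between the structured and unstructured paradigms in natural language processing and finds that the complementary roles of structure and destructure help in developing more general knowledge engines.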

Details

Motivation: The making of knowledge engines in natural language processing has been shaped by two distinct paradigms; the thesis aims to establish connections between them in order to improve model plasticity and generalization. Method: By analyzing the two main paradigms in natural language processing (structured and unstructured), the thesis explores the roles of structure and destructure and establishes conceptual connections between them. Result: The thesis finds that structure organizes seen symbolic interactions, while destructure, through periodic embedding resets, improves model plasticity and generalization to unseen scenarios. Conclusion: The two complementary forces of structure and destructure can play a key role in developing general knowledge engines, supporting transparent, controllable, and adaptable intelligent systems. Abstract: The making of knowledge engines in natural language processing has been shaped by two seemingly distinct paradigms: one grounded in structure, the other driven by massively available unstructured data. The structured paradigm leverages predefined symbolic interactions, such as knowledge graphs, as priors and designs models to capture them. In contrast, the unstructured paradigm centers on scaling transformer architectures with increasingly vast data and model sizes, as seen in modern large language models. Despite their divergence, this thesis seeks to establish conceptual connections bridging these paradigms. Two complementary forces, structure and destructure, emerge across both paradigms: structure organizes seen symbolic interactions, while destructure, through periodic embedding resets, improves model plasticity and generalization to unseen scenarios. These connections form a new recipe for developing general knowledge engines that can support transparent, controllable, and adaptable intelligent systems.

[52] RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

Chia-Hsuan Hsu,Jun-En Ding,Hsin-Ling Hsu,Feng Liu,Fang-Ming Hung

Main category: cs.CL

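TL;DR: This paper proposes RPRO, which significantly improves the accuracy and reliability of reasoning chains in medical question answering through preference optimization and quality-driven reasoning refinement.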

Details

Motivation: The reasoning chains that existing large language models produce in medical question answering lack factual accuracy and clinical reliability and need improvement. Method: The RPRO framework combines reinforcement learning with preference-driven reasoning refinement, and introduces a groupwise ranking optimization based on the Bradley-Terry model together with KL-divergence regularization. Result: In experiments on PubMedQA and MedQA-USMLE, RPRO significantly outperforms baselines, including medical-specialized models, and a small 1.1B-parameter model outperforms larger 7B-13B models. Conclusion: RPRO offers a scalable and effective approach to building more reliable, clinically grounded medical LLMs. Abstract: Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that uniquely combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO differentiates itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley-Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA and MedQA-USMLE show consistent improvements over strong baselines. Remarkably, our 1.1B parameter model outperforms much larger 7B-13B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement offers a scalable and effective approach to building more reliable, clinically grounded medical LLMs.
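
The abstract names a groupwise ranking objective based on the Bradley-Terry model but gives no formula. One common way to extend Bradley-Terry from pairs to ranked groups is the Plackett-Luce likelihood, and the sketch below assumes that form; it is not taken from the paper. The function name `plackett_luce_nll` and the toy scores are hypothetical, and the KL-divergence regularizer mentioned in the abstract is left out.

```python
import math
from typing import List


def plackett_luce_nll(scores_in_rank_order: List[float]) -> float:
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    scores_in_rank_order[0] is the model score of the best-ranked reasoning
    chain, scores_in_rank_order[-1] the worst. With exactly two items this
    reduces to the pairwise Bradley-Terry loss -log(sigmoid(s_best - s_worst)).
    """
    nll = 0.0
    for k in range(len(scores_in_rank_order) - 1):
        tail = scores_in_rank_order[k:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(s - m) for s in tail))
        nll -= scores_in_rank_order[k] - log_denom
    return nll


# Scores that agree with the ranking give a lower loss than reversed scores.
loss_good = plackett_luce_nll([2.0, 1.0, 0.0])
loss_bad = plackett_luce_nll([0.0, 1.0, 2.0])
loss_pair = plackett_luce_nll([1.0, 0.0])
```

Minimizing this quantity over a group of candidate reasoning chains pushes the scores toward the given ranking in a single term, instead of decomposing the group into independent pairs.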

[53] Performance Analysis of Supervised Machine Learning Algorithms for Text Classification

Sadia Zaman Mishu,S M Rafiuddin

Main category: cs.CL

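TL;DR: This paper studies the application of supervised machine learning techniques to text classification; using labeled documents and methods such as artificial neural networks, it finds that certain models achieve good classification accuracy.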

Details

Motivation: Demand for text classification has grown significantly in web search, data mining, web ranking, recommendation systems, and other areas of information technology. Method: Text classification is performed with supervised machine learning techniques and an Artificial Neural Network (ANN) model using a Back Propagation Network (BPN). Result: Experimental analysis shows that, when labeled documents are used for classification, certain models achieve good classification accuracy. Conclusion: Experimental analysis indicates that certain models perform well in classification accuracy for text classification. Abstract: The demand for text classification is growing significantly in web searching, data mining, web ranking, recommendation systems, and so many other fields of information and technology. This paper illustrates the text classification process on different datasets using some standard supervised machine learning techniques. Text documents can be classified through various kinds of classifiers. Labeled text documents are used to classify the text in supervised classifications. This paper applies these classifiers on different kinds of labeled documents and measures the accuracy of the classifiers. An Artificial Neural Network (ANN) model using Back Propagation Network (BPN) is used with several other models to create an independent platform for labeled and supervised text classification process. An existing benchmark approach is used to analyze the performance of classification using labeled documents. Experimental analysis on real data reveals which model works well in terms of classification accuracy.

[54] Ranking of Bangla Word Graph using Graph-based Ranking Algorithms

S M Rafiuddin

Main category: cs.CL

TL;DR: This paper explores the use of graph-based ranking algorithms to rank Bangla words using a word graph, achieving measurable accuracy with real data.

Details

Motivation: The motivation is to address the lack of a standard Bangla word database and to determine the relative importance of Bangla words using graph-based ranking methods. Method: The research uses a word graph to represent Bangla words and applies various graph-based ranking algorithms. The Indian Language POS-tag Corpora is used for data, and preprocessing steps are applied before using the algorithms. Result: Experimental results show the accuracy of each ranking algorithm in terms of F1 measure when applied to real data. Conclusion: The research concludes that graph-based ranking algorithms can effectively rank Bangla words, and the accuracy of each algorithm is evaluated using the F1 measure. Abstract: Ranking words is an important way to summarize a text or to retrieve information. A word graph is a way to represent the words of a sentence or a text as the vertices of a graph and to show the relationship among the words. It is also useful to determine the relative importance of a word among the words in the word graph. In this research, the ranking of Bangla words is calculated, representing Bangla words from a text in a word graph using various graph-based ranking algorithms. There is a lack of a standard Bangla word database. In this research, the Indian Language POS-tag Corpora is used, which has a rich collection of Bangla words in the form of sentences with their parts of speech tags. For applying a word graph to various graph-based ranking algorithms, several standard procedures are applied. The preprocessing steps are done in every word graph and then applied to graph-based ranking algorithms to make a comparison among these algorithms. This paper illustrates the entire procedure of calculating the ranking of Bangla words, including the construction of the word graph from text. Experimental result analysis on real data reveals the accuracy of each ranking algorithm in terms of F1 measure.
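
The abstract does not say which graph-based ranking algorithms are compared or how edges are defined. As a minimal sketch of the pipeline it describes, the code below builds an undirected co-occurrence word graph and ranks its vertices with PageRank, which is one widely used algorithm of this family. The adjacency window, the damping factor, and the placeholder tokens are assumptions for illustration, not details from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_word_graph(sentences: List[List[str]], window: int = 2) -> Dict[str, Set[str]]:
    """Undirected graph linking words that co-occur within `window` positions.

    With window=2 only directly adjacent words are linked.
    """
    graph: Dict[str, Set[str]] = defaultdict(set)
    for words in sentences:
        for i, w in enumerate(words):
            for j in range(i + 1, min(i + window, len(words))):
                if words[j] != w:
                    graph[w].add(words[j])
                    graph[words[j]].add(w)
    return graph


def pagerank(graph: Dict[str, Set[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank on an undirected graph without isolated nodes."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbour m shares its rank equally among its own neighbours.
            incoming = sum(rank[m] / len(graph[m]) for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Placeholder tokens stand in for preprocessed Bangla words.
sentences = [["a", "b"], ["a", "c"], ["a", "d"]]
graph = build_word_graph(sentences)
rank = pagerank(graph)
top_word = max(rank, key=rank.get)
```

In this toy graph the token that co-occurs with every other token receives the highest score; comparing several ranking algorithms, as the paper does, would amount to swapping out `pagerank` while keeping the same graph construction.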

[55] We Politely Insist: Your LLM Must Learn the Persian Art of Taarof

Nikta Gohari Sadr,Sahar Heidariasl,Karine Megerdoomian,Laleh Seyyed-Kalantari,Ali Emami

Main category: cs.CL

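TL;DR: The paper introduces the TaarofBench benchmark and finds that large language models fall significantly short in understanding Iran's culture-specific politeness norms; fine-tuning and preference optimization improve the models' cultural alignment.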

Details

Motivation: Large language models (LLMs) struggle to understand culture-specific communication norms, which limits their effectiveness in global contexts. The paper focuses on Persian taarof, a system of politeness in Iranian social norms, and aims to fill a gap in existing cultural benchmarks. Method: The authors introduce TaarofBench, a new benchmark comprising 450 role-play scenarios covering 12 common social interaction topics, and improve models' cultural alignment through supervised fine-tuning and Direct Preference Optimization. Result: In an evaluation of five frontier LLMs, accuracy is 40-48% below that of native speakers when taarof is culturally appropriate. In addition, responses rated "polite" by standard metrics often violate taarof norms, revealing the limitations of Western politeness frameworks. Conclusion: Supervised fine-tuning and Direct Preference Optimization improve model alignment with cultural expectations by 21.8% and 42.3%, respectively. The study lays the foundation for developing diverse and culturally aware LLMs that can better handle complex social interactions. Abstract: Large language models (LLMs) struggle to navigate culturally specific communication norms, limiting their effectiveness in global contexts. We focus on Persian taarof, a social norm in Iranian interactions, which is a sophisticated system of ritual politeness that emphasizes deference, modesty, and indirectness, yet remains absent from existing cultural benchmarks. We introduce TaarofBench, the first benchmark for evaluating LLM understanding of taarof, comprising 450 role-play scenarios covering 12 common social interaction topics, validated by native speakers. Our evaluation of five frontier LLMs reveals substantial gaps in cultural competence, with accuracy rates 40-48% below native speakers when taarof is culturally appropriate. Performance varies between interaction topics, improves with Persian-language prompts, and exhibits gender-based asymmetries. We also show that responses rated "polite" by standard metrics often violate taarof norms, indicating the limitations of Western politeness frameworks. Through supervised fine-tuning and Direct Preference Optimization, we achieve 21.8% and 42.3% improvement in model alignment with cultural expectations. Our human study with 33 participants (11 native Persian, 11 heritage, and 11 non-Iranian speakers) forms baselines in varying degrees of familiarity with Persian norms. This work lays the foundation for developing diverse and culturally aware LLMs, enabling applications that better navigate complex social interactions.

[56] A Dynamic Fusion Model for Consistent Crisis Response

Xiaoying Song,Anirban Saha Anik,Eduardo Blanco,Vanessa Frias-Martinez,Lingzi Hong

Main category: cs.CL

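TL;DR: The paper proposes a new method and evaluation metric for improving the stylistic consistency of automated responses in crisis communication, thereby strengthening trust and improving response quality.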

Details

Motivation: In crisis communication, affected people's trust in responders can be critically influenced by the consistency of response style, yet little research has explored methods for keeping generated responses stylistically consistent. Method: The paper uses a two-stage process: it first assesses the style of candidate responses, then optimizes and integrates them at the instance level through a fusion process. Result: Experiments show that the method significantly reduces stylistic variation across multiple datasets and outperforms baselines in both response quality and stylistic consistency. Conclusion: The paper proposes a new style-consistency metric and a fusion-based generation approach grounded in it to address inconsistent response style in language-model-driven crisis communication; experiments show the approach outperforms baselines in both response quality and stylistic consistency. Abstract: In response to the urgent need for effective communication with crisis-affected populations, automated responses driven by language models have been proposed to assist in crisis communications. A critical yet often overlooked factor is the consistency of response style, which could affect the trust of affected individuals in responders. Despite its importance, few studies have explored methods for maintaining stylistic consistency across generated responses. To address this gap, we propose a novel metric for evaluating style consistency and introduce a fusion-based generation approach grounded in this metric. Our method employs a two-stage process: it first assesses the style of candidate responses and then optimizes and integrates them at the instance level through a fusion process. This enables the generation of high-quality responses while significantly reducing stylistic variation between instances. Experimental results across multiple datasets demonstrate that our approach consistently outperforms baselines in both response quality and stylistic uniformity.

[57] Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL

Xiaoying Song,Anirban Saha Anik,Dibakar Barua,Pengcheng Luo,Junhua Ding,Lingzi Hong

Main category: cs.CL

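TL;DR: The study proposes the Controlled-Literacy framework, which combines retrieval-augmented generation and reinforcement learning to generate tailored counterspeech adapted to different health literacy levels, enabling more equitable and effective public health communication.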

Details

Motivation: Existing methods for automatically generating counterspeech to misinformation often produce uniform responses, ignoring how the audience's health literacy level affects the accessibility and effectiveness of counterspeech. Method: The paper proposes the Controlled-Literacy framework, which combines retrieval-augmented generation with reinforcement learning and generates tailored counterspeech by retrieving knowledge aligned with specific health literacy levels and by designing a reward function that includes user-preference and readability rewards. Result: Experiments show that Controlled-Literacy outperforms baselines in generating more accessible and user-preferred counterspeech. Conclusion: The Controlled-Literacy framework contributes to more equitable and effective public health communication by improving the accessibility and comprehension of counterspeech to health misinformation. Abstract: Health misinformation spreading online poses a significant threat to public health. Researchers have explored methods for automatically generating counterspeech to health misinformation as a mitigation strategy. Existing approaches often produce uniform responses, ignoring that the health literacy level of the audience could affect the accessibility and effectiveness of counterspeech. We propose a Controlled-Literacy framework using retrieval-augmented generation (RAG) with reinforcement learning (RL) to generate tailored counterspeech adapted to different health literacy levels. In particular, we retrieve knowledge aligned with specific health literacy levels, enabling accessible and factual information to support generation. We design a reward function incorporating subjective user preferences and objective readability-based rewards to optimize counterspeech to the target health literacy level. Experiment results show that Controlled-Literacy outperforms baselines by generating more accessible and user-preferred counterspeech. This research contributes to more equitable and impactful public health communication by improving the accessibility and comprehension of counterspeech to health misinformation.

[58] Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation

Abdessalam Bouchekif,Samer Rashwani,Heba Sbahi,Shahd Gaben,Mutez Al-Khatib,Mohammed Ghaly

Main category: cs.CL

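TL;DR: The paper evaluates the knowledge and reasoning capabilities of large language models in Islamic inheritance law; results show significant performance differences across models.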

Details

Motivation: To explore the reasoning and domain-adaptation capabilities of large language models in the complex legal domain of Islamic inheritance law. Method: Seven large language models are tested on a benchmark of 1,000 multiple-choice questions that assess their understanding of inheritance scenarios and their application of legal rules. Result: o3 and Gemini 2.5 exceed 90% accuracy, whereas ALLaM, Fanar, LLaMA, and Mistral score below 50%. Conclusion: The models show significant limitations in structured legal reasoning and need further improvement to perform well on Islamic legal reasoning. Abstract: This paper evaluates the knowledge and reasoning capabilities of Large Language Models in Islamic inheritance law, known as 'ilm al-mawarith. We assess the performance of seven LLMs using a benchmark of 1,000 multiple-choice questions covering diverse inheritance scenarios, designed to test models' ability to understand the inheritance context and compute the distribution of shares prescribed by Islamic jurisprudence. The results reveal a significant performance gap: o3 and Gemini 2.5 achieved accuracies above 90%, whereas ALLaM, Fanar, LLaMA, and Mistral scored below 50%. These disparities reflect important differences in reasoning ability and domain adaptation. We conduct a detailed error analysis to identify recurring failure patterns across models, including misunderstandings of inheritance scenarios, incorrect application of legal rules, and insufficient domain knowledge. Our findings highlight limitations in handling structured legal reasoning and suggest directions for improving performance in Islamic legal reasoning. Code: https://github.com/bouchekif/inheritance_evaluation

[59] A Paradigm Gap in Urdu

Farah Adeeba,Rajesh Bhatt

Main category: cs.CL

TL;DR: The paper explores why the perfective form of a specific Urdu verb construction is now ungrammatical, attributing it to a historical morphosyntactic conflict.

Details

Motivation: The authors aim to explain the grammatical unacceptability of the perfective form of the -ya: kar construction in modern Urdu and Hindi, which was common in the 19th century. Method: The authors used historical text analysis, a large-scale corpus study, and subjective evaluation tasks with native speakers to investigate the diachronic shift. Result: The study confirmed the absence of perfective forms in modern usage and demonstrated native speakers' judgment of these forms as unnatural. Conclusion: The ungrammaticality of the perfective form of the -ya: kar construction in modern Urdu and Hindi stems from a morphosyntactic conflict, which has become entrenched over time. Abstract: In this paper, we document a paradigm gap in the combinatorial possibilities of verbs and aspect in Urdu: the perfective form of the -ya: kar construction (e.g. ro-ya: ki: cry-Pfv do.Pfv) is sharply ungrammatical in modern Urdu and Hindi, despite being freely attested in 19th century literature. We investigate this diachronic shift through historical text analysis, a large-scale corpus study, which confirms the stark absence of perfective forms, and subjective evaluation tasks with native speakers, who judge perfective examples as highly unnatural. We argue that this gap arose from a fundamental morphosyntactic conflict: the construction's requirement for a nominative subject and an invariant participle clashes with the core grammatical rule that transitive perfectives assign ergative case. This conflict rendered the perfective form unstable, and its functional replacement by other constructions allowed the gap to become entrenched in the modern grammar.

[60] Privacy-Preserving Reasoning with Knowledge-Distilled Parametric Retrieval Augmented Generation

Jinwen Chen,Hainan Zhang,Liang Pang,Yongxin Tong,Haibo Zhou,Yuan Zhan,Wei Lin,Zhiming Zheng

Main category: cs.CL

TL;DR: This paper proposes DistilledPRAG, an improved parametric RAG system that maintains privacy, boosts efficiency, and generalizes well for out-of-distribution inputs.

Details

Motivation: The current RAG system risks data privacy by uploading plaintext documents to the cloud, and PRAG faces challenges in efficiency and generalization. Method: DistilledPRAG synthesizes QA pairs, masks plaintext documents, translates them into LoRA via a parameter generator, and aligns with standard RAG through hidden state and output logit matching. Result: DistilledPRAG outperforms baselines in accuracy and generalizes well on OOD data across four QA datasets. Conclusion: DistilledPRAG successfully addresses the limitations of PRAG by improving efficiency and generalization, while preserving data privacy. Abstract: The current RAG system requires uploading plaintext documents to the cloud, risking private data leakage. Parametric RAG (PRAG) addresses this by encoding documents as LoRA within LLMs, enabling reasoning without exposing raw content. However, it still faces two issues: (1) PRAG demands synthesizing QA pairs and fine-tuning the LLM for each individual document to create its corresponding LoRA, leading to unacceptable inference latency. (2) The performance of PRAG relies solely on synthetic QA data, lacking internal alignment with standard RAG, resulting in poor generalization on out-of-distribution (OOD) inputs. Therefore, achieving high-efficiency parameterization while maintaining RAG-level performance remains a critical challenge for privacy-preserving reasoning. In this paper, we propose DistilledPRAG, a generalizable knowledge-distilled parametric RAG model aligned with standard RAG in document structure and parameter activation. We first synthesize QA pairs from single and multi-documents to enhance cross-document reasoning. Then, we mask the plaintext documents with a special token and translate them to LoRA via a parameter generator, maintaining the standard RAG document structure.
Finally, guided by synthetic QA data, we train the parameter generator to match standard RAG's hidden states and output logits, enabling RAG-style reasoning without original documents. Experiments on four QA datasets show that DistilledPRAG outperforms baselines in accuracy and generalizes well on OOD data.
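The alignment step described in this abstract (matching standard RAG's hidden states and output logits) can be sketched as a two-term loss. This is a minimal NumPy illustration under stated assumptions, not the authors' objective: the function name `alignment_loss`, the weighting `alpha`, and the choice of mean squared error plus KL divergence are all hypothetical, since the abstract does not specify the exact loss.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def alignment_loss(student_hidden, teacher_hidden,
                   student_logits, teacher_logits, alpha=0.5):
    """Combine a hidden-state term and a logit term (assumed form).

    teacher_* come from standard RAG with the plaintext documents;
    student_* come from the model that only sees the generated LoRA.
    """
    # Hidden-state matching: mean squared error (an assumption).
    hidden_term = np.mean((student_hidden - teacher_hidden) ** 2)

    # Logit matching: KL(teacher || student), averaged over positions.
    p = softmax(teacher_logits)
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits) + 1e-12)
    kl_term = np.mean(np.sum(p * (log_p - log_q), axis=-1))

    return alpha * hidden_term + (1.0 - alpha) * kl_term
```

When the student reproduces the teacher exactly, both terms vanish and the loss is zero; any mismatch in hidden states or output distribution makes it positive.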

[61] REFRAG: Rethinking RAG based Decoding

Xiaoqiang Lin,Aritra Ghosh,Bryan Kian Hsiang Low,Anshumali Shrivastava,Vijai Mohan

Main category: cs.CL

TL;DR: This paper introduces REFRAG, a framework that improves latency in RAG applications by efficiently compressing, sensing, and expanding context inputs, achieving faster processing times without sacrificing accuracy.

Details

Motivation: Processing long-context inputs in Large Language Models (LLMs) introduces significant system latency and demands substantial memory. In retrieval-augmented generation (RAG), much of the LLM context consists of concatenated passages from retrieval, leading to inefficient computations. Method: REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications, was proposed and validated across diverse long-context tasks. Result: The proposed REFRAG framework demonstrates a 30.85× time-to-first-token acceleration (a 3.75× improvement over previous work) without loss in perplexity and extends the context size of LLMs by 16×. Conclusion: REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG requires specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance.
To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85× time-to-first-token acceleration (a 3.75× improvement over previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by 16×. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.

[62] Natural Context Drift Undermines the Natural Language Understanding of Large Language Models

Yulong Wu,Viktor Schlegel,Riza Batista-Navarro

Main category: cs.CL

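TL;DR: The study shows that the natural evolution of text degrades the question-answering performance of large language models, even when the question and the necessary information are unchanged.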

Details

Motivation: To study how naturally evolving context paragraphs affect the question-answering performance of generative large language models. Method: Proposes a framework for curating naturally evolved, human-edited variants of reading passages from contemporary QA benchmarks and for analyzing LLM performance across a range of semantic similarity scores. Result: Experiments show that LLM performance declines as reading passages naturally diverge from the versions encountered during pretraining, even when the question and all necessary information remain present at inference time. Conclusion: Natural text evolution poses a significant challenge to the language understanding capabilities of LLMs. Abstract: How does the natural evolution of context paragraphs affect question answering in generative Large Language Models (LLMs)? To investigate this, we propose a framework for curating naturally evolved, human-edited variants of reading passages from contemporary QA benchmarks and for analyzing LLM performance across a range of semantic similarity scores, which quantify how closely each variant aligns with content seen during pretraining. Using this framework, we evaluate six QA datasets and eight LLMs with publicly available training data. Our experiments reveal that LLM performance declines as reading passages naturally diverge from the versions encountered during pretraining, even when the question and all necessary information remain present at inference time. For instance, average model accuracy on BoolQ drops by over 30% from the highest to lowest similarity bins, with slopes exceeding 70 across several LLMs. These findings suggest that natural text evolution poses a significant challenge to the language understanding capabilities of LLMs.

[63] Dream-Coder 7B: An Open Diffusion Language Model for Code

Zhihui Xie,Jiacheng Ye,Lin Zheng,Jiahui Gao,Jingwei Dong,Zirui Wu,Xueliang Zhao,Shansan Gong,Xin Jiang,Zhenguo Li,Lingpeng Kong

Main category: cs.CL

TL;DR: Dream-Coder 7B is an open-source discrete diffusion language model for code generation that adaptively determines its decoding strategy based on the coding task, achieving competitive performance on various benchmarks.

Details

Motivation: The motivation is to develop a code generation model that can adapt its decoding strategy based on the coding task, thereby improving sample efficiency and generation stability. Method: The method involves adapting a pretrained AR checkpoint to a discrete diffusion framework with a continuous-time weighted cross-entropy objective, followed by a post-training recipe involving supervised fine-tuning and reinforcement learning. Result: The result is the creation of Dream-Coder 7B and Dream-Coder 7B-Instruct, which show emergent any-order generation capabilities and attain competitive performance on various benchmarks. Conclusion: Dream-Coder 7B-Instruct attains 21.4% pass@1 on LiveCodeBench and demonstrates competitive performance on other benchmarks. Abstract: We present Dream-Coder 7B, an open-source discrete diffusion language model for code generation that exhibits emergent any-order generation capabilities. Unlike traditional autoregressive (AR) models that decode strictly left-to-right, Dream-Coder 7B adaptively determines its decoding strategy based on the coding task: sketch-first generation for complex algorithms, left-to-right generation for straightforward completions, and interleaved reasoning generation for code understanding tasks. We adapt a pretrained AR checkpoint to a discrete diffusion framework with a continuous-time weighted cross-entropy objective. Our post-training recipe comprises (i) supervised fine-tuning, where we mitigate padding pathologies via random truncation and a padding penalty to improve sample efficiency and stabilize generation; and (ii) reinforcement learning with verifiable rewards over a curated high-quality prompt set drawn from open-source datasets, using a tailored reinforcement learning recipe for diffusion language models. The resulting Dream-Coder 7B Instruct attains 21.4% pass@1 on LiveCodeBench (2410-2505) and demonstrates competitive performance on HumanEval, MBPP, BigCodeBench, and CRUXEval.
We release Dream-Coder-7B and Dream-Coder-7B-Instruct checkpoints, training recipes, preprocessing pipelines, and inference code to facilitate reproducibility and further research.
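For readers unfamiliar with the pass@1 metric reported here, the following sketch shows the unbiased pass@k estimator that was introduced alongside HumanEval. The abstract does not state how its pass@1 figure was computed, so treating this estimator as the one used is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: budget of samples the metric is defined over
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples of which 3 pass, pass@1 reduces to the pass fraction c / n.
score = pass_at_k(n=10, c=3, k=1)
```

A benchmark-level score is the mean of this quantity over all problems.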

[64] Zero-shot Cross-lingual NER via Mitigating Language Difference: An Entity-aligned Translation Perspective

Zhihao Zhang,Sophia Yat Mei Lee,Dong Zhang,Shoushan Li,Guodong Zhou

Main category: cs.CL

TL;DR: This paper proposes an entity-aligned translation approach to enhance cross-lingual named entity recognition for non-Latin script languages using large language models and dual-translation.

Details

Motivation: Existing zero-shot CL-NER approaches mainly focus on Latin script languages, leading to degraded performance for non-Latin script languages due to structural differences. Method: An entity-aligned translation (EAT) approach using a dual-translation strategy and fine-tuning large language models (LLMs) with multilingual Wikipedia data. Result: Improved performance in cross-lingual named entity recognition for non-Latin script languages like Chinese and Japanese through better entity alignment. Conclusion: The proposed EAT approach effectively enhances cross-lingual named entity recognition for non-Latin script languages by leveraging large language models and dual-translation strategy. Abstract: Cross-lingual Named Entity Recognition (CL-NER) aims to transfer knowledge from high-resource languages to low-resource languages. However, existing zero-shot CL-NER (ZCL-NER) approaches primarily focus on Latin script language (LSL), where shared linguistic features facilitate effective knowledge transfer. In contrast, for non-Latin script language (NSL), such as Chinese and Japanese, performance often degrades due to deep structural differences. To address these challenges, we propose an entity-aligned translation (EAT) approach. Leveraging large language models (LLMs), EAT employs a dual-translation strategy to align entities between NSL and English. In addition, we fine-tune LLMs using multilingual Wikipedia data to enhance the entity alignment from source to target languages.

[65] Joint Information Extraction Across Classical and Modern Chinese with Tea-MOELoRA

Xuemei Tang,Chengxi Yan,Jinghang Gu,Chu-Ren Huang

Main category: cs.CL

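TL;DR: This paper proposes Tea-MOELoRA, a parameter-efficient multi-task framework for Chinese information extraction that combines LoRA with a Mixture-of-Experts (MoE) design and dynamically allocates expert contributions to improve performance across tasks and eras.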

Details

Motivation: Fine-tuning a single model on heterogeneous tasks and across different eras may cause interference and reduced performance, so an efficient multi-task framework is needed. Method: Proposes the Tea-MOELoRA framework, which combines LoRA with an MoE design and dynamically allocates expert contributions through a task-era-aware routing mechanism. Result: Experiments show that Tea-MOELoRA outperforms both single-task and joint LoRA baselines. Conclusion: Tea-MOELoRA effectively combines LoRA with an MoE design and improves Chinese information extraction through a multi-task framework, in particular by leveraging knowledge effectively across tasks and eras. Abstract: Chinese information extraction (IE) involves multiple tasks across diverse temporal domains, including Classical and Modern documents. Fine-tuning a single model on heterogeneous tasks and across different eras may lead to interference and reduced performance. Therefore, in this paper, we propose Tea-MOELoRA, a parameter-efficient multi-task framework that combines LoRA with a Mixture-of-Experts (MoE) design. Multiple low-rank LoRA experts specialize in different IE tasks and eras, while a task-era-aware router mechanism dynamically allocates expert contributions. Experiments show that Tea-MOELoRA outperforms both single-task and joint LoRA baselines, demonstrating its ability to leverage task and temporal knowledge effectively.
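The combination of several low-rank LoRA experts with a task-era-aware router can be pictured with a small NumPy sketch. Everything below is an assumption made for illustration: the abstract gives no equations, so the softmax gating, the concatenation of task and era embeddings as router input, and all names (`moe_lora_forward`, `router_W`, and so on) are hypothetical.

```python
import numpy as np


def moe_lora_forward(x, W, A, B, task_emb, era_emb, router_W):
    """One linear layer with a gated mixture of LoRA experts (assumed form).

    x:        input vector, shape (d_in,)
    W:        frozen base weight, shape (d_out, d_in)
    A, B:     per-expert low-rank factors, shapes (E, r, d_in) and (E, d_out, r)
    task_emb: task embedding; era_emb: era embedding (router inputs)
    router_W: router weight, shape (E, len(task_emb) + len(era_emb))
    """
    # Router: score each expert from the task and era, then softmax-normalize.
    scores = router_W @ np.concatenate([task_emb, era_emb])
    gates = np.exp(scores - scores.max())
    gates = gates / gates.sum()

    # Each expert contributes a low-rank update B_e @ (A_e @ x).
    low = np.einsum("erd,d->er", A, x)        # shape (E, r)
    delta = np.einsum("eor,er->eo", B, low)   # shape (E, d_out)

    # Frozen base output plus the gate-weighted sum of expert updates.
    return W @ x + gates @ delta, gates
```

Because the gates sum to one, the layer falls back to the frozen base output whenever the expert updates are zero, which is the usual LoRA initialization for the `B` factors.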

[66] Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning

Yu Liu,Yanan Cao,Xixun Lin,Yanmin Shang,Shi Wang,Shirui Pan

Main category: cs.CL

TL;DR: SAT is a novel framework for knowledge graph completion that enhances large language models through structure-aware alignment-tuning, leading to significant performance improvements.

Details

Motivation: To address challenges in LLM-enhanced KGC methods, such as inconsistent representation spaces and the need for task-specific instructions. Method: The proposed SAT framework uses hierarchical knowledge alignment and structural instruction tuning to enhance LLMs for KGC. Result: SAT outperforms existing methods on two KGC tasks across four datasets, with link prediction improvements ranging from 8.7% to 29.8%. Conclusion: SAT demonstrates significant improvements over state-of-the-art methods in KGC tasks, particularly in link prediction. Abstract: Knowledge graph completion (KGC) aims to infer new knowledge and make predictions from knowledge graphs. Recently, large language models (LLMs) have exhibited remarkable reasoning capabilities. LLM-enhanced KGC methods primarily focus on designing task-specific instructions, achieving promising advancements. However, there are still two critical challenges. First, existing methods often ignore the inconsistent representation spaces between natural language and graph structures. Second, most approaches design separate instructions for different KGC tasks, leading to duplicated work and time-consuming processes. To address these challenges, we propose SAT, a novel framework that enhances LLMs for KGC via structure-aware alignment-tuning. Specifically, we first introduce hierarchical knowledge alignment to align graph embeddings with the natural language space through multi-task contrastive learning. Then, we propose structural instruction tuning to guide LLMs in performing structure-aware reasoning over KGs, using a unified graph instruction combined with a lightweight knowledge adapter. Experimental results on two KGC tasks across four benchmark datasets demonstrate that SAT significantly outperforms state-of-the-art methods, especially in the link prediction task with improvements ranging from 8.7% to 29.8%.

[67] Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation

Seganrasan Subramanian,Abhigya Verma

Main category: cs.CL

TL;DR: This paper proposes a flexible framework for generating synthetic long-context datasets to enhance LLMs' capabilities in handling extended textual inputs.

Details

Motivation: The absence of high-quality, diverse, and verifiable long-context datasets hinders progress in improving LLMs' ability to process and reason over long textual inputs. Method: A modular, extensible framework for synthetic long-context data generation via prompt-based interaction with LLMs, supporting multiple training and alignment objectives like SFT, DPO, and GRPO. Result: A framework encompassing four core generation paradigms: multi-turn conversational dialogues, document-grounded input-output pairs, verifiable instruction-response tasks, and long-context reasoning examples. Conclusion: The proposed framework facilitates scalable, controllable, and purpose-aligned dataset creation to advance long-context capabilities in LLMs. Abstract: The ability of large language models (LLMs) to process and reason over long textual inputs is critical for a wide range of real-world applications. However, progress in this area is significantly constrained by the absence of high-quality, diverse, and verifiable long-context datasets suitable for both training and evaluation. This work introduces a modular, extensible framework for synthetic long-context data generation via prompt-based interaction with LLMs. The framework supports multiple training and alignment objectives, including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). It encompasses four core generation paradigms: multi-turn conversational dialogues, document-grounded input-output pairs, verifiable instruction-response tasks, and long-context reasoning examples. Through templated prompting, a model-agnostic architecture, and metadata-enriched outputs, the proposed approach facilitates scalable, controllable, and purpose-aligned dataset creation for advancing long-context capabilities in LLMs.

[68] Statutory Construction and Interpretation for Artificial Intelligence

Luxi He,Nimra Nadeem,Michel Liao,Howard Chen,Danqi Chen,Mariano-Florentino Cuéllar,Peter Henderson

Main category: cs.CL

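TL;DR: This paper proposes a computational framework inspired by legal systems that reduces interpretive ambiguity in AI systems through rule refinement and interpretive constraints, thereby improving judgment consistency.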

Details

Motivation: AI systems are increasingly governed by natural language principles, yet the problem of interpretive ambiguity remains underexplored. Unlike legal systems, current AI alignment pipelines lack safeguards comparable to transparent appellate review, so model behavior can be inconsistent or unstable. Method: Drawing on legal theory, the paper analyzes how legal systems constrain ambiguity at the rule creation and rule application stages and proposes two analogous mechanisms: a rule refinement pipeline and prompt-based interpretive constraints. The framework is then evaluated on a 5,000-scenario subset of the WildChat dataset. Result: Both proposed interventions (the rule refinement pipeline and the prompt-based interpretive constraints) significantly improve judgment consistency across a panel of reasonable interpreters on the 5,000 WildChat scenarios. Conclusion: The paper proposes a computational framework grounded in legal theory that uses two mechanisms (a rule refinement pipeline and prompt-based interpretive constraints) to reduce interpretive ambiguity in AI systems and thereby improve judgment consistency. This is an important step toward more robust, law-following AI systems. Abstract: AI systems are increasingly governed by natural language principles, yet a key challenge arising from reliance on language remains underexplored: interpretive ambiguity. As in legal systems, ambiguity arises both from how these principles are written and how they are applied. But while legal systems use institutional safeguards to manage such ambiguity, such as transparent appellate review policing interpretive constraints, AI alignment pipelines offer no comparable protections. Different interpretations of the same rule can lead to inconsistent or unstable model behavior. Drawing on legal theory, we identify key gaps in current alignment pipelines by examining how legal systems constrain ambiguity at both the rule creation and rule application steps. We then propose a computational framework that mirrors two legal mechanisms: (1) a rule refinement pipeline that minimizes interpretive disagreement by revising ambiguous rules (analogous to agency rulemaking or iterative legislative action), and (2) prompt-based interpretive constraints that reduce inconsistency in rule application (analogous to legal canons that guide judicial discretion). We evaluate our framework on a 5,000-scenario subset of the WildChat dataset and show that both interventions significantly improve judgment consistency across a panel of reasonable interpreters. Our approach offers a first step toward systematically managing interpretive ambiguity, an essential step for building more robust, law-following AI systems.

[69] Efficient Large Language Models with Zero-Shot Adjustable Acceleration

Sajjad Kachuee,Mohammad Sharifkhani

Main category: cs.CL

TL;DR: This paper proposes Zero-Shot Adjustable Acceleration, a method that enhances the efficiency of Large Language Models by dynamically adjusting hardware usage during inference, achieving significant speedups without additional fine-tuning.

Details

Motivation: The motivation stems from the challenge of balancing computational efficiency and performance when using Large Language Models (LLMs) in real-world applications. Optimizing acceleration post fine-tuning and during inference is essential for creating an efficient architecture. Method: The method involves a novel training and inference approach called Zero-Shot Adjustable Acceleration, which dynamically adjusts hardware usage during inference. It was applied to newly developed models and tested on multiple classification and text generation tasks. Result: Experimental results showed that the proposed method enables a wide range of acceleration in a zero-shot manner, achieving up to an 11x speedup compared to the baseline. Conclusion: The proposed Zero-Shot Adjustable Acceleration method allows for dynamic adjustment of hardware usage during inference without the need for additional fine-tuning, leading to significant speedups. Abstract: Using Large Language Models (LLMs) in real-world applications presents significant challenges, particularly in balancing computational efficiency and performance. Optimizing acceleration after the fine-tuning phase and during inference is crucial for building an efficient architecture. This paper introduces Zero-Shot Adjustable Acceleration, a novel training and inference method that dynamically adjusts hardware usage during inference without requiring additional fine-tuning. The proposed approach is applied to newly developed models and evaluated across multiple classification and text generation tasks. Experimental results demonstrate that the method enables a wide range of acceleration in a zero-shot manner and achieves up to an 11x speedup compared to the baseline.

[70] SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

Chenyang Le,Bing Han,Jinshun Li,Songyong Chen,Yanmin Qian

Main category: cs.CL

TL;DR: SimulMEGA improves simultaneous speech translation by learning effective read and write decisions without extra overhead, outperforming existing systems in translation quality and latency.

Details

Motivation: Existing SimulST systems struggle to balance translation quality, latency, and semantic coherence, particularly in multilingual many-to-many scenarios. Method: SimulMEGA uses an unsupervised policy learning framework combining prefix-based training with a Mixture-of-Experts refiner to learn read and write decisions implicitly. Result: The SimulMEGA 500M parameter speech-to-text model outperforms the Seamless baseline, achieving under 7 percent BLEU degradation at 1.5 seconds average lag and under 3 percent at 3 seconds. Conclusion: SimulMEGA is an effective framework for simultaneous speech translation that balances translation quality, latency, and semantic coherence without adding inference-time overhead. Abstract: Simultaneous Speech Translation (SimulST) enables real-time cross-lingual communication by jointly optimizing speech recognition and machine translation under strict latency constraints. Existing systems struggle to balance translation quality, latency, and semantic coherence, particularly in multilingual many-to-many scenarios where divergent read and write policies hinder unified strategy learning. In this paper, we present SimulMEGA (Simultaneous Generation by Mixture-of-Experts Gating), an unsupervised policy learning framework that combines prefix-based training with a Mixture-of-Experts refiner to learn effective read and write decisions in an implicit manner, without adding inference-time overhead. Our design requires only minimal modifications to standard transformer architectures and generalizes across both speech-to-text and text-to-speech streaming tasks. Through comprehensive evaluation on six language pairs, our 500M parameter speech-to-text model outperforms the Seamless baseline, achieving under 7 percent BLEU degradation at 1.5 seconds average lag and under 3 percent at 3 seconds. 
We further demonstrate the versatility of SimulMEGA by extending it to streaming TTS with a unidirectional backbone, yielding superior latency-quality tradeoffs.

[71] Mitigating Catastrophic Forgetting in Continual Learning through Model Growth

Ege Süalp,Mina Rezaei

Main category: cs.CL

TL;DR: This paper investigates the use of growth-based pretraining to reduce catastrophic forgetting in large language models. While the approach shows some improvement in retention, particularly in reading comprehension, it also reveals trade-offs in social bias handling.

Details

Motivation: Catastrophic forgetting poses a significant challenge in continual learning, especially for large language models (LLMs) that must retain performance across diverse domains. While growth-based pretraining has shown promise in accelerating convergence, its impact on forgetting has not been thoroughly explored. Method: The study evaluates growth-based models, particularly through transformer stacking (Stack LLM), and compares them with non-growth-based models (LLM) across multiple fine-tuning tasks. These tasks include domain knowledge, reasoning, reading comprehension, and bias evaluation. Result: Both Stack LLM and LLM improved in domain knowledge, but showed degradation in reasoning and reading comprehension over time. Stack LLM demonstrated less degradation, particularly in reading comprehension. In bias evaluation, the baseline LLM became more neutral with continued fine-tuning, while Stack LLM maintained a consistent bias ratio of 60-61%. Conclusion: Growth-based pretraining may offer modest improvements in resisting catastrophic forgetting, but there are trade-offs in handling social biases. Abstract: Catastrophic forgetting is a significant challenge in continual learning, in which a model loses prior knowledge when it is fine-tuned on new tasks. This problem is particularly critical for large language models (LLMs) undergoing continual learning, as retaining performance across diverse domains is important for their general utility. In this paper, we explore model growth, a promising strategy that leverages smaller models to expedite and structure the training of larger ones for mitigating the catastrophic forgetting problem. Although growth-based pretraining, particularly via transformer stacking, has shown promise in accelerating convergence, its impact on forgetting remains under-explored. 
Therefore, we evaluate whether growth-based models can retain previously learned capabilities more effectively across a sequence of fine-tuning tasks involving domain knowledge, reasoning, reading comprehension, and bias. Our findings show that both models, one trained with growth (Stack LLM) and one without (LLM), exhibit improvements in domain knowledge. However, reasoning and reading comprehension degrade over time, indicating signs of catastrophic forgetting. Stack LLM consistently shows less degradation, especially in reading comprehension, suggesting enhanced retention capabilities. Interestingly, in bias evaluation, the baseline LLM becomes progressively more neutral with continued fine-tuning, while Stack LLM maintains a steady bias ratio around 60-61%. These results indicate that growth-based pretraining may deliver modest improvements in resisting catastrophic forgetting, though trade-offs remain in handling social biases.

[72] DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression

Wei Huang,Huang Wei,Yinggui Wang

Main category: cs.CL

TL;DR: This paper proposes DaMoC, a framework for accelerating and optimizing the fine-tuning of large language models (LLMs) on specific tasks.

Details

Motivation: Existing open-source LLMs perform poorly on domain-specific tasks and need fine-tuning on specific data, and quickly selecting the optimal LLM is a challenge. Method: DaMoC optimizes at the data level and the model level: the data level covers a systematic categorization of data filtering methods, an increase in the density of key tokens, and LLM-based rewriting to optimize textual expression; the model level uses layer similarity scores to assess importance and introduces a sparse merging paradigm for model compression. Result: Extensive experiments on four datasets (medical Q&A, financial Q&A, general Q&A, and reading comprehension) show that DaMoC can select the optimal LLM while saving approximately 20-fold in training time. Conclusion: DaMoC can select the optimal LLM for downstream tasks while saving approximately 20-fold in training time. Abstract: Large language models (LLMs) excel in general tasks but struggle with domain-specific ones, requiring fine-tuning with specific data. With many open-source LLMs available, selecting the best model for fine-tuning downstream tasks is challenging; the core problem is how to quickly identify the optimal LLM. We introduce a Data and Model Compression Framework (DaMoC) that addresses this challenge by: 1) Data Level: A systematic categorization of data filtering methodologies for LLMs is first established, classifying them into three distinct paradigms: (1) distribution-aware methods, (2) quality-aware methods, and (3) hybrid approaches considering both dimensions. Further, we enhance the density of key tokens in the text, achieving token compression. Subsequently, we use an LLM to iteratively rewrite the text to optimize its expression. 2) Model Level: We use layer similarity scores to assess each layer's importance and remove those with lower importance. Then, we introduce a sparse merging paradigm to preserve as much of the original model's capability as possible. Extensive experiments on four datasets, medical Q&A, financial Q&A, general Q&A, and reading comprehension, show that we can select the optimal LLM while saving approximately 20-fold in training time.
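The model-level step (scoring each layer by similarity and removing the least important ones) can be sketched as follows. The abstract does not define the similarity score, so this example assumes a common heuristic: a layer whose output is nearly identical to its input, by cosine similarity, changes little and is a candidate for removal. The function names are hypothetical, and the sparse merging step is not modeled.

```python
import numpy as np


def layer_importance(hidden_states):
    """Score each layer as 1 - mean cosine similarity of its input and output.

    hidden_states: list of L + 1 arrays of shape (tokens, dim), where
    hidden_states[i] is the input to layer i and hidden_states[i + 1]
    is its output.
    """
    scores = []
    for h_in, h_out in zip(hidden_states[:-1], hidden_states[1:]):
        dot = np.sum(h_in * h_out, axis=-1)
        norms = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
        cos = dot / (norms + 1e-12)
        scores.append(1.0 - float(cos.mean()))
    return scores


def layers_to_remove(hidden_states, n_remove):
    """Return the indices of the n_remove least important layers."""
    scores = layer_importance(hidden_states)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_remove])
```

In practice the hidden states would be collected by running a small calibration set through the model; here they are passed in directly to keep the sketch self-contained.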

[73] Rethinking the Chain-of-Thought: The Roles of In-Context Learning and Pre-trained Priors

Hao Yang,Zhiyu Yang,Yunjie Zhang,Shanyi Zhu,Lin Yang

Main category: cs.CL

TL;DR: This paper investigates how Chain-of-Thought reasoning works, showing that while models learn reasoning structures quickly and grasp logical patterns, they heavily rely on pretrained knowledge. Providing more exemplars helps shift reliance to in-context information, but misleading prompts can destabilize this. Long reasoning chains improve downstream task performance.

Details

Motivation: To understand the underlying mechanisms of Chain-of-Thought reasoning and how it enhances model inference capabilities. Method: Fine-grained lexical-level analysis of rationales, incremental introduction of noisy exemplars, and investigation of prompt engineering's impact on slow thinking. Result: 1. The model learns reasoning structures quickly but relies heavily on pretrained priors. 2. Sufficient exemplars shift decision-making to in-context signals, while misleading prompts cause instability. 3. Long Chain-of-Thought prompting improves performance by generating longer reasoning chains. Conclusion: Chain-of-Thought prompting can improve the model's performance by generating longer reasoning chains. The model heavily relies on pretrained priors, but sufficient exemplars can shift its decision-making to in-context signals, albeit with instability from misleading prompts. Abstract: Chain-of-Thought reasoning has emerged as a pivotal methodology for enhancing model inference capabilities. Despite growing interest in Chain-of-Thought reasoning, its underlying mechanisms remain unclear. This paper explores the working mechanisms of Chain-of-Thought reasoning from the perspective of the dual relationship between in-context learning and pretrained priors. We first conduct a fine-grained lexical-level analysis of rationales to examine the model's reasoning behavior. Then, by incrementally introducing noisy exemplars, we examine how the model balances pretrained priors against erroneous in-context information. Finally, we investigate whether prompt engineering can induce slow thinking in large language models. Our extensive experiments reveal three key findings: (1) The model not only quickly learns the reasoning structure at the lexical level but also grasps deeper logical reasoning patterns, yet it heavily relies on pretrained priors. 
(2) Providing sufficient exemplars shifts the model's decision-making from pretrained priors to in-context signals, while misleading prompts introduce instability. (3) Long Chain-of-Thought prompting can induce the model to generate longer reasoning chains, thereby improving its performance on downstream tasks.

[74] Annotation and modeling of emotions in a textual corpus: an evaluative approach

Jonas Noblet

Main category: cs.CL

TL;DR: 该论文探讨了基于评价方法的情感分析，发现语言模型不仅能建模标注过程，还能依据评价标准区分情感情境。

Details

Motivation: Emotion is crucial to how humans function in society, yet its textual manifestations remain understudied, especially in emotion analysis based on an evaluative approach. Method: Trains language models on a manually annotated emotion corpus and analyzes the annotation process and its variability. Result: Although the collected annotations show significant disagreement, they exhibit stable statistical trends, and language models can effectively model the annotation process. Conclusion: Language models can distinguish emotional situations based on evaluative criteria, and annotation variability is driven by underlying linguistic features. Abstract: Emotion is a crucial phenomenon in the functioning of human beings in society. However, it remains a widely open subject, particularly in its textual manifestations. This paper examines an industrial corpus manually annotated following an evaluative approach to emotion. This theoretical framework, which is currently underutilized, offers a different perspective that complements traditional approaches. Noting that the annotations we collected exhibit significant disagreement, we hypothesized that they nonetheless follow stable statistical trends. Using language models trained on these annotations, we demonstrate that it is possible to model the labeling process and that variability is driven by underlying linguistic features. Conversely, our results indicate that language models seem capable of distinguishing emotional situations based on evaluative criteria.

[75] Culture is Everywhere: A Call for Intentionally Cultural Evaluation

Juhyun Oh,Inha Cha,Michael Saxon,Hyunseung Lim,Shaily Bhatt,Alice Oh

Main category: cs.CL

TL;DR: The paper argues that current evaluation methods for cultural alignment in large language models are insufficient and proposes a new approach that systematically incorporates cultural considerations and community involvement in NLP research.

Details

Motivation: The motivation stems from the inadequacy of current evaluation methods that reduce culture to static facts and fail to account for the pluralistic and interactive nature of culture in LLMs. Method: The authors analyze the existing paradigms of cultural evaluation for LLMs and propose a new framework that emphasizes researcher positionality and community involvement through participatory methodologies inspired by HCI. Result: The result is a systematic characterization of how culturally contingent considerations arise in evaluation and a proposed framework for more inclusive and culturally aligned NLP research. Conclusion: The paper concludes that the current methods for evaluating cultural alignment in LLMs are insufficient and proposes a new approach called 'intentionally cultural evaluation' to better incorporate cultural considerations in NLP research. Abstract: The prevailing "trivia-centered paradigm" for evaluating the cultural alignment of large language models (LLMs) is increasingly inadequate as these models become more advanced and widely deployed. Existing approaches typically reduce culture to static facts or values, testing models via multiple-choice or short-answer questions that treat culture as isolated trivia. Such methods neglect the pluralistic and interactive realities of culture, and overlook how cultural assumptions permeate even ostensibly "neutral" evaluation settings. In this position paper, we argue for intentionally cultural evaluation: an approach that systematically examines the cultural assumptions embedded in all aspects of evaluation, not just in explicitly cultural tasks. We systematically characterize the what, how, and circumstances by which culturally contingent considerations arise in evaluation, and emphasize the importance of researcher positionality for fostering inclusive, culturally aligned NLP research.
Finally, we discuss implications and future directions for moving beyond current benchmarking practices, discovering important applications that we don't know exist, and involving communities in evaluation design through HCI-inspired participatory methodologies.

[76] TableZoomer: A Collaborative Agent Framework for Large-scale Table Question Answering

Sishi Xiong,Ziyang He,Zhongjiang He,Yu Zhao,Changzai Pan,Jie Zhang,Zhenhe Wu,Shuangyong Song,Yongxiang Li

Main category: cs.CL

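TL;DR: TableZoomer is an LLM-based agent framework that improves the performance and scalability of table question answering through a structured table schema, a query-aware table zooming mechanism, and a Program-of-Thoughts strategy.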

Details

Motivation: Large language models face challenges in table question answering, including structural heterogeneity, difficulty localizing target data, and bottlenecks in complex reasoning, so a new framework is needed to improve performance and scalability. Method: TableZoomer introduces three key techniques: a structured table schema that replaces the raw table, a query-aware table zooming mechanism, and a Program-of-Thoughts strategy that converts questions into executable code, combined with the ReAct paradigm for iterative reasoning. Result: Compared with conventional PoT methods, TableZoomer improves accuracy by 19.34% on the DataBench dataset and by 25% on the small-scale Fact Checking task of the TableBench dataset. Conclusion: TableZoomer is an LLM-powered, programming-based agent framework that addresses structural heterogeneity, target data localization, and complex reasoning bottlenecks in table question answering; experiments show substantial gains in performance and scalability across tables of varying scales. Abstract: While large language models (LLMs) have shown promise in the table question answering (TQA) task through prompt engineering, they face challenges in industrial applications, including structural heterogeneity, difficulties in target data localization, and bottlenecks in complex reasoning. To address these limitations, this paper presents TableZoomer, a novel LLM-powered, programming-based agent framework. It introduces three key innovations: (1) replacing the original fully verbalized table with structured table schema to bridge the semantic gap and reduce computational complexity; (2) a query-aware table zooming mechanism that dynamically generates sub-table schema through column selection and entity linking, significantly improving target localization efficiency; and (3) a Program-of-Thoughts (PoT) strategy that transforms queries into executable code to mitigate numerical hallucination. Additionally, we integrate the reasoning workflow with the ReAct paradigm to enable iterative reasoning. Extensive experiments demonstrate that our framework maintains the usability advantages while substantially enhancing performance and scalability across tables of varying scales. When implemented with the Qwen3-8B-Instruct LLM, TableZoomer achieves accuracy improvements of 19.34% and 25% over conventional PoT methods on the large-scale DataBench dataset and the small-scale Fact Checking task of TableBench dataset, respectively.

[77] Can Smaller LLMs do better? Unlocking Cross-Domain Potential through Parameter-Efficient Fine-Tuning for Text Summarization

Anum Afzal,Mehul Kumawat,Florian Matthes

Main category: cs.CL

TL;DR: This paper explores the use of parameter-efficient fine-tuning techniques, particularly Within-Domain and Cross-Domain Adapters, to improve the performance of large language models in low-resource domains, showing that these methods can outperform traditional approaches.

Details

Motivation: The motivation behind this study is to address the challenges of adapting large language models to new, low-resource domains where labeled data is scarce and traditional fine-tuning is computationally expensive. Method: The researchers benchmarked six parameter-efficient fine-tuning techniques (PEFTs) using the Llama-3-8B-Instruct model on 14 training datasets from Scientific, Medical, Legal, and News domains. They evaluated Within-Domain Adapters and Cross-Domain Adapters for text summarization tasks. Result: The experiments showed that Within-Domain Adapters outperformed Few-Shot learning and even a larger model (Llama-3-70B-Instruct) in low-resource settings. Cross-Domain Adapters and their combinations also demonstrated effectiveness in leveraging linguistic similarities across domains. Conclusion: The study concludes that parameter-efficient fine-tuning techniques, particularly Within-Domain Adapters, can significantly improve the performance of large language models in low-resource domains, outperforming few-shot learning and even larger models. Cross-Domain Adapters and their combinations also show potential in leveraging linguistic commonalities for better domain adaptation. Abstract: Large Language Models (LLMs), being generic task solvers, are versatile. However, despite the vast amount of data they are trained on, there are speculations about their adaptation capabilities to a new domain. Additionally, the simple fine-tuning of the model to incorporate knowledge of a new domain is computationally expensive and time-consuming. This becomes more challenging when the domain in question is also low-resource, and labeled data is unavailable. We leverage parameter-efficient fine-tuning techniques (PEFTs) on high-resource datasets to address these challenges to improve performance on unseen low-resource domains. 
Throughout our experiments, we evaluate whether intrinsic linguistic commonalities between datasets can be leveraged for efficient domain adaptation. We benchmark six PEFTs with Llama-3-8B-Instruct on 14 training datasets from the Scientific, Medical, Legal, and News domains for a Text Summarization task. Our experiments show that for low-resource domains, inference using Within-Domain Adapters can achieve better performance than Few-Shot as well as a much larger Llama-3-70B-Instruct. Lastly, in the absence of Within-Domain Adapters, we explore the concept of using Cross-Domain Adapters as well as the strategic combinations of adapters to leverage intrinsic language similarities across domains, facilitating better adaptability and performance in low-resource settings.

[78] LongCat-Flash Technical Report

Meituan LongCat Team,Bayan,Bei Li,Bingye Lei,Bo Wang,Bolin Rong,Chao Wang,Chao Zhang,Chen Gao,Chen Zhang,Cheng Sun,Chengcheng Han,Chenguang Xi,Chi Zhang,Chong Peng,Chuan Qin,Chuyu Zhang,Cong Chen,Congkui Wang,Dan Ma,Daoru Pan,Defei Bu,Dengchang Zhao,Deyang Kong,Dishan Liu,Feiye Huo,Fengcun Li,Fubao Zhang,Gan Dong,Gang Liu,Gang Xu,Ge Li,Guoqiang Tan,Guoyuan Lin,Haihang Jing,Haomin Fu,Haonan Yan,Haoxing Wen,Haozhe Zhao,Hong Liu,Hongmei Shi,Hongyan Hao,Hongyin Tang,Huantian Lv,Hui Su,Jiacheng Li,Jiahao Liu,Jiahuan Li,Jiajun Yang,Jiaming Wang,Jian Yang,Jianchao Tan,Jiaqi Sun,Jiaqi Zhang,Jiawei Fu,Jiawei Yang,Jiaxi Hu,Jiayu Qin,Jingang Wang,Jiyuan He,Jun Kuang,Junhui Mei,Kai Liang,Ke He,Kefeng Zhang,Keheng Wang,Keqing He,Liang Gao,Liang Shi,Lianhui Ma,Lin Qiu,Lingbin Kong,Lingtong Si,Linkun Lyu,Linsen Guo,Liqi Yang,Lizhi Yan,Mai Xia,Man Gao,Manyuan Zhang,Meng Zhou,Mengxia Shen,Mingxiang Tuo,Mingyang Zhu,Peiguang Li,Peng Pei,Peng Zhao,Pengcheng Jia,Pingwei Sun,Qi Gu,Qianyun Li,Qingyuan Li,Qiong Huang,Qiyuan Duan,Ran Meng,Rongxiang Weng,Ruichen Shao,Rumei Li,Shizhe Wu,Shuai Liang,Shuo Wang,Suogui Dang,Tao Fang,Tao Li,Tefeng Chen,Tianhao Bai,Tianhao Zhou,Tingwen Xie,Wei He,Wei Huang,Wei Liu,Wei Shi,Wei Wang,Wei Wu,Weikang Zhao,Wen Zan,Wenjie Shi,Xi Nan,Xi Su,Xiang Li,Xiang Mei,Xiangyang Ji,Xiangyu Xi,Xiangzhou Huang,Xianpeng Li,Xiao Fu,Xiao Liu,Xiao Wei,Xiaodong Cai,Xiaolong Chen,Xiaoqing Liu,Xiaotong Li,Xiaowei Shi,Xiaoyu Li,Xili Wang,Xin Chen,Xing Hu,Xingyu Miao,Xinyan He,Xuemiao Zhang,Xueyuan Hao,Xuezhi Cao,Xunliang Cai,Xurui Yang,Yan Feng,Yang Bai,Yang Chen,Yang Yang,Yaqi Huo,Yerui Sun,Yifan Lu,Yifan Zhang,Yipeng Zang,Yitao Zhai,Yiyang Li,Yongjing Yin,Yongkang Lv,Yongwei Zhou,Yu Yang,Yuchen Xie,Yueqing Sun,Yuewen Zheng,Yuhua Wei,Yulei Qian,Yunfan Liang,Yunfang Tai,Yunke Zhao,Zeyang Yu,Zhao Zhang,Zhaohua Yang,Zhenchao Zhang,Zhikang Xia,Zhiye Zou,Zhizhao Zeng,Zhongda Su,Zhuofan Chen,Zijian Zhang,Ziwen Wang,Zixu Jiang,Zizhe Zhao,Zongyu Wang,Zunhai Su

Main category: cs.CL

TL;DR: LongCat-Flash is a highly efficient, large-scale language model with advanced agentic capabilities, achieving fast training and inference while maintaining competitive performance.

Details

Motivation: The motivation behind LongCat-Flash is to address the need for scalable efficiency and advanced agentic capabilities in large language models, enabling optimized resource usage and faster inference while maintaining performance. Method: The model introduces two novel designs: (a) Zero-computation Experts for dynamic computational budget allocation, and (b) Shortcut-connected MoE to improve computation-communication overlap. It also utilizes a comprehensive scaling framework combining hyperparameter transfer, model-growth initialization, and a stability suite for efficient training. Result: LongCat-Flash successfully completes training on more than 20 trillion tokens within 30 days, achieves over 100 tokens per second (TPS) for inference at a low cost, and demonstrates strong performance on agentic tasks. Conclusion: LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model that achieves scalable efficiency and advanced agentic capabilities. It is designed to optimize resource usage and improve inference efficiency, and it performs competitively among leading models, especially in agentic tasks. Abstract: We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enables dynamic computational budget allocation and activates 18.6B-31.3B (27B on average) per token depending on contextual demands, optimizing resource usage. (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. 
We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy among scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct a large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: https://longcat.ai Hugging Face: https://huggingface.co/meituan-longcat GitHub: https://github.com/meituan-longcat

[79] KoBLEX: Open Legal Question Answering with Multi-hop Reasoning

Jihyung Lee,Daehui Kim,Seonjeong Hwang,Hyounghun Kim,Gary Lee

Main category: cs.CL

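TL;DR: This paper proposes KoBLEX, a new legal question answering benchmark, and ParSeR, a new retrieval method, to improve LLM question answering in the legal domain, and uses a new evaluation metric, LF-Eval, to show that ParSeR outperforms existing methods.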

Details

Motivation: Existing benchmarks fail to evaluate open-ended, provision-grounded question answering, so a new benchmark is needed to evaluate provision-grounded, multi-hop legal reasoning. Method: The paper introduces a Korean Benchmark for Legal EXplainable QA (KoBLEX), proposes a Parametric provision-guided Selection Retrieval method (ParSeR), and proposes Legal Fidelity Evaluation (LF-Eval), an automatic metric for legal fidelity. Result: Experiments show that ParSeR achieves the best results across multiple LLMs; compared with standard retrieval, it improves F1 by 37.91 and LF-Eval by 30.81. Conclusion: ParSeR consistently outperforms strong baselines across multiple LLMs and substantially improves F1 and LF-Eval scores over standard retrieval. Abstract: Large Language Models (LLMs) have achieved remarkable performance in general domains and are now extending into the expert domain of law. Several benchmarks have been proposed to evaluate LLMs' legal capabilities. However, these benchmarks fail to evaluate open-ended and provision-grounded Question Answering (QA). To address this, we introduce a Korean Benchmark for Legal EXplainable QA (KoBLEX), designed to evaluate provision-grounded, multi-hop legal reasoning. KoBLEX includes 226 scenario-based QA instances and their supporting provisions, created using a hybrid LLM-human expert pipeline. We also propose a method called Parametric provision-guided Selection Retrieval (ParSeR), which uses LLM-generated parametric provisions to guide legally grounded and reliable answers. ParSeR facilitates multi-hop reasoning on complex legal questions by generating parametric provisions and employing a three-stage sequential retrieval process. Furthermore, to better evaluate the legal fidelity of the generated answers, we propose Legal Fidelity Evaluation (LF-Eval). LF-Eval is an automatic metric that jointly considers the question, answer, and supporting provisions and shows a high correlation with human judgments. Experimental results show that ParSeR consistently outperforms strong baselines, achieving the best results across multiple LLMs. Notably, compared to standard retrieval with GPT-4o, ParSeR achieves +37.91 higher F1 and +30.81 higher LF-Eval. Further analyses reveal that ParSeR efficiently delivers consistent performance across reasoning depths, with ablations confirming the effectiveness of ParSeR.

[80] Can Large Language Models Master Complex Card Games?

Wei Wang,Fuqing Bie,Junzhe Chen,Dan Zhang,Shiyu Huang,Evgeny Kharlamov,Jie Tang

Main category: cs.CL

TL;DR: This paper explores the potential of large language models (LLMs) in mastering complex card games, showing that they can perform at a high level when fine-tuned on quality data, but may lose some general capabilities in the process, which can be addressed by including general instruction data.

Details

Motivation: The motivation stems from the success of AI algorithms like AlphaGo and the impressive capabilities of LLMs across various tasks. The paper aims to explore whether LLMs can achieve similar success in complex games, particularly card games. Method: The paper systematically assesses the learning capabilities of LLMs across eight diverse card games, evaluating the impact of supervised fine-tuning on high-quality gameplay data and examining the models' ability to retain general capabilities. Result: The findings indicate that LLMs can approach the performance of strong game AIs through supervised fine-tuning, master multiple complex card games simultaneously, and experience a decline in general capabilities that can be mitigated by integrating general instruction data. Conclusion: The paper concludes that LLMs have strong learning ability and versatility in mastering complex card games, although this process comes with challenges like the decline of general capabilities, which can be mitigated with general instruction data. Abstract: Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models' ability to retain general capabilities while mastering these games. 
Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can master multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs.

[81] Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic

Mohammad Zbeeb,Hasan Abed Al Kader Hammoud,Bernard Ghanem

Main category: cs.CL

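TL;DR: The study proposes a new method for improving a model's reasoning ability by extracting and transferring a reasoning vector, substantially reducing the need for expensive training.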

Details

Motivation: Large language models usually require costly optimization, such as reinforcement learning, to master complex reasoning tasks. The authors look for a way to transfer already-learned reasoning ability between models in order to reduce compute cost. Method: The authors take two publicly available, identically initialized Qwen2.5 models, one fine-tuned with supervised fine-tuning (SFT) and the other optimized with group relative policy optimization (GRPO), and extract a reasoning vector by subtraction: $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$. Result: Adding the reasoning vector consistently improves performance across several reasoning benchmarks (GSM8K +4.9%, HumanEval +4.3%, SciQ +1.7%, and BigBenchHard +12.3% for the 1.5B model), and the gains persist under adversarial conditions. Conversely, subtracting the vector causes significant performance degradation. Conclusion: The study shows that reasoning capabilities, which normally require expensive training, can be extracted from existing open-source models and reused through simple tensor arithmetic. Abstract: Large language models often require costly optimization, such as reinforcement learning, to master complex reasoning tasks. This work demonstrates that reasoning ability, once learned, can be extracted and transferred between models as a compact task vector. We source two publicly available, identically initialized Qwen2.5 models, one fine-tuned with supervised fine-tuning (SFT) and the other with group relative policy optimization (GRPO) on the same dataset. From these, we extract a reasoning vector: $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$. We hypothesize that this vector captures the reasoning capability instilled by reinforcement learning while factoring out shared knowledge from the SFT process. When added to compatible instruction-tuned models through simple arithmetic, this vector consistently improves performance across diverse reasoning benchmarks: GSM8K (+4.9%), HumanEval (+4.3%), SciQ (+1.7%), and BigBenchHard (+12.3% for the 1.5B model). The performance improvements persist under adversarial conditions. Conversely, subtracting the vector causes significant performance degradation (-11.8% on GSM8K), demonstrating the vector's strong contribution to the model's reasoning abilities. This work shows how reasoning capabilities, typically developed through expensive training, can be extracted from existing open-source models and reused through simple tensor arithmetic, offering a practical way to enhance models by recycling prior computational investments.
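
As a rough illustration of the arithmetic $v_{\text{reason}} = \theta_{\text{GRPO}} - \theta_{\text{SFT}}$, the sketch below treats each checkpoint as a dictionary from parameter name to a flat list of floats. This representation, the function names, and the `scale` argument are assumptions made for the example; the abstract does not describe the authors' implementation, and real checkpoints would hold tensors rather than lists.

```python
from typing import Dict, List

# Toy stand-in for a checkpoint: parameter name -> flattened weights.
StateDict = Dict[str, List[float]]


def extract_vector(theta_grpo: StateDict, theta_sft: StateDict) -> StateDict:
    """Element-wise difference theta_GRPO - theta_SFT for every parameter."""
    if theta_grpo.keys() != theta_sft.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [g - s for g, s in zip(theta_grpo[name], theta_sft[name])]
        for name in theta_grpo
    }


def apply_vector(theta: StateDict, vector: StateDict, scale: float = 1.0) -> StateDict:
    """Add scale * vector to a compatible model; scale=-1.0 subtracts it."""
    if theta.keys() != vector.keys():
        raise ValueError("model and vector must share the same parameter names")
    return {
        name: [w + scale * d for w, d in zip(theta[name], vector[name])]
        for name in theta
    }


# Hypothetical two-parameter "models" used only to show the mechanics.
sft = {"layer.weight": [1.0, 2.0]}
grpo = {"layer.weight": [1.5, 1.0]}
v_reason = extract_vector(grpo, sft)
enhanced = apply_vector({"layer.weight": [0.0, 0.0]}, v_reason)
```

The same `apply_vector` call with `scale=-1.0` corresponds to the subtraction experiment mentioned in the abstract. The vector is only meaningful when the target model shares the architecture and parameter names of the source checkpoints.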

[82] WATCHED: A Web AI Agent Tool for Combating Hate Speech by Expanding Data

Paloma Piot,Diego Sánchez,Javier Parapar

Main category: cs.CL

TL;DR: This paper presents WATCHED, a chatbot designed to support content moderators in tackling hate speech by combining AI and human judgment.

Details

Motivation: Online harms, especially hate speech, are growing problems that reduce trust in social media platforms. Effective tools are needed to combine the speed of automated systems with human judgment to find and explain harmful content. Method: The paper introduces WATCHED, a chatbot built as an Artificial Intelligence Agent system using Large Language Models and specialized tools. It compares posts with examples of hate speech, uses a BERT-based classifier, looks up slang with sources like Urban Dictionary, generates reasoning, and checks platform guidelines. Result: Experimental results show that WATCHED surpasses existing state-of-the-art methods in detecting hate speech, achieving a macro F1 score of 0.91. Conclusion: WATCHED effectively detects and explains hate speech, supporting collaboration between AI systems and human moderators to reduce online harms. Abstract: Online harms are a growing problem in digital spaces, putting user safety at risk and reducing trust in social media platforms. One of the most persistent forms of harm is hate speech. To address this, we need tools that combine the speed and scale of automated systems with the judgment and insight of human moderators. These tools should not only find harmful content but also explain their decisions clearly, helping to build trust and understanding. In this paper, we present WATCHED, a chatbot designed to support content moderators in tackling hate speech. The chatbot is built as an Artificial Intelligence Agent system that uses Large Language Models along with several specialised tools. It compares new posts with real examples of hate speech and neutral content, uses a BERT-based classifier to help flag harmful messages, looks up slang and informal language using sources like Urban Dictionary, generates chain-of-thought reasoning, and checks platform guidelines to explain and support its decisions. 
This combination allows the chatbot not only to detect hate speech but to explain why content is considered harmful, grounded in both precedent and policy. Experimental results show that our proposed method surpasses existing state-of-the-art methods, reaching a macro F1 score of 0.91. Designed for moderators, safety teams, and researchers, the tool helps reduce online harms by supporting collaboration between AI and human oversight.

[83] ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links

Serwar Basch,Ilia Kuznetsov,Tom Hope,Iryna Gurevych

Main category: cs.CL

TL;DR: This paper introduces a new framework for creating cross-document link datasets and shows that combining retrieval models with LLMs significantly improves link approval rates.

Details

Motivation: The lack of efficient methods to create training and evaluation datasets for cross-document links limits the study of automated assistance in many application domains. Method: The framework generates and validates semi-synthetic datasets, performs automatic evaluation to shortlist best-performing linking approaches, and conducts extensive human evaluation studies. Result: The framework was applied in peer review and news domains, showing that combining retrieval models with LLMs achieves 78% link approval from human raters, more than doubling the precision of strong retrievers alone. Conclusion: The study presents a new domain-agnostic framework for selecting and annotating cross-document links, enabling systematic study of cross-document understanding and providing novel datasets for various tasks. Abstract: Understanding fine-grained relations between documents is crucial for many application domains. However, the study of automated assistance is limited by the lack of efficient methods to create training and evaluation datasets of cross-document links. To address this, we introduce a new domain-agnostic framework for selecting a best-performing approach and annotating cross-document links in a new domain from scratch. We first generate and validate semi-synthetic datasets of interconnected documents. This data is used to perform automatic evaluation, producing a shortlist of best-performing linking approaches. These approaches are then used in an extensive human evaluation study, yielding performance estimates on natural text pairs. We apply our framework in two distinct domains (peer review and news) and show that combining retrieval models with LLMs achieves 78% link approval from human raters, more than doubling the precision of strong retrievers alone. 
Our framework enables systematic study of cross-document understanding across application scenarios, and the resulting novel datasets lay the foundation for numerous cross-document tasks like media framing and peer review. We make the code, data, and annotation protocols openly available.

[84] Analysing the Language of Neural Audio Codecs

Joonyong Park,Shinnosuke Takamichi,David M. Chan,Shunsuke Kando,Yuki Saito,Hiroshi Saruwatari

Main category: cs.CL

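TL;DR: This study analyzes the tokens produced by neural audio codecs and finds that they exhibit language-like statistical properties that are associated with better speech recognition and synthesis performance.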

Details

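Motivation: Studying the statistical and linguistic properties of neural audio codecs (NACs) helps in understanding the structure of the speech they generate and in improving the design of generative speech models. Method: The study analyzes the statistical and linguistic properties of discrete speech tokens produced by different NAC models, and assesses semantic and acoustic preservation using automatic speech recognition error rates and UTMOS scores. Result: NAC tokens, particularly 3-grams, exhibit language-like statistical patterns; these properties, together with measures of information content, correlate with improved performance in speech recognition and speech synthesis tasks. Conclusion: NAC token sequences have language-like statistical properties, and these properties, together with measures of information content, are associated with better performance in speech recognition and resynthesis tasks. Abstract: This study presents a comparative analysis of the statistical and linguistic properties of neural audio codecs (NACs). We investigate discrete speech tokens produced by various NAC models, examining their adherence to linguistic statistical laws such as Zipf's law and Heaps' law, as well as their entropy and redundancy. To assess how these token-level properties relate to semantic and acoustic preservation in synthesized speech, we evaluate intelligibility using error rates of automatic speech recognition, and quality using the UTMOS score. Our results reveal that NAC tokens, particularly 3-grams, exhibit language-like statistical patterns. Moreover, these properties, together with measures of information content, are found to correlate with improved performance in speech recognition and resynthesis tasks. These findings offer insights into the structure of NAC token sequences and inform the design of more effective generative speech models.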
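
To make the token-level statistics concrete, here is a minimal sketch of how n-gram counts, Shannon entropy, and a Zipf exponent could be computed for a sequence of discrete codec token IDs. The function names and the least-squares fit in log-log space are assumptions for illustration; the abstract does not say how the authors estimate these quantities.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def ngram_counts(tokens: Sequence[Hashable], n: int) -> Counter:
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def shannon_entropy(counts: Counter) -> float:
    """Entropy in bits of the empirical distribution given by the counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(counts: Counter) -> float:
    """Least-squares slope of log(frequency) against log(rank).

    Zipf's law predicts a slope close to -1. At least two distinct
    items are required for the fit to be defined.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct items to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical token IDs standing in for codec output.
tokens = [1, 2, 1, 2, 1, 3]
trigram_stats = ngram_counts(tokens, 3)
```

Redundancy, also mentioned in the abstract, is commonly derived from the entropy as one minus the ratio of observed entropy to the maximum entropy for the vocabulary size; that definition is a standard one and is not confirmed by the source.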

[85] LLMs cannot spot math errors, even when allowed to peek into the solution

KV Aditya Srivatsa,Kaushal Kumar Maurya,Ekaterina Kochmar

Main category: cs.CL

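TL;DR: The paper examines the difficulty large language models have in identifying errors in student solutions and proposes a new approach that improves performance by generating an intermediate corrected student solution.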

Details

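Motivation: Although large language models perform well on math word problems, they still struggle with meta-reasoning tasks such as identifying errors in student solutions. This study aims to address that challenge. Method: The paper uses two error reasoning datasets (VtG and PRM800K) to investigate the challenge of locating the first error step in stepwise student solutions, and proposes an approach that generates an intermediate corrected student solution. Result: Experiments show that state-of-the-art large language models struggle to locate the first error step in student solutions even when given access to the reference solution. Conclusion: The paper concludes that state-of-the-art large language models still struggle to locate the first error step in student solutions, but that generating an intermediate corrected student solution improves performance. Abstract: Large language models (LLMs) demonstrate remarkable performance on math word problems, yet they have been shown to struggle with meta-reasoning tasks such as identifying errors in student solutions. In this work, we investigate the challenge of locating the first error step in stepwise solutions using two error reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. To that end, we propose an approach that generates an intermediate corrected student solution, aligning more closely with the original student's solution, which helps improve performance.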

[86] Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning

Kaviraj Pather,Elena Hadjigeorgiou,Arben Krasniqi,Claire Schmit,Irina Rusu,Marc Pons,Kabir Khan

Main category: cs.CL

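TL;DR: Vis-CoT is a human-in-the-loop framework that converts linear CoT text into an interactive reasoning graph to improve the reasoning accuracy and trustworthiness of large language models (LLMs).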

Details

Motivation: LLMs show strong reasoning via chain-of-thought (CoT) prompting, but the process is opaque, which makes verification, debugging, and control difficult in high-stakes settings. Method: Linear CoT text is converted into an interactive reasoning graph; users can visualize the logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting user-defined premises. Result: On GSM8K and StrategyQA, Vis-CoT improves final-answer accuracy by up to 24 percentage points over non-interactive baselines. A user study also shows large gains in perceived usability and trust. Conclusion: By combining LLMs with targeted human oversight, Vis-CoT offers a practical path to more reliable, understandable, and collaborative reasoning. Abstract: Large language models (LLMs) show strong reasoning via chain-of-thought (CoT) prompting, but the process is opaque, which makes verification, debugging, and control difficult in high-stakes settings. We present Vis-CoT, a human-in-the-loop framework that converts linear CoT text into an interactive reasoning graph. Users can visualize the logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting new, user-defined premises. This shifts interaction from passive observation to active collaboration, steering models toward more accurate and trustworthy conclusions. Across GSM8K and StrategyQA, Vis-CoT improves final-answer accuracy by up to 24 percentage points over non-interactive baselines. A user study also shows large gains in perceived usability and trust. Vis-CoT points to a practical path for more reliable, understandable, and collaborative reasoning by combining LLMs with targeted human oversight.
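
The pruning and grafting operations can be pictured with a small tree of reasoning steps. The sketch below is an assumed data structure, not the Vis-CoT implementation: the class name `ReasoningGraph`, its methods, and the helper `from_linear_cot` are all hypothetical, and the real system presumably also re-prompts the model after an intervention, which is omitted here.

```python
from typing import Dict, List, Optional


class ReasoningGraph:
    """Tree of reasoning steps; each step has a text and an optional parent."""

    def __init__(self) -> None:
        self.text: Dict[int, str] = {}
        self.parent: Dict[int, Optional[int]] = {}
        self.children: Dict[int, List[int]] = {}
        self._next_id = 0

    def add_step(self, text: str, parent: Optional[int] = None) -> int:
        """Add a step under `parent`; also used to graft a user premise."""
        step_id = self._next_id
        self._next_id += 1
        self.text[step_id] = text
        self.parent[step_id] = parent
        self.children[step_id] = []
        if parent is not None:
            self.children[parent].append(step_id)
        return step_id

    def prune(self, step_id: int) -> List[int]:
        """Remove a flawed step and everything derived from it."""
        parent = self.parent[step_id]
        if parent is not None:
            self.children[parent].remove(step_id)
        removed, stack = [], [step_id]
        while stack:
            node = stack.pop()
            stack.extend(self.children.pop(node))
            del self.text[node]
            del self.parent[node]
            removed.append(node)
        return removed

    def path_to(self, step_id: int) -> List[str]:
        """Texts from the root down to `step_id`, e.g. to rebuild a prompt."""
        path: List[str] = []
        node: Optional[int] = step_id
        while node is not None:
            path.append(self.text[node])
            node = self.parent[node]
        return path[::-1]


def from_linear_cot(steps: List[str]) -> ReasoningGraph:
    """Turn a linear chain of thought into a single-path graph."""
    graph = ReasoningGraph()
    previous: Optional[int] = None
    for step in steps:
        previous = graph.add_step(step, parent=previous)
    return graph


graph = from_linear_cot(["Read the problem", "Assume x = 3", "So the answer is 9"])
removed = graph.prune(1)  # drop the flawed assumption and its conclusion
grafted = graph.add_step("User premise: x = 4", parent=0)
```

After the intervention, `graph.path_to(grafted)` yields the surviving context plus the grafted premise, which is the kind of prefix a model could continue from.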

[87] On the Alignment of Large Language Models with Global Human Opinion

Yang Liu,Masahiro Kaneko,Chenhui Chu

Main category: cs.CL

TL;DR: This study investigates how large language models align with global human opinions across different languages and time periods, finding that LLMs align better with contemporary populations and that prompt language can be used to improve alignment with specific countries.

Details

Motivation: Existing studies mainly focus on researching the opinions represented by LLMs among demographic groups in the United States or a few countries, lacking worldwide country samples and studies on human opinions in different historical periods. Additionally, there is a lack of discussion on using language to steer LLMs and the potential influence of prompt language on alignment. Method: An evaluation framework based on the World Values Survey (WVS) was created to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods. Result: LLMs appropriately or over-align with the opinions of only a few countries while under-aligning with most countries. Changing the language of the prompt to match the language used in the questionnaire can effectively steer LLMs to align with the opinions of the corresponding country more effectively than existing steering methods. LLMs are also more aligned with the opinions of the contemporary population. Conclusion: This study is the first comprehensive investigation of opinion alignment in LLMs across global, language, and temporal dimensions, revealing that LLMs align better with contemporary populations and that prompt language can effectively steer alignment with specific countries. Abstract: Today's large language models (LLMs) are capable of supporting multilingual scenarios, allowing users to interact with LLMs in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly focus on researching the opinions represented by LLMs among demographic groups in the United States or a few countries, lacking worldwide country samples and studies on human opinions in different historical periods, as well as lacking discussion on using language to steer LLMs. 
Moreover, they also overlook the potential influence of prompt language on the alignment of LLMs' opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods around the world. We find that LLMs appropriately or over-align the opinions with only a few countries while under-aligning the opinions with most countries. Furthermore, changing the language of the prompt to match the language used in the questionnaire can effectively steer LLMs to align with the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of the topic of opinion alignment in LLMs across global, language, and temporal dimensions. Our code and data are publicly available at https://github.com/nlply/global-opinion-alignment.

[88] Trusted Uncertainty in Large Language Models: A Unified Framework for Confidence Calibration and Risk-Controlled Refusal

Markus Oehri,Giulia Conti,Kaviraj Pather,Alexandre Rossi,Laia Serra,Adrian Parody,Rogvi Johannesen,Aviaja Petersen,Arben Krasniqi

Main category: cs.CL

TL;DR: UniCR is a framework that enhances language model trustworthiness by converting uncertainty signals into calibrated probabilities and enforcing error budgets without model fine-tuning.

Details

Motivation: Language models need to know not only what to answer but also when not to answer. Existing methods like entropy thresholds or post-hoc calibrators are insufficient in providing reliable uncertainty estimation and risk control. Method: UniCR uses a calibration head with temperature scaling and proper scoring to fuse heterogeneous uncertainty evidence (like sequence likelihoods, self-consistency dispersion, retrieval compatibility, and tool feedback) into a calibrated probability of correctness, and then applies conformal risk control to enforce error budgets. Result: Experiments show that UniCR outperforms existing methods in calibration metrics, achieves lower area under the risk-coverage curve, and maintains higher coverage at fixed risk levels. It effectively identifies key drivers of uncertainty and provides informative refusal messages. Conclusion: UniCR improves the trustworthiness of language models by turning uncertainty evidence into calibrated probabilities and enforcing error budgets through principled refusal, all without fine-tuning the base model. Abstract: Deployed language models must decide not only what to answer but also when not to answer. We present UniCR, a unified framework that turns heterogeneous uncertainty evidence including sequence likelihoods, self-consistency dispersion, retrieval compatibility, and tool or verifier feedback into a calibrated probability of correctness and then enforces a user-specified error budget via principled refusal. UniCR learns a lightweight calibration head with temperature scaling and proper scoring, supports API-only models through black-box features, and offers distribution-free guarantees using conformal risk control. For long-form generation, we align confidence with semantic fidelity by supervising on atomic factuality scores derived from retrieved evidence, reducing confident hallucinations while preserving coverage. 
Experiments on short-form QA, code generation with execution tests, and retrieval-augmented long-form QA show consistent improvements in calibration metrics, lower area under the risk-coverage curve, and higher coverage at fixed risk compared to entropy or logit thresholds, post-hoc calibrators, and end-to-end selective baselines. Analyses reveal that evidence contradiction, semantic dispersion, and tool inconsistency are the dominant drivers of abstention, yielding informative user-facing refusal messages. The result is a portable recipe of evidence fusion to calibrated probability to risk-controlled decision that improves trustworthiness without fine-tuning the base model and remains valid under distribution shift.
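
The "calibrated probability to risk-controlled decision" step can be illustrated with a minimal sketch. It is not UniCR's implementation: the single-logit temperature scaling, the function names, and the choice of loss (1 when the model answers and is wrong) are assumptions made for illustration. The threshold rule is the standard conformal risk control bound for a loss bounded by 1.

```python
import math

def temperature_scale(logit: float, temperature: float) -> float:
    """Map a raw confidence logit to a probability (hypothetical one-feature head)."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))

def crc_threshold(confidences, correct, alpha):
    """Return the smallest refusal threshold whose conformal risk bound is <= alpha.

    Assumed loss: 1 if the model answers (confidence >= threshold) and is wrong,
    else 0. This loss is non-increasing in the threshold, as the bound requires.
    """
    n = len(confidences)
    candidates = sorted(set(confidences)) + [float("inf")]
    for lam in candidates:
        errors = sum(1 for c, ok in zip(confidences, correct) if c >= lam and not ok)
        # Equal to n/(n+1) * empirical_risk + 1/(n+1) for a loss bounded by 1.
        bound = (errors + 1) / (n + 1)
        if bound <= alpha:
            return lam
    return float("inf")  # no finite threshold meets the budget: always refuse

# Toy calibration set (made-up values, not results from the paper).
cal_conf = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.95]
cal_ok = [True, True, True, False, True, False, False, False, True]
threshold = crc_threshold(cal_conf, cal_ok, alpha=0.25)
```

At inference time the model would answer only when its calibrated confidence reaches `threshold` and refuse otherwise.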

[89] Robust Knowledge Editing via Explicit Reasoning Chains for Distractor-Resilient Multi-Hop QA

Yuchen Wu,Liang Ding,Li Shen,Dacheng Tao

Main category: cs.CL

TL;DR: Reason-KE is an efficient knowledge-editing framework that improves multi-hop question answering accuracy in large language models (LLMs) under noisy conditions.

Details

Motivation: LLMs are static after training, making full retraining expensive for integrating new facts. Existing knowledge-editing techniques struggle with noisy, multi-hop conditions. Method: Reason-KE uses an end-to-end reasoning-chain-based editing framework with four stages: fact acknowledgment, relevance determination, selective application, and final reasoning. Result: On MQuAKE-CF, Reason-KE achieves 90.2% accuracy for multi-hop QA, with only a 6.3% drop under heavy distraction and <1% when answers are leaked. Conclusion: Reason-KE provides a resilient and efficient framework for knowledge editing in LLMs, setting a new state-of-the-art for reliable knowledge updates. Abstract: Large language models (LLMs) encode vast amounts of world knowledge but remain static once trained, making the timely integration of emerging facts prohibitively expensive via full retraining. Knowledge-editing techniques have thus emerged to inject or overwrite specific facts into LLMs, yet they either over-rely on superficial cues or incur complex, iterative pipelines that collapse under noisy, multi-hop conditions. We introduce Reason-KE, an end-to-end reasoning-chain-based editing framework that steers a pretrained LLM through four structured stages (fact acknowledgment, relevance determination, selective application, and final reasoning) to filter distractors in a single pass. Trained on MQuAKE-CF with up to four irrelevant facts, Reason-KE elevates Qwen2.5-7B's multi-hop QA accuracy to 90.2% while suffering merely a 6.3% drop under heavy distraction and <1% when answers are leaked. Our quantitative analysis confirms Reason-KE's resilience and efficiency, establishing a new state-of-the-art for reliable LLM knowledge updates.

[90] Do Retrieval Augmented Language Models Know When They Don't Know?

Youchao Zhou,Heyan Huang,Yicheng Liu,Rui Dai,Xinglin Wang,Xingchen Zhang,Shumin Shi,Yang Deng

Main category: cs.CL

TL;DR: This paper explores whether Retrieval Augmented Language Models (RALMs) can recognize when they lack knowledge and refuse to answer, finding that they often over-refuse and that specific post-training methods can improve this behavior without compromising answer quality.

Details

Motivation: The motivation is to address the issue of hallucinations in Large Language Models (LLMs) by understanding whether RALMs have the capability to refuse answering when uncertain, and how post-training methods affect this refusal behavior. Method: The researchers evaluated Retrieval Augmented Language Models (RALMs) by examining their calibration in different knowledge states, investigated the impact of refusal post-training methods (Refusal-aware Instruction Tuning and In-Context Fine-tuning), and developed a refusal method to improve overall answer quality. Result: RALMs exhibit over-refusal behavior, which can be mitigated by In-context fine-tuning but worsened by R-tuning. Refusal ability may conflict with answer quality, prompting the development of a new refusal method for better performance. Conclusion: The study concludes that RALMs do not always know when they don't know, displaying over-refusal behavior, and that refusal post-training methods like In-context fine-tuning can mitigate this issue while maintaining answer quality. Abstract: Existing Large Language Models (LLMs) occasionally generate plausible yet factually incorrect responses, known as hallucinations. Researchers are primarily using two approaches to mitigate hallucinations, namely Retrieval Augmented Language Models (RALMs) and refusal post-training. However, current research predominantly emphasizes their individual effectiveness while overlooking the evaluation of the refusal capability of RALMs. In this study, we ask the fundamental question: Do RALMs know when they don't know? Specifically, we ask three questions. First, are RALMs well-calibrated regarding different internal and external knowledge states? We examine the influence of various factors. Contrary to expectations, we find that LLMs exhibit significant over-refusal behavior. Then, how does refusal post-training affect the over-refusal issue? 
We investigate the Refusal-aware Instruction Tuning and In-Context Fine-tuning methods. Our results show that the over-refusal problem is mitigated by In-context fine-tuning but magnified by R-tuning. However, we also find that the refusal ability may conflict with the quality of the answer. Finally, we develop a simple yet effective refusal method for refusal post-trained models to improve their overall answer quality in terms of refusal and correct answers. Our study provides a more comprehensive understanding of the influence of important factors on RALM systems.

[91] MeVe: A Modular System for Memory Verification and Effective Context Control in Language Models

Andreas Ottem

Main category: cs.CL

TL;DR: MeVe is a new modular architecture for Retrieval-Augmented Generation that improves context efficiency and accuracy by verifying and refining retrieved information before use.

Details

Motivation: Traditional RAG systems suffer from inefficiencies due to irrelevant or redundant retrieved information, limiting their performance and effectiveness. Method: MeVe introduces a five-phase modular architecture for RAG systems, including initial retrieval, relevance verification, fallback retrieval, context prioritization, and token budgeting. Result: MeVe improves context efficiency significantly, achieving a 57% reduction in unnecessary context on the Wikipedia dataset and a 75% reduction on the HotpotQA dataset compared to standard RAG approaches. Conclusion: MeVe offers a framework for scalable and reliable LLM applications by refining contextual information, leading to better grounding and factual accuracy. Abstract: Retrieval-Augmented Generation (RAG) systems typically face constraints because of their inherent mechanism: a simple top-k semantic search [1]. The approach often leads to the incorporation of irrelevant or redundant information in the context, degrading performance and efficiency [10][11]. This paper presents MeVe, a novel modular architecture intended for Memory Verification and smart context composition. MeVe rethinks the RAG paradigm by proposing a five-phase modular design that distinctly breaks down the retrieval and context composition process into distinct, auditable, and independently tunable phases: initial retrieval, relevance verification, fallback retrieval, context prioritization, and token budgeting. This architecture enables fine-grained control of what knowledge is made available to an LLM, enabling task-dependent filtering and adaptation. We release a reference implementation of MeVe as a proof of concept and evaluate its performance on knowledge-heavy QA tasks over a subset of English Wikipedia [22]. 
Our results demonstrate that by actively verifying information before composition, MeVe significantly improves context efficiency, achieving a 57% reduction on the Wikipedia dataset and a 75% reduction on the more complex HotpotQA dataset compared to standard RAG implementations [25]. This work provides a framework for more scalable and reliable LLM applications. By refining and distilling contextual information, MeVe offers a path toward better grounding and more accurate factual support [16].
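
The five phases named in the abstract can be sketched as a small pipeline. This is an illustrative toy, not the released MeVe reference implementation: the word-overlap `score` function, the thresholds, and the whitespace token count are all assumptions standing in for a real retriever, verifier, and tokenizer.

```python
def score(query: str, doc: str) -> float:
    """Toy relevance: fraction of query words that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def meve_style_context(query, corpus, k=3, min_score=0.5, min_verified=1, token_budget=12):
    """Hypothetical five-phase context builder following the phases in the abstract."""
    # Phase 1: initial retrieval (top-k by the toy score).
    ranked = sorted(corpus, key=lambda d: score(query, d), reverse=True)
    candidates = ranked[:k]
    # Phase 2: relevance verification against a threshold.
    verified = [d for d in candidates if score(query, d) >= min_score]
    # Phase 3: fallback retrieval when too few candidates pass verification.
    if len(verified) < min_verified:
        verified = [d for d in ranked if score(query, d) > 0.0][:k]
    # Phase 4: context prioritization (drop duplicates, best score first).
    seen, prioritized = set(), []
    for d in sorted(verified, key=lambda d: score(query, d), reverse=True):
        if d not in seen:
            seen.add(d)
            prioritized.append(d)
    # Phase 5: token budgeting (whitespace tokens as a stand-in for a tokenizer).
    context, used = [], 0
    for d in prioritized:
        n_tokens = len(d.split())
        if used + n_tokens <= token_budget:
            context.append(d)
            used += n_tokens
    return context

corpus = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "Bananas are yellow",
]
context = meve_style_context("capital of France", corpus, k=2, min_score=0.9)
```

Keeping each phase as a separate step is what makes the thresholds and budget independently tunable, which is the property the abstract emphasizes.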

[92] Service, Solidarity, and Self-Help: A Comparative Topic Modeling Analysis of Community Unionism in the Boot and Shoe Union and Unite Community

Thomas Compton

Main category: cs.CL

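TL;DR: By comparing community unionism in the National Boot and Shoe Union of the 1920s with that of Unite Community in the 2010s-2020s, this paper challenges assumptions about the continuity and universality of community unionism across time and sector.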

Details

Motivation: To understand the continuity and universality of community unionism across different historical and organizational contexts. Method: Using BERTopic for topic modeling with cTF-IDF weighting, alongside word frequency analysis, the study compares the community unionism discourse of the National Boot and Shoe Union in the 1920s with that of Unite Community in the 2010s-2020s. Result: The results reveal significant differences in thematic focus and discursive coherence. Unite Community aligns more strongly with social justice-oriented themes, whereas the National Boot and Shoe Union emphasizes internal administration, industrial relations, and member services. Conclusion: Although both unions engage with community-related themes, their models of engagement differ significantly, which challenges assumptions about the continuity and universality of community unionism across time and sector. Abstract: This paper presents a comparative analysis of community unionism (CU) in two distinct historical and organizational contexts: the National Boot and Shoe Union (B&S) in the 1920s and Unite Community in the 2010s-2020s. Using BERTopic for thematic modeling and cTF-IDF weighting, alongside word frequency analysis, the study examines the extent to which each union's discourse aligns with key features of CU -- such as coalition-building, grassroots engagement, and action beyond the workplace. The results reveal significant differences in thematic focus and discursive coherence. While Unite Community demonstrates stronger alignment with outward-facing, social justice-oriented themes, the B&S corpus emphasizes internal administration, industrial relations, and member services -- reflecting a more traditional, servicing-oriented union model. The analysis also highlights methodological insights, demonstrating how modern NLP techniques can enhance the study of historical labor archives. Ultimately, the findings suggest that while both unions engage with community-related themes, their underlying models of engagement diverge significantly, challenging assumptions about the continuity and universality of community unionism across time and sector.

[93] CAT: Causal Attention Tuning For Injecting Fine-grained Causal Knowledge into Large Language Models

Kairong Han,Wenshuo Zhao,Ziyu Zhao,JunJian Ye,Lujia Pan,Kun Kuang

Main category: cs.CL

TL;DR: The paper introduces Causal Attention Tuning, a method for enhancing Large Language Models' ability to use causal knowledge for more accurate predictions, especially in challenging out-of-distribution cases.

Details

Motivation: The motivation stems from the limitation of current Large Language Models in distinguishing true causal relationships from spurious correlations, which hampers their effectiveness, especially in out-of-distribution situations. Method: The research introduces an automated pipeline called Causal Attention Tuning (CAT) which uses human priors to generate token-level causal signals and employs a Re-Attention mechanism to guide model training towards focusing on causal structures. Result: The experimental results indicate that the proposed Causal Attention Tuning approach successfully utilizes causal knowledge for prediction and maintains robustness in out-of-distribution scenarios. Conclusion: The study concludes that by utilizing the proposed Causal Attention Tuning (CAT) method, Large Language Models can effectively leverage causal knowledge for prediction, enhancing their performance in out-of-distribution scenarios. Abstract: Large Language Models (LLMs) have achieved remarkable success across various domains. However, a fundamental question remains: Can LLMs effectively utilize causal knowledge for prediction and generation? Through empirical studies, we find that LLMs trained directly on large-scale data often capture spurious correlations rather than true causal relationships, leading to suboptimal performance, especially in out-of-distribution (OOD) scenarios. To address this challenge, we propose Causal Attention Tuning (CAT), a novel approach that injects fine-grained causal knowledge into the attention mechanism. We propose an automated pipeline that leverages human priors to automatically generate token-level causal signals and introduce the Re-Attention mechanism to guide training, helping the model focus on causal structures while mitigating noise and biases in attention scores. 
Experimental results on our proposed Spurious Token Game (STG) benchmark and multiple downstream tasks demonstrate that our approach effectively leverages causal knowledge for prediction and remains robust in OOD scenarios. Implementation details can be found at https://github.com/Kairong-Han/CAT.

[94] In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents

Seungkyu Lee,Nalim Kim,Yohan Jo

Main category: cs.CL

TL;DR: This paper introduces an API-graph-based approach to improve tool agents on complex tasks and presents the In-N-Out dataset.

Details

Motivation: Tool agents struggle to identify and call the correct APIs when handling complex tasks; this paper aims to address that problem. Method: API documentation is converted into a structured API graph, and experiments are conducted with the In-N-Out dataset. Result: Using In-N-Out significantly improves performance on tool retrieval and multi-tool query generation, nearly doubling that obtained with documentation alone. Conclusion: API graphs show promise for tool agents, and In-N-Out is a valuable resource. Abstract: Tool agents -- LLM-based systems that interact with external APIs -- offer a way to execute real-world tasks. However, as tasks become increasingly complex, these agents struggle to identify and call the correct APIs in the proper order. To tackle this problem, we investigate converting API documentation into a structured API graph that captures API dependencies and leveraging it for multi-tool queries that require compositional API calls. To support this, we introduce In-N-Out, the first expert-annotated dataset of API graphs built from two real-world API benchmarks and their documentation. Using In-N-Out significantly improves performance on both tool retrieval and multi-tool query generation, nearly doubling that of LLMs using documentation alone. Moreover, graphs generated by models fine-tuned on In-N-Out close 90% of this gap, showing that our dataset helps models learn to comprehend API documentation and parameter relationships. Our findings highlight the promise of using explicit API graphs for tool agents and the utility of In-N-Out as a valuable resource. We will release the dataset and code publicly.

[95] Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief

Zeguan Xiao,Diyang Dou,Boya Xiong,Yun Chen,Guanhua Chen

Main category: cs.CL

TL;DR: EAGLE improves LLMs' calibration by leveraging internal hidden states for self-evaluation, resulting in more accurate confidence scores.

Details

Motivation: LLMs often exhibit overconfidence and generate plausible yet incorrect answers, especially after RLHF, posing challenges for reliable uncertainty estimation and safe deployment. Method: EAGLE extracts internal beliefs from multiple intermediate layers during self-evaluation, aggregates these layer-wise beliefs, and calculates the expectation over the resulting confidence score distribution. Result: Extensive experiments showed that EAGLE significantly improves calibration performance over existing baselines, with in-depth analysis on uncertainty patterns, impact of self-evaluation prompts, and self-evaluation score range. Conclusion: EAGLE is effective in improving calibration performance of LLMs by leveraging internal hidden states for self-evaluation-based calibration. Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, but often exhibit overconfidence and generate plausible yet incorrect answers. This overconfidence, especially in models that have undergone Reinforcement Learning from Human Feedback (RLHF), poses significant challenges for reliable uncertainty estimation and safe deployment. In this paper, we propose EAGLE (Expectation of AGgregated internaL bEief), a novel self-evaluation-based calibration method that leverages the internal hidden states of LLMs to derive more accurate confidence scores. Instead of relying on the model's final output, our approach extracts internal beliefs from multiple intermediate layers during self-evaluation. By aggregating these layer-wise beliefs and calculating the expectation over the resulting confidence score distribution, EAGLE produces a refined confidence score that more faithfully reflects the model's internal certainty. Extensive experiments on diverse datasets and LLMs demonstrate that EAGLE significantly improves calibration performance over existing baselines. 
We also provide an in-depth analysis of EAGLE, including a layer-wise examination of uncertainty patterns, a study of the impact of self-evaluation prompts, and an analysis of the effect of self-evaluation score range.
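
The "aggregate layer-wise beliefs, then take the expectation" step described in the abstract might look roughly like the sketch below. The abstract does not specify the aggregation rule or the score range, so simple averaging across layers, a softmax over per-layer logits for each candidate score, and normalization by the maximum score are all assumptions here, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_confidence(layer_logits, scores):
    """Average per-layer score distributions, then return the normalized expected score.

    layer_logits: one list of logits per layer, one logit per candidate score token.
    scores: the numeric value of each candidate score (e.g. 0..9), assumed here.
    """
    dists = [softmax(logits) for logits in layer_logits]
    n_layers = len(dists)
    # Aggregate by averaging the probability of each score across layers (assumed rule).
    aggregated = [sum(d[i] for d in dists) / n_layers for i in range(len(scores))]
    expectation = sum(p * s for p, s in zip(aggregated, scores))
    return expectation / max(scores)

# Two hypothetical layers that disagree about a 0-2 self-evaluation score.
confidence = expected_confidence([[0.0, 0.0, 3.0], [2.0, 0.0, 0.0]], scores=[0, 1, 2])
```

Using the expectation rather than the single most likely score lets disagreement between layers pull the confidence toward the middle of the range.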

[96] Testing the assumptions about the geometry of sentence embedding spaces: the cosine measure need not apply

Vivi Nastase,Paola Merlo

Main category: cs.CL

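TL;DR: The study shows that the geometry of the sentence embedding space does not predict performance on linguistic tasks, because linguistic information is encoded in weighted combinations of different dimensions.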

Details

Motivation: To determine whether closeness in the sentence embedding space extends to shared properties of sentence representations, and whether those properties predict performance on specific tasks. Method: Sentence embeddings are computed in three ways (averaged token embeddings, the embedding of the special [CLS] token, and the embedding of a random token from the sentence), and the correlation between the distances among these embeddings and their performance on linguistic tasks is examined. Result: Cosine similarity captures only shallow commonalities or differences between sentence embeddings, and these do not predict performance on specific tasks. Conclusion: The geometry of the sentence embedding space does not predict performance on specific tasks; linguistic information is encoded in weighted combinations of different dimensions rather than reflected in the geometry of the space. Abstract: Transformer models learn to encode and decode an input text, and produce contextual token embeddings as a side-effect. The mapping from language into the embedding space maps words expressing similar concepts onto points that are close in the space. In practice, the reverse implication is also assumed: words corresponding to close points in this space are similar or related, those that are further are not. Does closeness in the embedding space extend to shared properties for sentence embeddings? We present an investigation of sentence embeddings and show that the geometry of their embedding space is not predictive of their relative performances on a variety of tasks. We compute sentence embeddings in three ways: as averaged token embeddings, as the embedding of the special [CLS] token, and as the embedding of a random token from the sentence. We explore whether there is a correlation between the distance between sentence embedding variations and their performance on linguistic tasks, and whether despite their distances, they do encode the same information in the same manner. The results show that the cosine similarity -- which treats dimensions shallowly -- captures (shallow) commonalities or differences between sentence embeddings, which are not predictive of their performance on specific tasks. Linguistic information is rather encoded in weighted combinations of different dimensions, which are not reflected in the geometry of the sentence embedding space.
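
A small numeric illustration of the point about cosine similarity versus weighted combinations of dimensions: the vectors and the probe weights below are made up for illustration and do not come from the paper. Two vectors can be almost identical under cosine similarity while a linear probe that weights one dimension separates them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def probe(weights, x):
    """Hypothetical linear probe: sign of a weighted combination of dimensions."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

# Toy "sentence embeddings" that differ only in a small-magnitude last dimension.
emb_a = [1.0, 1.0, 1.0, 0.1]
emb_b = [1.0, 1.0, 1.0, -0.1]

# A probe that reads only the last dimension (assumed to carry the task signal).
task_weights = [0.0, 0.0, 0.0, 1.0]

similarity = cosine(emb_a, emb_b)
label_a = probe(task_weights, emb_a)
label_b = probe(task_weights, emb_b)
```

Here `similarity` is above 0.99, yet the probe assigns the two vectors different labels, which is the kind of mismatch between geometry and task-relevant information the abstract describes.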

[97] Benchmarking the Detection of LLMs-Generated Modern Chinese Poetry

Shanshan Wang,Junchao Wu,Fengying Ye,Jingming Yao,Lidia S. Chao,Derek F. Wong

Main category: cs.CL

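TL;DR: This paper builds a new benchmark for detecting LLM-generated modern Chinese poetry and validates the benchmark's effectiveness and necessity through experiments.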

Details

Motivation: Because of the distinctive characteristics of modern Chinese poetry, AI-generated poems are hard to identify, and they have already seriously disrupted the poetry ecosystem, so effective detection methods are needed. Method: A high-quality dataset of 800 poems written by professional poets and 41,600 poems generated by four mainstream LLMs is constructed, and six detectors are systematically evaluated on it. Result: Experiments show that current detectors cannot reliably detect LLM-generated modern Chinese poetry; the hardest features to detect are intrinsic qualities such as style. Conclusion: The proposed benchmark lays a foundation for future detection of AI-generated poetry and shows the need for more effective detection methods. Abstract: The rapid development of advanced large language models (LLMs) has made AI-generated text indistinguishable from human-written text. Previous work on detecting AI-generated text has made effective progress, but has not involved modern Chinese poetry. Due to the distinctive characteristics of modern Chinese poetry, it is difficult to identify whether a poem originated from humans or AI. The proliferation of AI-generated modern Chinese poetry has significantly disrupted the poetry ecosystem. Based on the urgency of identifying AI-generated poetry in the real Chinese world, this paper proposes a novel benchmark for detecting LLMs-generated modern Chinese poetry. We first construct a high-quality dataset, which includes both 800 poems written by six professional poets and 41,600 poems generated by four mainstream LLMs. Subsequently, we conduct systematic performance assessments of six detectors on this dataset. Experimental results demonstrate that current detectors cannot be used as reliable tools to detect modern Chinese poems generated by LLMs. The most difficult poetic features to detect are intrinsic qualities, especially style. The detection results verify the effectiveness and necessity of our proposed benchmark. Our work lays a foundation for future detection of AI-generated poetry.

[98] TransGAT: Transformer-Based Graph Neural Networks for Multi-Dimensional Automated Essay Scoring

Hind Aljuaid,Areej Alhothali,Ohoud Al-Zamzami,Hussein Assalahi

Main category: cs.CL

TL;DR: The study proposes TransGAT, a new method that combines Transformer models with GNNs to improve automated essay scoring, and reports strong results on analytic scoring.

Details

Motivation: Current AES methods use static word embeddings and holistic scoring, which fail to capture contextual meaning and overlook specific writing aspects. Method: The study combines Transformer models with GNNs and performs analytic scoring through a two-stream prediction mechanism. Result: Experiments on the ELLIPSE dataset show that TransGAT achieves an average Quadratic Weighted Kappa of 0.854 across all analytic scoring dimensions. Conclusion: TransGAT is a promising AES system that can outperform existing models on analytic scoring. Abstract: Essay writing is a critical component of student assessment, yet manual scoring is labor-intensive and inconsistent. Automated Essay Scoring (AES) offers a promising alternative, but current approaches face limitations. Recent studies have incorporated Graph Neural Networks (GNNs) into AES using static word embeddings that fail to capture contextual meaning, especially for polysemous words. Additionally, many methods rely on holistic scoring, overlooking specific writing aspects such as grammar, vocabulary, and cohesion. To address these challenges, this study proposes TransGAT, a novel approach that integrates fine-tuned Transformer models with GNNs for analytic scoring. TransGAT combines the contextual understanding of Transformers with the relational modeling strength of Graph Attention Networks (GAT). It performs two-stream predictions by pairing each fine-tuned Transformer (BERT, RoBERTa, and DeBERTaV3) with a separate GAT. In each pair, the first stream generates essay-level predictions, while the second applies GAT to Transformer token embeddings, with edges constructed from syntactic dependencies. The model then fuses predictions from both streams to produce the final analytic score. Experiments on the ELLIPSE dataset show that TransGAT outperforms baseline models, achieving an average Quadratic Weighted Kappa (QWK) of 0.854 across all analytic scoring dimensions. These findings highlight the potential of TransGAT to advance AES systems.
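
For readers unfamiliar with the reported metric, Quadratic Weighted Kappa can be computed as in the sketch below. It assumes scores have already been mapped to integer classes 0..n_classes-1 (how the paper discretizes ELLIPSE scores is not stated in the abstract), and the function name is illustrative.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK = 1 - sum(w * observed) / sum(w * expected), with w_ij = (i - j)^2 / (N - 1)^2.

    Assumes integer labels in 0..n_classes-1 and at least two distinct classes
    across the two raters (otherwise the denominator is zero).
    """
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count under independence of the two raters' marginals.
            denominator += weight * hist_true[i] * hist_pred[j] / n
    return 1.0 - numerator / denominator

perfect = quadratic_weighted_kappa([0, 1, 2], [0, 1, 2], n_classes=3)
reversed_scores = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], n_classes=3)
```

Perfect agreement yields 1.0 and the fully reversed toy ranking yields about -1.0; the quadratic weights penalize large disagreements more heavily than near misses.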

[99] Parallel Needleman-Wunsch on CUDA to measure word similarity based on phonetic transcriptions

Dominic Plein

Main category: cs.CL

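TL;DR: This paper presents a phonetic similarity analysis method based on the Needleman-Wunsch algorithm and parallel computation (CPU/GPU), suitable for studying the phonetic structure of multiple languages.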

Details

Motivation: Handling large datasets efficiently requires a method that computes word similarity from phonetic transcriptions. Method: The Needleman-Wunsch algorithm is used to compute phonetic similarity between words, with parallel implementations on CPU and GPU. Result: The approach is validated by constructing a fully connected graph and analyzing it with clustering algorithms. Conclusion: The method demonstrates feasibility and effectiveness in analyzing the phonetic structure of languages and may be extended to other languages. Abstract: We present a method to calculate the similarity between words based on their phonetic transcription (their pronunciation) using the Needleman-Wunsch algorithm. We implement this algorithm in Rust and parallelize it on both CPU and GPU to handle large datasets efficiently. The GPU implementation leverages CUDA and the cudarc Rust library to achieve significant performance improvements. We validate our approach by constructing a fully-connected graph where nodes represent words and edges have weights according to the similarity between the words. This graph is then analyzed using clustering algorithms to identify groups of phonetically similar words. Our results demonstrate the feasibility and effectiveness of the proposed method in analyzing the phonetic structure of languages. It might be easily expanded to other languages.
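
The core scoring step can be sketched as the textbook Needleman-Wunsch dynamic program. This is a Python illustration, not the author's Rust/CUDA implementation, and the match, mismatch, and gap values are assumed defaults rather than the paper's scoring scheme.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two sequences of phonetic symbols."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]

# Phonetic transcriptions as strings of IPA symbols (toy examples).
same_word = needleman_wunsch("kæt", "kæt")
one_substitution = needleman_wunsch("kæt", "bæt")
```

Each word pair is scored independently of every other pair, which is what makes the all-pairs similarity computation a natural fit for parallel execution.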

[100] Bridging Thoughts and Words: Graph-Based Intent-Semantic Joint Learning for Fake News Detection

Zhengjia Wang,Qiang Sheng,Danding Wang,Beizhe Hu,Juan Cao

Main category: cs.CL

TL;DR: This paper proposes InSide, a graph-based joint intent-semantic modeling approach for fake news detection, which outperforms current state-of-the-art methods by incorporating intent signals with semantic information.

Details

Motivation: Fake news detection is crucial for online information integrity, but existing approaches focusing solely on semantic clues often fail due to surface detection patterns. This work addresses this by incorporating news intent to better understand the underlying deception. Method: The paper proposes Graph-based Intent-Semantic Joint Modeling (InSide), which reformulates news semantic and intent signals into heterogeneous graph structures and employs a dynamic pathway-based graph alignment strategy for effective message passing and aggregation. Result: Extensive experiments on four benchmark datasets show that InSide outperforms existing state-of-the-art fake news detection methods. Conclusion: InSide demonstrates superiority over state-of-the-art methods in fake news detection by incorporating news intent with semantic signals through graph-based joint learning. Abstract: Fake news detection is an important and challenging task for defending online information integrity. Existing state-of-the-art approaches typically extract news semantic clues, such as writing patterns that include emotional words, stylistic features, etc. However, detectors tuned solely to such semantic clues can easily fall into surface detection patterns, which can shift rapidly in dynamic environments, leading to limited performance in the evolving news landscape. To address this issue, this paper investigates a novel perspective by incorporating news intent into fake news detection, bridging intents and semantics together. The core insight is that by considering news intents, one can deeply understand the inherent thoughts behind news deception, rather than the surface patterns within words alone. To achieve this goal, we propose Graph-based Intent-Semantic Joint Modeling (InSide) for fake news detection, which models deception clues from both semantic and intent signals via graph-based joint learning. 
Specifically, InSide reformulates news semantic and intent signals into heterogeneous graph structures, enabling long-range context interaction through entity guidance and capturing both holistic and implementation-level intent via coarse-to-fine intent modeling. To achieve better alignment between semantics and intents, we further develop a dynamic pathway-based graph alignment strategy for effective message passing and aggregation across these signals by establishing a common space. Extensive experiments on four benchmark datasets demonstrate the superiority of the proposed InSide compared to state-of-the-art methods.

Abdelkrime Aries

Main category: cs.CL

TL;DR: This paper introduces chDzDT, a character-level pre-trained language model tailored for Algerian morphology, which addresses the lack of representation of the Algerian dialect in NLP due to its complex morphology, frequent code-switching, multiple scripts, and lexical influences from other languages. The model is trained on isolated words, allowing it to encode morphological patterns robustly without relying on token boundaries or standardized orthography, and contributes to the development of more inclusive and adaptable NLP systems.

Details

Motivation: Pre-trained language models (PLMs) have substantially advanced natural language processing by providing context-sensitive text representations. However, the Algerian dialect remains under-represented, with few dedicated models available. Processing this dialect is challenging due to its complex morphology, frequent code-switching, multiple scripts, and strong lexical influences from other languages. These characteristics complicate tokenization and reduce the effectiveness of conventional word- or subword-level approaches. Method: We introduce chDzDT, a character-level pre-trained language model tailored for Algerian morphology. Unlike conventional PLMs that rely on token sequences, chDzDT is trained on isolated words. This design allows the model to encode morphological patterns robustly, without depending on token boundaries or standardized orthography. Result: Our contributions are threefold: (i) a detailed morphological analysis of Algerian dialect using YouTube comments; (ii) the construction of a multilingual Algerian lexicon dataset; and (iii) the development and extensive evaluation of a character-level PLM as a morphology-focused encoder for downstream tasks. Conclusion: The proposed approach demonstrates the potential of character-level modeling for morphologically rich, low-resource dialects and lays a foundation for more inclusive and adaptable NLP systems. Abstract: Pre-trained language models (PLMs) have substantially advanced natural language processing by providing context-sensitive text representations. However, the Algerian dialect remains under-represented, with few dedicated models available. Processing this dialect is challenging due to its complex morphology, frequent code-switching, multiple scripts, and strong lexical influences from other languages. These characteristics complicate tokenization and reduce the effectiveness of conventional word- or subword-level approaches. 
To address this gap, we introduce chDzDT, a character-level pre-trained language model tailored for Algerian morphology. Unlike conventional PLMs that rely on token sequences, chDzDT is trained on isolated words. This design allows the model to encode morphological patterns robustly, without depending on token boundaries or standardized orthography. The training corpus draws from diverse sources, including YouTube comments, French, English, and Berber Wikipedia, as well as the Tatoeba project. It covers multiple scripts and linguistic varieties, resulting in a substantial pre-training workload. Our contributions are threefold: (i) a detailed morphological analysis of Algerian dialect using YouTube comments; (ii) the construction of a multilingual Algerian lexicon dataset; and (iii) the development and extensive evaluation of a character-level PLM as a morphology-focused encoder for downstream tasks. The proposed approach demonstrates the potential of character-level modeling for morphologically rich, low-resource dialects and lays a foundation for more inclusive and adaptable NLP systems.

[102] Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

Andong Hua,Kenan Tang,Chenhe Gu,Jindong Gu,Eric Wong,Yao Qin

Main category: cs.CL

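TL;DR: Prompt sensitivity stems mainly from limitations of evaluation methods rather than from flaws in large language models themselves.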

Details

Motivation: To re-examine whether prompt sensitivity is truly an inherent weakness of large language models or an artifact of the evaluation process. Method: Seven LLMs (e.g., the GPT and Gemini families) are systematically evaluated on six benchmarks, covering multiple-choice and open-ended tasks with 12 different prompt templates. Result: Prompt sensitivity is found to stem largely from heuristic evaluation methods, such as log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses phrased differently. With LLM-as-a-Judge evaluation, performance variance drops substantially and model rankings remain highly consistent across prompts. Conclusion: Modern LLMs are more robust to prompt templates than previously believed; prompt sensitivity is more an artifact of evaluation than a flaw in the models. Abstract: Prompt sensitivity, referring to the phenomenon where paraphrasing (i.e., repeating something written or spoken using different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (e.g., GPT and Gemini family) across 6 benchmarks, including both multiple-choice and open-ended tasks on 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-a-Judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.

[103] Mic Drop or Data Flop? Evaluating the Fitness for Purpose of AI Voice Interviewers for Data Collection within Quantitative & Qualitative Research Contexts

Shreyas Tirumala,Nishant Jain,Danny D. Leybzon,Trent D. Buskirk

Main category: cs.CL

TL;DR: AI interviewers are more capable than IVR systems for data collection, but their current limitations make them more suitable for certain contexts, especially quantitative over qualitative research.

Details

Motivation: Transformer-based LLMs enable AI interviewers to conduct voice-based surveys in real-time, creating a need to assess their suitability for quantitative and qualitative research contexts. Method: The paper reviews emerging evidence and evaluates the capabilities of AI interviewing systems and IVR systems across input/output performance and verbal reasoning dimensions. Result: AI interviewers outperform IVR systems in data collection, but issues such as transcription errors, limited emotion detection, and inconsistent follow-up quality remain challenges. Conclusion: AI interviewing systems show promise but have limitations in real-time transcription, emotion detection, and follow-up quality, making their effectiveness context-dependent, especially for qualitative data collection. Abstract: Transformer-based Large Language Models (LLMs) have paved the way for "AI interviewers" that can administer voice-based surveys with respondents in real-time. This position paper reviews emerging evidence to understand when such AI interviewing systems are fit for purpose for collecting data within quantitative and qualitative research contexts. We evaluate the capabilities of AI interviewers as well as current Interactive Voice Response (IVR) systems across two dimensions: input/output performance (i.e., speech recognition, answer recording, emotion handling) and verbal reasoning (i.e., ability to probe, clarify, and handle branching logic). Field studies suggest that AI interviewers already exceed IVR capabilities for both quantitative and qualitative data collection, but real-time transcription error rates, limited emotion detection abilities, and uneven follow-up quality indicate that the utility, use and adoption of current AI interviewer technology may be context-dependent for qualitative data collection efforts.

[104] Extracting OPQRST in Electronic Health Records using Large Language Models with Reasoning

Zhimeng Luo,Abhibha Gupta,Adam Frisch,Daqing He

Main category: cs.CL

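TL;DR: The paper recasts OPQRST assessment extraction as a text generation problem and introduces semantic-similarity-based evaluation, proposing a new LLM-based approach to information extraction from electronic health records that improves accuracy and clinical usability.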

Details

Motivation: Extracting key patient information from electronic health records is challenging because the data are complex and unstructured; traditional machine learning approaches struggle to capture key details, which limits their clinical usefulness. Method: Reframes the traditional sequence labeling task as text generation, using LLMs to produce reasoning steps that mimic a physician's cognitive process; modifies traditional Named Entity Recognition (NER) metrics by adding semantic similarity measures (such as BERT Score) to assess how well generated text matches the clinical intent of the original records. Result: The method improves interpretability and adapts to limited labeled data, and the revised metrics assess how well generated text aligns with clinical intent; the authors present this as a significant advance for AI in clinical information extraction. Conclusion: The paper proposes a new method for extracting OPQRST assessment information from electronic health records (EHRs) with large language models (LLMs); by reframing the task and revising the evaluation metrics, it improves the accuracy and usability of information extraction and offers better support for clinical decision-making. Abstract: The extraction of critical patient information from Electronic Health Records (EHRs) poses significant challenges due to the complexity and unstructured nature of the data. Traditional machine learning approaches often fail to capture pertinent details efficiently, making it difficult for clinicians to utilize these tools effectively in patient care. This paper introduces a novel approach to extracting the OPQRST assessment from EHRs by leveraging the capabilities of Large Language Models (LLMs). We propose to reframe the task from sequence labeling to text generation, enabling the models to provide reasoning steps that mimic a physician's cognitive processes. This approach enhances interpretability and adapts to the limited availability of labeled data in healthcare settings. Furthermore, we address the challenge of evaluating the accuracy of machine-generated text in clinical contexts by proposing a modification to traditional Named Entity Recognition (NER) metrics. This includes the integration of semantic similarity measures, such as the BERT Score, to assess the alignment between generated text and the clinical intent of the original records. Our contributions demonstrate a significant advancement in the use of AI in healthcare, offering a scalable solution that improves the accuracy and usability of information extraction from EHRs, thereby aiding clinicians in making more informed decisions and enhancing patient care outcomes.

[105] Weakly Supervised Medical Entity Extraction and Linking for Chief Complaints

Zhimeng Luo,Zhendong Wang,Rui Meng,Diyang Xue,Adam Frisch,Daqing He

Main category: cs.CL

TL;DR: This study introduces a weakly supervised approach to automatically process chief complaint data, improving entity extraction and linking without requiring manual annotations.

Details

Motivation: Chief complaint records vary widely in format and notation, making standardization and text mining challenging. An automated solution is needed to handle this issue efficiently. Method: A split-and-match algorithm was used to generate weak annotations, followed by training a BERT-based model to extract and link entities in chief complaint texts. Result: The Weakly Supervised Entity Extraction and Linking method achieved superior performance in entity extraction and linking compared to existing methods. Conclusion: The proposed weakly supervised method effectively extracts and links entities in chief complaints without human annotation, outperforming previous methods. Abstract: A chief complaint (CC) is the reason for the medical visit as stated in the patient's own words. It helps medical professionals to quickly understand a patient's situation, and also serves as a short summary for medical text mining. However, chief complaint records are often entered in a variety of ways, resulting in a wide variation of medical notations, which makes it difficult to standardize across different medical institutions for record keeping or text mining. In this study, we propose a weakly supervised method to automatically extract and link entities in chief complaints in the absence of human annotation. We first adopt a split-and-match algorithm to produce weak annotations, including entity mention spans and class labels, on 1.2 million real-world de-identified and IRB-approved chief complaint records. Then we train a BERT-based model with generated weak labels to locate entity mentions in chief complaint text and link them to a pre-defined ontology. We conducted extensive experiments, and the results showed that our Weakly Supervised Entity Extraction and Linking method produced superior performance over previous methods without any human annotation.
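
The abstract does not spell out the split-and-match algorithm, so the following is only a minimal sketch of the general idea under stated assumptions: the record is split on common delimiters and each segment is looked up in a dictionary. The `ONTOLOGY` entries, concept ids, class labels, and delimiter set are all hypothetical.

```python
import re

# Hypothetical toy ontology: surface form -> (concept id, class label).
ONTOLOGY = {
    "chest pain": ("C001", "SYMPTOM"),
    "sob": ("C002", "SYMPTOM"),
    "fever": ("C003", "SYMPTOM"),
}

def split_and_match(text):
    """Return weak annotations as (start, end, class label, concept id) tuples."""
    annotations = []
    # Split step: segments are maximal runs without a delimiter (assumed: , ; /).
    for m in re.finditer(r"[^,;/]+", text):
        segment = m.group()
        stripped = segment.strip()
        # Match step: exact dictionary lookup on the lower-cased segment.
        entry = ONTOLOGY.get(stripped.lower())
        if entry is None:
            continue
        # Character offset of the mention, skipping leading whitespace.
        start = m.start() + (len(segment) - len(segment.lstrip()))
        concept_id, label = entry
        annotations.append((start, start + len(stripped), label, concept_id))
    return annotations

weak_labels = split_and_match("Chest pain, SOB; cough")
```

Spans and labels produced this way could then serve as noisy training targets for a token-classification model, which is the role the weak annotations play in the summary above.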

[106] DRAssist: Dispute Resolution Assistance using Large Language Models

Sachin Pawar,Manoj Apte,Girish K. Palshikar,Basit Ali,Nitin Ramrakhiyani

Main category: cs.CL

TL;DR: This paper explores the use of large language models (LLMs) as assistants for human judges in resolving disputes, focusing on two domains: automobile insurance and domain name disputes.

Details

Motivation: Disputes occur in many domains and are generally resolved by human judges in specific forums. The paper aims to explore how LLMs can assist human judges in this process. Method: The paper proposes a system called DRAssist that uses LLMs to assist in resolving disputes in two domains: automobile insurance and domain name disputes. DRAssist identifies key structural elements of disputes and summarizes unstructured dispute descriptions. Multiple prompting strategies are explored with multiple LLMs to evaluate their ability to assist in dispute resolution. Result: The paper evaluates the performance of LLMs in identifying the stronger party in a dispute, deciding whether specific demands can be accepted, and evaluating the strength of arguments, comparing them to baselines using suitable evaluation metrics. Conclusion: LLMs can be used as assistants for human judges in resolving disputes, and they can produce resolution outputs at different levels of granularity. Abstract: Disputes between two parties occur in almost all domains such as taxation, insurance, banking, healthcare, etc. The disputes are generally resolved in a specific forum (e.g., consumer court) where facts are presented, points of disagreement are discussed, arguments as well as specific demands of the parties are heard, and finally a human judge resolves the dispute by often favouring one of the two parties. In this paper, we explore the use of large language models (LLMs) as assistants for the human judge to resolve such disputes, as part of our DRAssist system. We focus on disputes from two specific domains -- automobile insurance and domain name disputes. DRAssist identifies certain key structural elements (e.g., facts, aspects of disagreement, arguments) of the disputes and summarizes the unstructured dispute descriptions to produce a structured summary for each dispute. We then explore multiple prompting strategies with multiple LLMs for their ability to assist in resolving the disputes in these domains. In DRAssist, these LLMs are prompted to produce the resolution output at three different levels -- (i) identifying an overall stronger party in a dispute, (ii) deciding whether each specific demand of each contesting party can be accepted or not, (iii) evaluating whether each argument by each contesting party is strong or weak. We evaluate the performance of LLMs on all these tasks by comparing them with relevant baselines using suitable evaluation metrics.

[107] StructCoh: Structured Contrastive Learning for Context-Aware Text Semantic Matching

Chao Xue,Ziyuan Gao

Main category: cs.CL

TL;DR: StructCoh improves text semantic matching by combining structural reasoning with contrastive learning, outperforming existing methods on legal and academic benchmarks.

Details

Motivation: Pre-trained language models struggle with hierarchical structural patterns and fine-grained semantic distinctions, which are crucial for tasks like legal document matching and academic plagiarism detection. Method: StructCoh uses a dual-graph encoder and a hierarchical contrastive objective. The dual-graph encoder builds semantic graphs using dependency parsing and topic modeling, and propagates structural features using graph isomorphism networks. The hierarchical contrastive objective ensures consistency at multiple granularities through node-level and graph-aware contrastive learning. Result: StructCoh significantly outperforms state-of-the-art methods on three legal document matching benchmarks and academic plagiarism detection datasets, achieving an 86.7% F1-score (+6.2% absolute gain) on legal statute matching. Conclusion: StructCoh provides significant improvements in text semantic matching by effectively identifying structural patterns and subtle semantic distinctions, achieving notable results such as an 86.7% F1-score on legal statute matching. Abstract: Text semantic matching requires nuanced understanding of both structural relationships and fine-grained semantic distinctions. While pre-trained language models excel at capturing token-level interactions, they often overlook hierarchical structural patterns and struggle with subtle semantic discrimination. In this paper, we propose StructCoh, a graph-enhanced contrastive learning framework that synergistically combines structural reasoning with representation space optimization. Our approach features two key innovations: (1) A dual-graph encoder constructs semantic graphs via dependency parsing and topic modeling, then employs graph isomorphism networks to propagate structural features across syntactic dependencies and cross-document concept nodes. (2) A hierarchical contrastive objective enforces consistency at multiple granularities: node-level contrastive regularization preserves core semantic units, while graph-aware contrastive learning aligns inter-document structural semantics through both explicit and implicit negative sampling strategies. Experiments on three legal document matching benchmarks and academic plagiarism detection datasets demonstrate significant improvements over state-of-the-art methods. Notably, StructCoh achieves 86.7% F1-score (+6.2% absolute gain) on legal statute matching by effectively identifying argument structure similarities.

[108] DeepSeek performs better than other Large Language Models in Dental Cases

Hexian Zhang,Xinyu Yan,Yanqi Yang,Lijian Jin,Ping Yang,Junwen Wang

Main category: cs.CL

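TL;DR: The study evaluates four state-of-the-art large language models on analyzing longitudinal dental cases; DeepSeek outperforms the others and is recommended as an adjunct tool in medical education and research.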

Details

Motivation: Large language models (LLMs) hold transformative potential in healthcare, but their ability to interpret longitudinal patient narratives remains underexplored. Dentistry, with its rich structured clinical data, offers a unique opportunity to rigorously assess LLMs' reasoning abilities. Method: Four state-of-the-art LLMs (GPT-4o, Gemini 2.0 Flash, Copilot, and DeepSeek V3) were evaluated on their ability to analyze longitudinal dental cases, using 34 standardized longitudinal periodontal cases comprising 258 question-answer pairs. Result: DeepSeek was the top performer, showing higher faithfulness (median score = 0.528 vs. 0.367-0.457) and higher expert ratings (median = 4.5/5 vs. 4.0/5) without significantly compromising readability. Conclusion: DeepSeek is identified as the leading LLM for case analysis; the authors recommend integrating it as an adjunct tool in medical education and research and highlight its potential as a domain-specific agent. Abstract: Large language models (LLMs) hold transformative potential in healthcare, yet their capacity to interpret longitudinal patient narratives remains inadequately explored. Dentistry, with its rich repository of structured clinical data, presents a unique opportunity to rigorously assess LLMs' reasoning abilities. While several commercial LLMs already exist, DeepSeek, a model that gained significant attention earlier this year, has also joined the competition. This study evaluated four state-of-the-art LLMs (GPT-4o, Gemini 2.0 Flash, Copilot, and DeepSeek V3) on their ability to analyze longitudinal dental case vignettes through open-ended clinical tasks. Using 34 standardized longitudinal periodontal cases (comprising 258 question-answer pairs), we assessed model performance via automated metrics and blinded evaluations by licensed dentists. DeepSeek emerged as the top performer, demonstrating superior faithfulness (median score = 0.528 vs. 0.367-0.457) and higher expert ratings (median = 4.5/5 vs. 4.0/5), without significantly compromising readability. Our study positions DeepSeek as the leading LLM for case analysis, endorses its integration as an adjunct tool in both medical education and research, and highlights its potential as a domain-specific agent.

[109] NADI 2025: The First Multidialectal Arabic Speech Processing Shared Task

Bashar Talafha,Hawau Olamide Toyin,Peter Sullivan,AbdelRahim Elmadany,Abdurrahman Juma,Amirbek Djanibekov,Chiyu Zhang,Hamad Alshehhi,Hanan Aldarmaki,Mustafa Jarrar,Nizar Habash,Muhammad Abdul-Mageed

Main category: cs.CL

TL;DR: The sixth Nuanced Arabic Dialect Identification (NADI 2025) Shared Task evaluated Arabic speech dialect processing across three subtasks, involving 44 teams and yielding top results that highlight ongoing challenges in the field.

Details

Motivation: The motivation of the study was to advance the understanding and processing of Arabic speech dialects through competitive evaluation across multiple subtasks. Method: The study involved 44 teams with 100 valid submissions across three subtasks: spoken dialect identification, speech recognition, and diacritic restoration. The performance of the systems was evaluated based on accuracy for Subtask 1 and WER/CER for Subtasks 2 and 3. Result: Top-performing systems achieved 79.8% accuracy on Subtask 1 (spoken dialect identification), 35.68/12.20 WER/CER on Subtask 2 (speech recognition), and 55/13 WER/CER on Subtask 3 (diacritic restoration). Conclusion: The sixth NADI Shared Task highlighted the ongoing challenges in Arabic dialect speech processing, particularly in spoken dialect identification, speech recognition, and diacritic restoration, with top systems achieving varying levels of success across the three subtasks. Abstract: We present the findings of the sixth Nuanced Arabic Dialect Identification (NADI 2025) Shared Task, which focused on Arabic speech dialect processing across three subtasks: spoken dialect identification (Subtask 1), speech recognition (Subtask 2), and diacritic restoration for spoken dialects (Subtask 3). A total of 44 teams registered, and during the testing phase, 100 valid submissions were received from eight unique teams. The distribution was as follows: 34 submissions for Subtask 1 (five teams), 47 submissions for Subtask 2 (six teams), and 19 submissions for Subtask 3 (two teams). The best-performing systems achieved 79.8% accuracy on Subtask 1, 35.68/12.20 WER/CER (overall average) on Subtask 2, and 55/13 WER/CER on Subtask 3. These results highlight the ongoing challenges of Arabic dialect speech processing, particularly in dialect identification, recognition, and diacritic restoration. We also summarize the methods adopted by participating teams and briefly outline directions for future editions of NADI.
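
For readers unfamiliar with the WER/CER figures reported above, the following sketch shows the standard definitions: edit distance between reference and hypothesis, divided by reference length, expressed as a percentage. It is not the shared task's scoring script; any text normalisation the organisers applied is not described in this summary, and the whitespace tokenisation here is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference item
                curr[j - 1] + 1,         # insertion of a hypothesis item
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate in percent, using whitespace tokenisation (assumed)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate in percent, counting every character including spaces."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Because one wrong character makes a whole word wrong, WER is usually well above CER on the same output, which matches the pattern in the reported pairs.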

[110] Attributes as Textual Genes: Leveraging LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation

Guangzeng Han,Weisi Liu,Xiaolei Huang

Main category: cs.CL

TL;DR: Genetic Prompt is a novel framework that uses genetic algorithms with LLMs to improve the quality and diversity of synthetic data for NLP applications.

Details

Motivation: Ensuring the quality and diversity of synthetic data generated by Large Language Models (LLMs) remains challenging. Method: Genetic Prompt combines genetic algorithms with LLMs to augment synthetic data generation by treating semantic text attributes as gene sequences and leveraging the LLM to simulate crossover and mutation operations. Result: Genetic Prompt significantly outperforms state-of-the-art baselines, shows robust performance across various generator model sizes and scales, and fusing the synthetic data with the original training set boosts downstream model performance, especially for class-imbalanced scenarios. Conclusion: Genetic Prompt is an effective method for producing high-quality synthetic data for a wide range of NLP applications. Abstract: Large Language Models (LLMs) excel at generating synthetic data, but ensuring its quality and diversity remains challenging. We propose Genetic Prompt, a novel framework that combines genetic algorithms with LLMs to augment synthetic data generation. Our approach treats semantic text attributes as gene sequences and leverages the LLM to simulate crossover and mutation operations. This genetic process enhances data quality and diversity by creating novel attribute combinations, yielding synthetic distributions closer to real-world data. To optimize parent selection, we also integrate an active learning scheme that expands the offspring search space. Our experiments on multiple NLP tasks reveal several key findings: Genetic Prompt not only significantly outperforms state-of-the-art baselines but also shows robust performance across various generator model sizes and scales. Moreover, we demonstrate that fusing our synthetic data with the original training set significantly boosts downstream model performance, particularly for class-imbalanced scenarios. Our findings validate that Genetic Prompt is an effective method for producing high-quality synthetic data for a wide range of NLP applications.
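
The gene analogy above can be made concrete with a small sketch. In the paper the LLM itself simulates crossover and mutation over semantic text attributes; the code below swaps in plain random operations to show only the data flow. The attribute names, value pools, and the final prompt step are hypothetical.

```python
import random

def crossover(parent_a, parent_b, rng):
    # Each attribute ("gene") of the child is inherited from one of the two parents.
    return {key: rng.choice([parent_a[key], parent_b[key]]) for key in parent_a}

def mutate(genes, pool, rate, rng):
    # With probability `rate`, resample an attribute from its pool of allowed values.
    return {
        key: rng.choice(pool[key]) if rng.random() < rate else value
        for key, value in genes.items()
    }

# Hypothetical attribute pools for a review-generation task.
pool = {
    "length": ["short", "long"],
    "tone": ["formal", "casual"],
    "topic": ["battery life", "screen", "price"],
}
parent_a = {"length": "short", "tone": "formal", "topic": "screen"}
parent_b = {"length": "long", "tone": "casual", "topic": "price"}

rng = random.Random(0)
child = mutate(crossover(parent_a, parent_b, rng), pool, rate=0.2, rng=rng)
# In a full pipeline, `child` would be rendered into a generation prompt for the LLM.
```

New attribute combinations that appear in neither parent are what the abstract credits for the added diversity.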

[111] How Instruction-Tuning Imparts Length Control: A Cross-Lingual Mechanistic Analysis

Elisabetta Rocchetti,Alfio Ferrara

Main category: cs.CL

TL;DR: This study explores how instruction-tuning improves length control in LLMs, revealing that deeper model layers are specialized for this task, with variations in component contributions across English and Italian.

Details

Motivation: The challenge of adhering to explicit length constraints in text generation by Large Language Models (LLMs) motivates the investigation into differences between foundation models and their instruction-tuned counterparts. Method: The research employs Cumulative Weighted Attribution, derived from Direct Logit Attribution, to analyze the contributions of internal model components in length-controlled text generation for both English and Italian. Result: Instruction-tuning is found to significantly enhance length control, with attention heads in later layers of models showing increased positive contributions in English, while in Italian, the final-layer MLPs compensate with stronger positive roles. Conclusion: The study concludes that instruction-tuning enhances length control in text generation by reconfiguring deeper layers of models, with component-level strategies varying based on linguistic context. Abstract: Adhering to explicit length constraints, such as generating text with a precise word count, remains a significant challenge for Large Language Models (LLMs). This study investigates the differences between foundation models and their instruction-tuned (IT) counterparts on length-controlled text generation in English and Italian. We analyze both performance and internal component contributions using Cumulative Weighted Attribution, a metric derived from Direct Logit Attribution. Our findings reveal that instruction-tuning substantially improves length control, primarily by specializing components in deeper model layers. Specifically, attention heads in later layers of IT models show increasingly positive contributions, particularly in English. In Italian, while attention contributions are more attenuated, final-layer MLPs exhibit a stronger positive role, suggesting a compensatory mechanism. These results indicate that instruction-tuning reconfigures later layers for task adherence, with component-level strategies potentially adapting to linguistic context.

[112] Better by Comparison: Retrieval-Augmented Contrastive Reasoning for Automatic Prompt Optimization

Juhyeon Lee,Wonduk Seo,Hyunjin An,Seunghyun Lee,Yi Bu

Main category: cs.CL

TL;DR: This paper introduces Contrastive Reasoning Prompt Optimization (CRPO), a novel framework that uses retrieval augmented reasoning to optimize prompts for Large Language Models. CRPO constructs two paradigms: tiered contrastive reasoning and multi-metric contrastive reasoning, enabling the model to understand why certain prompts succeed while others fail. The results show CRPO significantly outperforms baselines on the HelpSteer2 benchmark.

Details

Motivation: The authors aim to leverage LLMs' inherent reasoning capability to learn from contrasting examples, which most prior work overlooks. Method: CRPO constructs two complementary optimization paradigms: (1) tiered contrastive reasoning and (2) multi-metric contrastive reasoning, both based on a retrieval-augmented reasoning process. Result: Experimental results on the HelpSteer2 benchmark demonstrate that CRPO significantly outperforms baselines. Conclusion: CRPO enables the model to deduce why certain prompts succeed while others fail, achieving more robust and interpretable optimization, and highlights the promise of contrastive, retrieval-augmented reasoning for advancing automatic prompt optimization. Abstract: Automatic prompt optimization has recently emerged as a strategy for improving the quality of prompts used in Large Language Models (LLMs), with the goal of generating more accurate and useful responses. However, most prior work focuses on direct prompt refinement or model fine-tuning, overlooking the potential of leveraging LLMs' inherent reasoning capability to learn from contrasting examples. In this paper, we present Contrastive Reasoning Prompt Optimization (CRPO), a novel framework that formulates prompt optimization as a retrieval-augmented reasoning process. Our approach retrieves top-k reference prompts from the HelpSteer2 dataset, an open-source collection annotated for helpfulness, correctness, coherence, complexity, and verbosity, and constructs two complementary optimization paradigms: (1) tiered contrastive reasoning, where the LLM compares high, medium, and low quality prompts to refine its own generation through reflective reasoning, and (2) multi-metric contrastive reasoning, where the LLM analyzes the best prompts along each evaluation dimension and integrates their strengths into an optimized prompt. By explicitly contrasting high and low quality exemplars, CRPO enables the model to deduce why certain prompts succeed while others fail, thereby achieving more robust and interpretable optimization. Experimental results on the HelpSteer2 benchmark demonstrate that CRPO significantly outperforms baselines. Our findings highlight the promise of contrastive, retrieval-augmented reasoning for advancing automatic prompt optimization.

[113] JudgeAgent: Dynamically Evaluate LLMs with Agent-as-Interviewer

Zhichao Shi,Xuhui Jiang,Chengjin Xu,Cangli Yao,Zhenxin Huang,Shengjie Ma,Yinghan Shen,Yuanzhuo Wang

Main category: cs.CL

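TL;DR: The paper proposes JudgeAgent, a knowledge-target adaptive dynamic evaluation framework, to address the limitations of the current LLM evaluation paradigm.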

Details

Motivation: The current LLM evaluation paradigm suffers from limited interaction with the target model, insufficient difficulty control, and difficulty in verifying evaluation results, making it hard to precisely determine the boundaries of a model's knowledge and capabilities. Method: Proposes JudgeAgent, a knowledge-target adaptive dynamic evaluation framework based on a new interviewer-style evaluation paradigm, which combines benchmark grading, interactive extension, and evaluation feedback into a comprehensive evaluation. Result: Introduces a new insight into validating evaluation methods and demonstrates the effectiveness of JudgeAgent and its dynamic evaluation paradigm through extensive experiments. Conclusion: Through knowledge-driven data synthesis and target-adaptive difficulty adjustment, JudgeAgent provides an accurate and effective way to evaluate LLMs and demonstrates the effectiveness of its dynamic evaluation paradigm. Abstract: Evaluating the capabilities of large language models (LLMs) is an essential step to ensure the successful application of LLMs across various domains. The current evaluation of LLMs is based on a paradigm that involves querying them with predefined question sets and assessing their outputs. This paradigm offers controllable processes and simplicity, but faces challenges such as limited interaction with targets, insufficient difficulty control, and difficulties in verifying the validity of evaluation results, making it hard to precisely determine the knowledge and capability boundaries of target models. To address these challenges, we propose JudgeAgent, a knowledge-target adaptive dynamic evaluation framework based on a new interviewer-style evaluation paradigm. JudgeAgent employs a comprehensive evaluation approach consisting of benchmark grading, interactive extension, and evaluation feedback. It utilizes knowledge-driven data synthesis and target-adaptive difficulty adjustment methods to conduct extended testing, providing accurate and effective evaluation results. We also introduce a novel insight into validating evaluation methods, demonstrating the effectiveness of JudgeAgent and its dynamic evaluation paradigm through extensive experiments.

[114] CMRAG: Co-modality-based document retrieval and visual question answering

Wang Chen,Guanqiang Qi,Weikang Li,Yang Li

Main category: cs.CL

TL;DR: The paper proposes CMRAG, a new RAG method that combines text and images and performs strongly on visual document question answering.

Details

Motivation: Existing methods are limited when handling multimodal documents: they either use only text information or ignore the semantic advantages of text. Method: Proposes co-modality-based RAG (CMRAG), which combines text and images for efficient retrieval and generation. Result: Experiments show that the proposed method significantly outperforms pure-vision-based RAG on visual document question answering tasks. Conclusion: Integrating co-modality information into the RAG framework is an effective way to improve the performance of visual question answering systems for complex documents. Abstract: Retrieval-Augmented Generation (RAG) has become a core paradigm in document question answering tasks. However, existing methods have limitations when dealing with multimodal documents: one category of methods relies on layout analysis and text extraction, which can only utilize explicit text information and struggle to capture images or unstructured content; the other category treats document segmentation as visual input and directly passes it to visual language models (VLMs) for processing, yet it ignores the semantic advantages of text, leading to suboptimal generation results. This paper proposes co-modality-based RAG (CMRAG), which can simultaneously leverage text and images for efficient retrieval and generation. Specifically, we first perform structured parsing on documents to obtain co-modality representations of text segments and image regions. Subsequently, in response to user queries, we retrieve candidate evidence from text and image channels, respectively, and aggregate the results at the cross-modal retrieval level. Finally, we prompt the VLM to generate the final response based on the co-modality retrieval results. Experiments demonstrate that our method significantly outperforms pure-vision-based RAG in visual document question answering tasks. The findings of this paper show that integrating co-modality information into the RAG framework in a unified manner is an effective approach to improving the performance of complex document visual question-answering (VQA) systems.

[115] AMBEDKAR-A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models

Snehasis Mukhopadhyay,Aryan Kasat,Shivam Dubey,Rahul Karthikeyan,Dhruv Sood,Vinija Jain,Aman Chadha,Amitava Das

Main category: cs.CL

TL;DR: The paper introduces AMBEDKAR, a framework inspired by Dr B. R. Ambedkar, to reduce casteist and communal bias in LLM outputs by using a Constitution-Aware Decoding Layer and speculative decoding algorithm.

Details

Motivation: The motivation is to address societal biases in LLMs, particularly around caste and religion in the Indian context, with locally relevant mitigation strategies. Method: The framework uses a Small Language Model as a generator and a constitutionally guided Large Language Model as a verifier to enforce bias-robust trajectories in outputs. Result: The approach yields an absolute reduction of bias up to 26.41 percent compared to baseline. Conclusion: The proposed AMBEDKAR framework effectively reduces casteist and communal bias in LLM outputs by utilizing a Constitution-Aware Decoding Layer and speculative decoding algorithm. Abstract: Large Language Models (LLMs) can inadvertently reflect societal biases present in their training data, leading to harmful or prejudiced outputs. In the Indian context, our empirical evaluations across a suite of models reveal that biases around caste and religion are particularly salient. Yet, most existing mitigation strategies are Western-centric and fail to address these local nuances. We propose AMBEDKAR, a framework inspired by the egalitarian vision of Dr B. R. Ambedkar, architect of the Indian Constitution, to guide LLM outputs toward fairness, neutrality, and inclusion in line with Articles 14 to 17. Our approach introduces a Constitution-Aware Decoding Layer, guided by the AI Constitution of India and applied only at inference time, without any parameter updates to the base model. We incorporate a speculative decoding algorithm that proactively reduces casteist and communal bias during generation. This mitigation layer operates directly within the decoding process, avoiding changes to model internals and lowering the computational and infrastructural costs associated with retraining. We reinterpret speculative decoding not merely as an efficiency tool but as a mechanism for fairness. In this framework, a Small Language Model (SLM) acts as a potentially biased generator, while a constitutionally guided Large Language Model (LLM) serves as the verifier. Rather than accelerating generation, the LLM enforces bias-robust trajectories in the SLM outputs. This inversion of roles gives rise to a fairness-by-speculation paradigm. Our approach yields an absolute reduction of bias up to 26.41 percent compared to baseline. Our source code, datasets, and results are available at https://anonymous.4open.science/r/AMBEDKAR-983B/

[116] Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages

David Demitri Africa,Suchir Salhan,Yuval Weiss,Paula Buttery,Richard Diehl Martinez

Main category: cs.CL

TL;DR: MAML-based pretraining boosts zero-shot NER performance on low-resource languages, especially for small models and entity types tied to surface features.

Details

Motivation: NER in low-resource languages often relies on large multilingual LMs, which is infeasible in constrained settings; small decoder LMs could offer a lightweight alternative if pretrained effectively. Method: Replaced part of the autoregressive objective with MAML during pretraining; evaluated zero-shot transfer performance on Tagalog and Cebuano NER tasks under different tuning settings. Result: MAML improved zero-shot micro-F1 by 2-6 pp under head-only tuning and 1-3 pp after full tuning, while reducing convergence time by up to 8%; improvements were most pronounced for person entities linked to specific linguistic markers. Conclusion: MAML-enhanced pretraining improves zero-shot NER performance for low-resource languages, particularly for small models and entities linked to surface markers like Tagalog case particles. Abstract: Named-entity recognition (NER) in low-resource languages is usually tackled by finetuning very large multilingual LMs, an option that is often infeasible in memory- or latency-constrained settings. We ask whether small decoder LMs can be pretrained so that they adapt quickly and transfer zero-shot to languages unseen during pretraining. To this end we replace part of the autoregressive objective with first-order model-agnostic meta-learning (MAML). Tagalog and Cebuano are typologically similar yet structurally different in their actor/non-actor voice systems, and hence serve as a challenging test-bed. Across four model sizes (11 M - 570 M) MAML lifts zero-shot micro-F1 by 2-6 pp under head-only tuning and 1-3 pp after full tuning, while cutting convergence time by up to 8%. Gains are largest for single-token person entities that co-occur with Tagalog case particles si/ni, highlighting the importance of surface anchors.
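
First-order MAML, mentioned in the abstract above, can be shown on a toy problem. This is a minimal sketch under strong assumptions and has nothing to do with the paper's language-model setup: each "task" is a one-dimensional quadratic loss with its own optimum, the same loss is reused for the inner and outer step, and all hyperparameters are illustrative.

```python
def grad(theta, centre):
    # Gradient of the toy task loss (theta - centre) ** 2.
    return 2.0 * (theta - centre)

def fomaml(task_centres, theta=0.0, inner_lr=0.1, outer_lr=0.1, steps=100):
    """First-order MAML on scalar quadratic tasks (toy illustration)."""
    for _ in range(steps):
        outer_grad = 0.0
        for centre in task_centres:
            # Inner loop: one gradient step of task-specific adaptation.
            adapted = theta - inner_lr * grad(theta, centre)
            # First-order approximation: use the gradient evaluated at the
            # adapted parameters and ignore second-order terms.
            outer_grad += grad(adapted, centre)
        # Outer loop: move the shared initialisation using the averaged gradient.
        theta -= outer_lr * outer_grad / len(task_centres)
    return theta

meta_init = fomaml([1.0, 3.0])
```

On this symmetric toy problem the learned initialisation settles midway between the two task optima, the point from which a single inner step helps both tasks equally.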

[117] Avoidance Decoding for Diverse Multi-Branch Story Generation

Kyeongman Park,Nakyeong Yang,Kyomin Jung

Main category: cs.CL

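TL;DR: Proposes Avoidance Decoding, a decoding strategy that increases the creative diversity of large language models in story generation by penalizing similarity to previous outputs, which encourages more diverse story branches.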

Details

Motivation: Large language models (LLMs) often produce repetitive and monotonous outputs in tasks such as story generation, because creative diversity is limited when the same input prompt is given. Method: Proposes a new decoding strategy, Avoidance Decoding, which modifies token logits by penalizing similarity to previously generated outputs, thereby encouraging more diverse story branches. The method prioritizes a concept-level similarity penalty in early stages and increasingly emphasizes a narrative-level similarity penalty later. Result: Compared with strong baselines, the method achieves up to 2.6 times higher output diversity and reduces repetition by 30% on average, while effectively mitigating text degeneration. Conclusion: The study finds that the proposed method activates a broader range of neurons, suggesting that it leverages the model's intrinsic creativity. Abstract: Large Language Models (LLMs) often generate repetitive and monotonous outputs, especially in tasks like story generation, due to limited creative diversity when given the same input prompt. To address this challenge, we propose a novel decoding strategy, Avoidance Decoding, that modifies token logits by penalizing similarity to previously generated outputs, thereby encouraging more diverse multi-branch stories. This penalty adaptively balances two similarity measures: (1) Concept-level Similarity Penalty, which is prioritized in early stages to diversify initial story concepts, and (2) Narrative-level Similarity Penalty, which is increasingly emphasized later to ensure natural yet diverse plot development. Notably, our method achieves up to 2.6 times higher output diversity and reduces repetition by an average of 30% compared to strong baselines, while effectively mitigating text degeneration. Furthermore, we reveal that our method activates a broader range of neurons, demonstrating that it leverages the model's intrinsic creativity.
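The logit penalty described in the abstract can be sketched roughly as follows. The abstract does not say how the two similarity scores are computed or how the balance between them is scheduled, so the per-token similarity inputs, the linear schedule, the `strength` parameter, and the function name `avoidance_logits` are all assumptions made for illustration.

```python
def avoidance_logits(logits, concept_sim, narrative_sim,
                     step, total_steps, strength=2.0):
    """Subtract a similarity penalty from each candidate token's logit.

    concept_sim[i] and narrative_sim[i] are hypothetical scores in [0, 1]
    measuring how similar choosing token i would make the story to
    previously generated branches.
    """
    # Assumed schedule: weight moves linearly from the concept-level penalty
    # (early decoding steps) to the narrative-level penalty (late steps).
    narrative_weight = step / max(total_steps - 1, 1)
    concept_weight = 1.0 - narrative_weight
    return [
        logit - strength * (concept_weight * c + narrative_weight * n)
        for logit, c, n in zip(logits, concept_sim, narrative_sim)
    ]


logits = [1.0, 1.0, 1.0]
concept_sim = [1.0, 0.0, 0.0]    # token 0 repeats an earlier concept
narrative_sim = [0.0, 1.0, 0.0]  # token 1 repeats an earlier plot turn

print(avoidance_logits(logits, concept_sim, narrative_sim, step=0, total_steps=10))
print(avoidance_logits(logits, concept_sim, narrative_sim, step=9, total_steps=10))
```

Early in decoding only the concept-repeating token is pushed down; by the last step the penalty falls on the token that repeats an earlier plot development.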

[118] FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain

Anum Afzal,Juraj Vladika,Florian Matthes

Main category: cs.CL

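TL;DR: The study presents FActBench, a comprehensive fact-checking benchmark for the medical domain, and finds that unanimous voting across two fact-checking techniques correlates best with domain expert evaluation.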

Details

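Motivation: Large language models tend to struggle in specialized domains, and factuality is the most critical aspect of evaluation, so reliable fact-checking tools and data sources are needed to mitigate hallucination. Method: The study builds FActBench, a comprehensive fact-checking benchmark covering four generation tasks and six state-of-the-art large language models (LLMs) for the medical domain. It uses two state-of-the-art fact-checking techniques: Chain-of-Thought (CoT) prompting and Natural Language Inference (NLI). Result: Fact-checking scores obtained through unanimous voting of the two techniques (CoT and NLI) correlate best with domain expert evaluation. Conclusion: FActBench is a comprehensive medical-domain fact-checking benchmark; combining two state-of-the-art fact-checking techniques through unanimous voting yields the scores that correlate best with domain expert evaluation. Abstract: Large Language Models tend to struggle when dealing with specialized domains. While all aspects of evaluation hold importance, factuality is the most critical one. Similarly, reliable fact-checking tools and data sources are essential for hallucination mitigation. We address these issues by providing a comprehensive Fact-checking Benchmark FActBench covering four generation tasks and six state-of-the-art Large Language Models (LLMs) for the Medical domain. We use two state-of-the-art Fact-checking techniques: Chain-of-Thought (CoT) Prompting and Natural Language Inference (NLI). Our experiments show that the fact-checking scores acquired through the Unanimous Voting of both techniques correlate best with Domain Expert Evaluation.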
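One plausible reading of the unanimous voting step is sketched below: a fact counts as supported only when the CoT check and the NLI check both accept it. The abstract does not define the aggregation, so this logical-AND interpretation, the boolean verdict format, and the names `unanimous_vote` and `factuality_score` are assumptions.

```python
def unanimous_vote(cot_verdicts, nli_verdicts):
    """Mark a fact as supported only if both checkers marked it supported."""
    return [c and n for c, n in zip(cot_verdicts, nli_verdicts)]


def factuality_score(verdicts):
    """Fraction of facts judged supported (0.0 for an empty list)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Hypothetical per-fact verdicts for one generated text.
cot = [True, True, False, True]
nli = [True, False, False, True]

combined = unanimous_vote(cot, nli)
print(combined, factuality_score(combined))
```

Under this reading the combined score can never exceed either individual score, which makes it the strictest of the three.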

[119] Towards Fundamental Language Models: Does Linguistic Competence Scale with Model Size?

Jaime Collado-Montañez,L. Alfonso Ureña-López,Arturo Montejo-Ráez

Main category: cs.CL

TL;DR: This paper proposes the FLM paradigm, which uses smaller models for language competence and external tools for factual knowledge, offering a more efficient and sustainable approach to NLP.

Details

Motivation: The motivation is to address limitations of large language models such as hallucinations, biases, privacy concerns, and high computational costs by proposing a new paradigm that separates linguistic competence from factual memorization. Method: The authors evaluated models ranging from 135M to 32B parameters across three dimensions: linguistic competence, external factual knowledge, and internal factual knowledge. Result: The results show that internal factual knowledge grows faster with model size compared to linguistic competence, indicating that larger models are more about memorization than language ability. Conclusion: The paper concludes that the FLM paradigm offers a more efficient and sustainable approach to NLP by using smaller models for linguistic competence and external tools for factual knowledge. Abstract: Large Language Models offer impressive language capabilities but suffer from well-known limitations, including hallucinations, biases, privacy concerns, and high computational costs. These issues are largely driven by the combination of linguistic competence and factual memorization within a single monolithic model. This paper introduces and empirically supports the Fundamental Language Model (FLM) paradigm, which advocates for smaller, linguistically competent models that offload factual retrieval to external tools. We evaluate models ranging from 135M to 32B parameters across three dimensions: linguistic competence, external factual knowledge, and internal factual knowledge. Our findings reveal that while both linguistic competence and factual knowledge improve with scale, internal factual knowledge grows significantly faster, suggesting that model size is more closely tied to memorization than to core language ability. These results support a modular approach to language modeling, where compact, linguistically proficient models serve as the foundation for tool-augmented systems. 
The FLM paradigm offers a path toward more efficient, interpretable, and sustainable NLP solutions.

[120] LLMs and their Limited Theory of Mind: Evaluating Mental State Annotations in Situated Dialogue

Katharine Kowalyshyn,Matthias Scheutz

Main category: cs.CL

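TL;DR: The paper introduces a new framework that uses large language models to analyze team dialogue and detect discrepancies in individual understanding, finding that although LLMs perform well on simple tasks, they make systematic errors on tasks requiring spatial reasoning or interpretation of prosody.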

Details

Motivation: The study explores whether large language models can not only infer human mindsets but also expose blind spots in team dialogue, such as discrepancies in team members' joint understanding. Method: The paper proposes a two-step framework that uses LLMs as human-style annotators to track a team's shared mental models, and detects discrepancies among individuals' mental states by comparing against human annotations and gold-standard labels. Result: The outputs include a dataset of human and LLM annotations, a reproducible evaluation framework for shared mental model coherence, and an empirical assessment of LLM-based discrepancy detection. Conclusion: Although LLMs show apparent coherence on natural-language annotation tasks, they err systematically in cases requiring spatial reasoning or disambiguation of prosodic cues. Abstract: What if large language models could not only infer human mindsets but also expose every blind spot in team dialogue such as discrepancies in the team members' joint understanding? We present a novel, two-step framework that leverages large language models (LLMs) both as human-style annotators of team dialogues to track the team's shared mental models (SMMs) and as automated discrepancy detectors among individuals' mental states. In the first step, an LLM generates annotations by identifying SMM elements within task-oriented dialogues from the Cooperative Remote Search Task (CReST) corpus. Then, a secondary LLM compares these LLM-derived annotations and human annotations against gold-standard labels to detect and characterize divergences. We define an SMM coherence evaluation framework for this use case and apply it to six CReST dialogues, ultimately producing: (1) a dataset of human and LLM annotations; (2) a reproducible evaluation framework for SMM coherence; and (3) an empirical assessment of LLM-based discrepancy detection. Our results reveal that, although LLMs exhibit apparent coherence on straightforward natural-language annotation tasks, they systematically err in scenarios requiring spatial reasoning or disambiguation of prosodic cues.

[121] DCPO: Dynamic Clipping Policy Optimization

Shihui Yang,Chengfeng Dou,Peidong Guo,Kai Lu,Qiang Ju,Fei Deng,Rihui Xin

Main category: cs.CL

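TL;DR: Proposes DCPO, which addresses the zero-gradient problem in existing reinforcement learning methods through dynamically adjusted clipping bounds and a reward standardization technique, improving the training efficiency and performance of large language models.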

Details

Motivation: Existing methods such as GRPO suffer from zero gradients because of fixed clipping bounds and standard reward standardization, which weakens gradient updates and the utilization of generated responses. Method: Introduces a dynamic clipping strategy and a smooth advantage standardization technique to enhance token-level exploration and response-level effective utilization of generated responses. Result: DCPO achieves state-of-the-art performance on four benchmarks across different models, including surpassing DAPO and GRPO on AIME24 and AIME25, with notable gains in nonzero advantage, training efficiency, and token clipping ratio. Conclusion: By dynamically adjusting clipping bounds and standardizing advantages, DCPO effectively addresses the zero-gradient problem of existing methods and improves the efficiency and performance of reinforcement learning for large language models. Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily due to fixed clipping bounds for token-level probability ratios and the standardization of identical rewards, which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces a dynamic clipping strategy that adaptively adjusts the clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level effective utilization of generated responses. DCPO achieved state-of-the-art performance on four benchmarks based on four different models. In particular, DCPO achieved an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32 times sampling on the AIME24 benchmark, surpassing both DAPO (36.7/31.6) and GRPO (36.7/32.1) on the Qwen2.5-Math-7B model. On the AIME25 benchmark based on Qwen2.5-14B, DCPO achieves a performance of (23.3/19.0), surpassing GRPO (13.3/10.5) and DAPO (20.0/15.3). Furthermore, DCPO achieved an average 28% improvement in the nonzero advantage over GRPO in four models, doubled the training efficiency over DAPO, and significantly reduced the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance.
These results highlight DCPO's effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
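The zero-gradient issue attributed to "the standardization of identical rewards" can be illustrated with a small sketch of group-wise reward standardization. This shows only the failure case, not DCPO's smooth advantage standardization; the function name `standardize_rewards` and the `eps` value are assumptions.

```python
import statistics


def standardize_rewards(rewards, eps=1e-6):
    """Group-wise standardization: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# All sampled responses to a prompt got the same reward (e.g. all correct):
# every advantage is zero, so this group contributes no gradient signal.
print(standardize_rewards([1.0, 1.0, 1.0, 1.0]))

# Mixed rewards give nonzero advantages of opposite sign.
print(standardize_rewards([1.0, 0.0, 1.0, 0.0]))
```

When every response in a group is equally rewarded, the whole group is effectively discarded, which is the underutilization of generated responses the abstract refers to.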

[122] Implicit Reasoning in Large Language Models: A Comprehensive Survey

Jindong Li,Yali Fu,Li Fan,Jiahong Liu,Yao Shu,Chengwei Qin,Menglin Yang,Irwin King,Rex Ying

Main category: cs.CL

TL;DR: This paper explores implicit reasoning in LLMs, introducing a taxonomy focused on execution paradigms such as latent optimization, signal-guided control, and layer-recurrent execution. It reviews evidence for implicit reasoning and evaluates methods for assessing its effectiveness.

Details

Motivation: The motivation is to shift focus from explicit chain-of-thought prompting to implicit reasoning in LLMs, addressing the lack of a detailed examination of internal reasoning mechanisms. Method: The paper introduces a taxonomy based on execution paradigms, categorizing methods into latent optimization, signal-guided control, and layer-recurrent execution. It also reviews evidence supporting implicit reasoning and evaluates metrics and benchmarks used in assessing its effectiveness. Result: The paper provides a taxonomy of implicit reasoning execution paradigms, presents evidence supporting implicit reasoning in LLMs, and offers a review of evaluation metrics and benchmarks. Conclusion: The paper concludes that implicit reasoning in LLMs is effective and internally aligned, with evidence supporting its structural, behavioral, and representational aspects. It highlights the importance of execution paradigms over representational forms. Abstract: Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Reasoning with LLMs is central to solving multi-step problems and complex decision-making. To support efficient reasoning, recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning, where reasoning occurs silently via latent structures without emitting intermediate textual steps. Implicit reasoning brings advantages such as lower generation cost, faster inference, and better alignment with internal computation. Although prior surveys have discussed latent representations in the context of reasoning, a dedicated and mechanism-level examination of how reasoning unfolds internally within LLMs remains absent. This survey fills that gap by introducing a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies. 
We organize existing methods into three execution paradigms based on how and where internal computation unfolds: latent optimization, signal-guided control, and layer-recurrent execution. We also review structural, behavioral and representation-based evidence that supports the presence of implicit reasoning in LLMs. We further provide a structured overview of the evaluation metrics and benchmarks used in existing works to assess the effectiveness and reliability of implicit reasoning. We maintain a continuously updated project at: https://github.com/digailab/awesome-llm-implicit-reasoning.

[123] Towards Temporal Knowledge-Base Creation for Fine-Grained Opinion Analysis with Language Models

Gaurav Negi,Atul Kr. Ojha,Omnia Zayed,Paul Buitelaar

Main category: cs.CL

TL;DR: This paper proposes a method to build a temporal opinion knowledge base using LLMs to enable structured opinion extraction for applications like forecasting and trend analysis.

Details

Motivation: The motivation stems from the underexploitation of time-series opinion analysis due to the lack of fine-grained temporal annotations, limiting the potential for forecasting and trend analysis. Method: The approach integrates opinion mining formulations into a declarative LLM annotation pipeline, utilizing three data models based on sentiment and opinion mining literature for structured representation and conducting evaluations with two LLMs. Result: A scalable method for creating a structured, time-aligned opinion knowledge base was developed, with rigorous evaluation showing its effectiveness, as evidenced by inter-annotator agreement and compatibility with downstream applications. Conclusion: The proposed method successfully constructs a temporal opinion knowledge base using LLMs, enabling structured opinion extraction and proving beneficial for various temporal applications. Abstract: We propose a scalable method for constructing a temporal opinion knowledge base with large language models (LLMs) as automated annotators. Despite the demonstrated utility of time-series opinion analysis of text for downstream applications such as forecasting and trend analysis, existing methodologies underexploit this potential due to the absence of temporally grounded fine-grained annotations. Our approach addresses this gap by integrating well-established opinion mining formulations into a declarative LLM annotation pipeline, enabling structured opinion extraction without manual prompt engineering. We define three data models grounded in sentiment and opinion mining literature, serving as schemas for structured representation. We perform rigorous quantitative evaluation of our pipeline using human-annotated test samples. We carry out the final annotations using two separate LLMs, and inter-annotator agreement is computed label-wise across the fine-grained opinion dimensions, analogous to human annotation protocols. 
The resulting knowledge base encapsulates time-aligned, structured opinions and is compatible with applications in Retrieval-Augmented Generation (RAG), temporal question answering, and timeline summarisation.

[124] An Ensemble Classification Approach in A Multi-Layered Large Language Model Framework for Disease Prediction

Ali Hamdi,Malak Mohamed,Rokaia Emad,Khaled Shaban

Main category: cs.CL

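TL;DR: This study is the first to combine LLM-based preprocessing with fine-tuned Arabic Transformer models and ensemble learning for disease classification on Arabic social telehealth data, reaching 80.56% accuracy.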

Details

Motivation: Social telehealth generates large volumes of Arabic medical text, and using this data for disease classification has significant practical value. Large language models (LLMs) and Transformer models handle complex medical text well, but effective solutions for Arabic social telehealth data are still lacking. Method: Evaluates three preprocessing methods for Arabic medical text (summarization, refinement, and Named Entity Recognition, NER) and applies fine-tuned Arabic Transformer models (CAMeLBERT, AraBERT, and AsafayaBERT). A majority voting ensemble is added to improve robustness. Result: The model combining the preprocessing methods with the ensemble strategy reaches a classification accuracy of 80.56%, demonstrating its effectiveness in improving medical text understanding. Conclusion: By combining LLM-based preprocessing, fine-tuned Arabic Transformer models, and ensemble learning, the study improves disease classification on Arabic social telehealth data, with a best accuracy of 80.56%. Abstract: Social telehealth has made remarkable progress in healthcare by allowing patients to post symptoms and participate in medical consultations remotely. Users frequently post symptoms on social media and online health platforms, creating a huge repository of medical data that can be leveraged for disease classification. Large language models (LLMs) such as LLAMA3 and GPT-3.5, along with transformer-based models like BERT, have demonstrated strong capabilities in processing complex medical text. In this study, we evaluate three Arabic medical text preprocessing methods, namely summarization, refinement, and Named Entity Recognition (NER), before applying fine-tuned Arabic transformer models (CAMeLBERT, AraBERT, and AsafayaBERT). To enhance robustness, we adopt a majority voting ensemble that combines predictions from original and preprocessed text representations. This approach achieved the best classification accuracy of 80.56%, thus showing its effectiveness in leveraging various text representations and model predictions to improve the understanding of medical texts. To the best of our knowledge, this is the first work that integrates LLM-based preprocessing with fine-tuned Arabic transformer models and ensemble learning for disease classification in Arabic social telehealth data.
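A majority voting ensemble of the kind described can be sketched as below. The disease labels, the three prediction sources, and the tie-breaking rule (first label encountered wins) are illustrative assumptions rather than details taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-sample labels from several classifiers by majority vote.

    predictions: list of label lists, one list per model or text representation.
    Ties go to the label that appears first among the voters.
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical predictions for three posts, from classifiers run on the
# original text, a summarized version, and an NER-filtered version.
original = ["flu", "cold", "flu"]
summarized = ["flu", "flu", "asthma"]
ner_only = ["cold", "flu", "asthma"]

print(majority_vote([original, summarized, ner_only]))
```

Using an odd number of voters avoids most ties in the binary case, though with many classes a tie-breaking rule is still needed.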

[125] EmoPerso: Enhancing Personality Detection with Self-Supervised Emotion-Aware Modelling

Lingzhi Shen,Xiaohao Cai,Yunfei Long,Imran Razzak,Guanming Chen,Shoaib Jameel

Main category: cs.CL

TL;DR: The paper introduces EmoPerso, a novel self-supervised framework that improves personality detection from text by modeling interactions between emotion and personality, outperforming existing methods.

Details

Motivation: The motivation is to overcome the limitations of existing personality detection methods that rely on large annotated datasets and treat emotion and personality as independent variables, overlooking their interactions. Method: The paper proposes a self-supervised framework called EmoPerso. It uses generative mechanisms for data augmentation and representation learning, extracts pseudo-labeled emotion features, and employs multi-task learning and a cross-attention module to model interactions between emotion and personality. A self-taught strategy is also used to refine relational reasoning. Result: EmoPerso outperforms state-of-the-art models on two benchmark datasets, demonstrating the effectiveness of emotion-aware modeling in personality detection. Conclusion: The paper concludes that EmoPerso, a self-supervised framework incorporating emotion-aware modeling, surpasses state-of-the-art models in personality detection from text. Abstract: Personality detection from text is commonly performed by analysing users' social media posts. However, existing methods heavily rely on large-scale annotated datasets, making it challenging to obtain high-quality personality labels. Moreover, most studies treat emotion and personality as independent variables, overlooking their interactions. In this paper, we propose a novel self-supervised framework, EmoPerso, which improves personality detection through emotion-aware modelling. EmoPerso first leverages generative mechanisms for synthetic data augmentation and rich representation learning. It then extracts pseudo-labeled emotion features and jointly optimizes them with personality prediction via multi-task learning. A cross-attention module is employed to capture fine-grained interactions between personality traits and the inferred emotional representations. To further refine relational reasoning, EmoPerso adopts a self-taught strategy to enhance the model's reasoning capabilities iteratively. 
Extensive experiments on two benchmark datasets demonstrate that EmoPerso surpasses state-of-the-art models. The source code is available at https://github.com/slz0925/EmoPerso.

[126] Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions

Seyedali Mohammadi,Bhaskara Hanuma Vedula,Hemank Lamba,Edward Raff,Ponnurangam Kumaraguru,Francis Ferraro,Manas Gaur

Main category: cs.CL

TL;DR: LLMs often rely on internal knowledge, but explicit definitions improve performance in domain-specific tasks.

Details

Motivation: The research aims to understand whether LLMs genuinely incorporate external definitions or primarily rely on their parametric knowledge. Method: The research employed controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped definitions. Result: Results show that explicit label definitions can improve accuracy and explainability in LLMs, but the models do not consistently integrate these definitions into their task-solving processes. Internal representations are often prioritized, particularly in general tasks, while domain-specific tasks benefit more from external definitions. Conclusion: The study concludes that while explicit label definitions can enhance accuracy and explainability in LLMs, their integration into task-solving processes is not guaranteed or consistent. LLMs often rely on internal representations, especially in general tasks, but domain-specific tasks benefit more from explicit definitions. Abstract: Do LLMs genuinely incorporate external definitions, or do they primarily rely on their parametric knowledge? To address these questions, we conduct controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped definitions. Our results reveal that while explicit label definitions can enhance accuracy and explainability, their integration into an LLM's task-solving processes is neither guaranteed nor consistent, suggesting reliance on internalized representations in many cases. Models often default to their internal representations, particularly in general tasks, whereas domain-specific tasks benefit more from explicit definitions. These findings underscore the need for a deeper understanding of how LLMs process external knowledge alongside their pre-existing capabilities.

[127] SpecEval: Evaluating Model Adherence to Behavior Specifications

Ahmed Ahmed,Kevin Klyman,Yi Zeng,Sanmi Koyejo,Percy Liang

Main category: cs.CL

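TL;DR: The study develops an automated framework for auditing whether foundation models comply with the behavior specifications published by their developers, and finds compliance gaps of up to 20%.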

Details

Motivation: Although companies such as OpenAI, Anthropic, and Google have published safety constraints and behavior specifications that their models should follow, it is unclear whether the models actually adhere to them. Method: The researchers build an automated framework that audits models by parsing behavioral statements, generating targeted prompts, and using models to judge adherence. Result: An analysis of 16 models from six developers across more than 100 behavioral statements finds systematic inconsistencies, including compliance gaps across providers. Conclusion: Foundation models are systematically inconsistent in following the behavior specifications published by their developers, with compliance gaps of up to 20%. Abstract: Companies that develop foundation models publish behavioral guidelines they pledge their models will follow, but it remains unclear if models actually do so. While providers such as OpenAI, Anthropic, and Google have published detailed specifications describing both desired safety constraints and qualitative traits for their models, there has been no systematic audit of adherence to these guidelines. We introduce an automated framework that audits models against their providers' specifications by parsing behavioral statements, generating targeted prompts, and using models to judge adherence. Our central focus is on three-way consistency between a provider's specification, its model outputs, and its own models as judges; an extension of prior two-way generator-validator consistency. This establishes a necessary baseline: at minimum, a foundation model should consistently satisfy the developer's behavioral specifications when judged by the developer's evaluator models. We apply our framework to 16 models from six developers across more than 100 behavioral statements, finding systematic inconsistencies including compliance gaps of up to 20 percent across providers.

[128] GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning

Chenglong Wang,Yongyu Mu,Hang Zhou,Yifu Huo,Ziming Zhu,Jiali Zeng,Murun Yang,Bei Li,Tong Xiao,Xiaoyang Hao,Chunliang Zhang,Fandong Meng,Jingbo Zhu

Main category: cs.CL

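TL;DR: Proposes GRAM-R$^2$, a generative reward model self-trained on unlabeled data that produces both preference labels and reward rationales and performs strongly across multiple tasks.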

Details

Motivation: Despite significant progress in reward modeling, reliance on large-scale labeled data persists, and unsupervised pre-training is needed to improve the reasoning ability of reward models. Method: Proposes GRAM-R$^2$, a self-training generative reward model that uses unlabeled data to produce preference labels and reward rationales. Result: GRAM-R$^2$ performs strongly on response ranking, task adaptation, and reinforcement learning from human feedback, outperforming several strong discriminative and generative baselines. Conclusion: GRAM-R$^2$ makes effective use of unlabeled data to improve reward-model reasoning, performs well across tasks, and can serve as a foundation model for reward reasoning. Abstract: Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.

[129] MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds

Junxi Wu,Jinpeng Wang,Zheng Liu,Bin Chen,Dongjian Hu,Hao Wu,Shu-Tao Xiu

Main category: cs.CL

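TL;DR: Proposes the MoSEs framework, which improves AI-generated text detection through stylistics-aware modeling and dynamic threshold estimation, with especially large gains in low-resource settings.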

Details

Motivation: Existing AI-generated text detection methods neglect stylistic modeling and mostly rely on static thresholds, which limits detection performance. Method: Proposes the Mixture of Stylistic Experts (MoSEs) framework, with three core components: the Stylistics Reference Repository (SRR), the Stylistics-Aware Router (SAR), and the Conditional Threshold Estimator (CTE). Conditional threshold estimation enables stylistics-aware uncertainty quantification. Result: MoSEs improves detection performance by 11.34% on average, and by 39.15% in the low-resource case. Conclusion: MoSEs performs well at detecting AI-generated text, particularly in low-resource settings, with significant gains over baselines. Abstract: The rapid advancement of large language models has intensified public concerns about the potential misuse. Therefore, it is important to build trustworthy AI-generated text detection systems. Existing methods neglect stylistic modeling and mostly rely on static thresholds, which greatly limits the detection performance. In this paper, we propose the Mixture of Stylistic Experts (MoSEs) framework that enables stylistics-aware uncertainty quantification through conditional threshold estimation. MoSEs contains three core components, namely, the Stylistics Reference Repository (SRR), the Stylistics-Aware Router (SAR), and the Conditional Threshold Estimator (CTE). For input text, SAR can activate the appropriate reference data in SRR and provide them to CTE. Subsequently, CTE jointly models the linguistic statistical properties and semantic features to dynamically determine the optimal threshold. With a discrimination score, MoSEs yields prediction labels with the corresponding confidence level. Our framework achieves an average improvement of 11.34% in detection performance compared to baselines. More inspiringly, MoSEs shows a more evident improvement of 39.15% in the low-resource case. Our code is available at https://github.com/creator-xi/MoSEs.

[130] L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages

Nishant Tanksale,Tanmay Kokate,Darshan Gohad,Sarvadnyaa Barate,Raviraj Joshi

Main category: cs.CL

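TL;DR: The paper introduces L3Cube-IndicHeadline-ID, a new dataset for evaluating semantic understanding in low-resource Indian languages, covering ten languages with four headline variants per article, to support evaluation and application of multilingual and language-specific models.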

Details

Motivation: Because high-quality benchmarks are lacking, the effectiveness of sentence embedding models on Indic languages is underexplored, so a semantic evaluation dataset for low-resource languages is needed. Method: Builds the L3Cube-IndicHeadline-ID dataset, which pairs news articles in ten low-resource Indic languages and English with four headline variants, and tests sentence embedding models using cosine similarity. Result: Multilingual models perform consistently, while language-specific models vary in effectiveness; the dataset is shown to be useful for evaluating Retrieval-Augmented Generation (RAG) and other tasks. Conclusion: L3Cube-IndicHeadline-ID is a new dataset for evaluating semantic understanding in low-resource Indic languages and helps improve multilingual and language-specific models in RAG pipelines and other tasks. Abstract: Semantic evaluation in low-resource languages remains a major challenge in NLP. While sentence transformers have shown strong performance in high-resource settings, their effectiveness in Indic languages is underexplored due to a lack of high-quality benchmarks. To bridge this gap, we introduce L3Cube-IndicHeadline-ID, a curated headline identification dataset spanning ten low-resource Indic languages (Marathi, Hindi, Tamil, Gujarati, Odia, Kannada, Malayalam, Punjabi, Telugu, and Bengali), plus English. Each language includes 20,000 news articles paired with four headline variants: the original, a semantically similar version, a lexically similar version, and an unrelated one, designed to test fine-grained semantic understanding. The task requires selecting the correct headline from the options using article-headline similarity. We benchmark several sentence transformers, including multilingual and language-specific models, using cosine similarity. Results show that multilingual models consistently perform well, while language-specific models vary in effectiveness. Given the rising use of similarity models in Retrieval-Augmented Generation (RAG) pipelines, this dataset also serves as a valuable resource for evaluating and improving semantic understanding in such applications. Additionally, the dataset can be repurposed for multiple-choice question answering, headline classification, or other task-specific evaluations of LLMs, making it a versatile benchmark for Indic NLP. The dataset is shared publicly at https://github.com/l3cube-pune/indic-nlp
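The headline identification task, as described, reduces to picking the candidate whose embedding is most similar to the article's. The sketch below assumes embeddings are already available as plain lists of floats; the toy vectors and the names `cosine` and `pick_headline` are hypothetical, and no particular sentence-transformer library is implied.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_headline(article_vec, headline_vecs):
    """Return the index of the headline most similar to the article."""
    scores = [cosine(article_vec, h) for h in headline_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings: one article and four candidate headlines
# (original, semantically similar, lexically similar, unrelated).
article = [1.0, 0.2, 0.0]
candidates = [
    [0.9, 0.3, 0.1],
    [0.5, 0.5, 0.5],
    [0.0, 1.0, 0.0],
    [-1.0, 0.0, 0.2],
]

print(pick_headline(article, candidates))
```

Accuracy on the benchmark would then be the fraction of articles for which the selected index matches the original headline.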

[131] The Forgotten Code: Validating a Century-Old Translation System with AI

Jean-Marie Le Ray

Main category: cs.CL

TL;DR: Modern AI validates and extends Federico Pucci's early 20th-century rule-based translation system, proving its accuracy and placing him among the pioneers of machine translation.

Details

Motivation: To validate and revive Pucci's early rule-based mechanical translation system using modern AI, demonstrating its relevance and accuracy nearly a century later. Method: The methodology involves having AIs retranslate texts using Pucci's original method from 1931, comparing translations done by Pucci and AIs, and then reproducing excerpts in multiple languages following Pucci's method. Result: The AI-generated translations showed a low average difference compared to Pucci's original translations, proving the consistency and feasibility of Pucci's method in modern applications. Conclusion: Pucci's historical status is affirmed, and he is recognized as a precursor and intellectual contributor to machine translation, meriting examination alongside notable figures in the field. Abstract: A pioneering rule-based mechanical translation system (precursor of modern RBMTs) was first presented in December 1929 by its inventor, Federico Pucci, who later published the full method in a book titled "Il traduttore meccanico ed il metodo per corrispondersi fra Europei conoscendo ciascuno solo la propria lingua: Parte I", in Salerno (Italy), in 1931. This study illustrates how AI breathes new life into the system of international keys and ideograms devised by Pucci to translate from/into any Romance language (at least as a first step). The methodology involves having the AIs retranslate, following Pucci's method, the two text excerpts originally translated in 1931 and clearly documented in his publication: a passage from Dante's La Vita Nuova, translated from Italian into French, and a passage from Voltaire's Zadig, translated from French into Italian. The result is notable: the two texts, translated 94 years apart using the same method--by Pucci in 1931 and by AIs in 2025--show a low average difference, with only minor variations observed. 
With Pucci's system thus validated, it became feasible to have the AIs reproduce the excerpts in English, Spanish, and German according to his method. The results were consistent, and Pucci--via Artificial Intelligence--was tasked with translating more modern and technical texts, thereby reviving, nearly a century later, an invention that had remained almost entirely unknown and never applied beyond its creator, now brought to wider attention and opened to possible experimentation. Such a demonstration would not only affirm Pucci's historical status but also place him among the precursors and intellectual contributors to machine translation, whose work merits examination alongside figures such as Troyanskij, Booth, and Weaver, with possible consequences for how the history of the field is understood.

[132] Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation

Erfan Baghaei Potraghloo,Seyedarmin Azizi,Souvik Kundu,Massoud Pedram

Main category: cs.CL

TL;DR: This paper introduces top-H decoding, a novel method for open-ended text generation that effectively balances creativity and coherence, outperforming existing techniques like min-p sampling.

Details

Motivation: Existing sampling methods fail to effectively incorporate model confidence, limiting their ability to balance creativity and coherence in text generation. Method: The paper proposes top-H decoding, a greedy algorithm addressing an entropy-constrained mass maximization problem derived from theoretical analysis. Result: Top-H outperforms state-of-the-art methods by up to 25.63% on creative writing benchmarks while maintaining robustness on question-answering tasks. Conclusion: Top-H decoding improves the balance between creativity and coherence in text generation, outperforming existing methods like min-p sampling. Abstract: Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-$p$ (nucleus) sampling, and min-$p$ sampling, aim to manage this trade-off. However, they exhibit limitations, particularly in the effective incorporation of the confidence of the model into the corresponding sampling strategy. For example, min-$p$ sampling relies on a single top token as a heuristic for confidence, eventually underutilizing the information of the probability distribution. Toward effective incorporation of the confidence of the model, in this paper, we present **top-H** decoding. We first establish the theoretical foundation of the interplay between creativity and coherence in truncated sampling by formulating an **entropy-constrained minimum divergence** problem. We then prove this minimization problem to be equivalent to an **entropy-constrained mass maximization** (ECMM) problem, which is NP-hard. Finally, we present top-H decoding, a computationally efficient greedy algorithm to solve the ECMM problem. 
Extensive empirical evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA) alternative of min-$p$ sampling by up to **25.63%** on creative writing benchmarks, while maintaining robustness on question-answering datasets such as GPQA, GSM8K, and MT-Bench. Additionally, an *LLM-as-judge* evaluation confirms that top-H indeed produces coherent outputs even at higher temperatures, where creativity is especially critical. In summary, top-H advances SoTA in open-ended text generation and can be *easily integrated* into creative writing applications. The code is available at https://github.com/ErfanBaghaei/Top-H-Decoding.
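
To make the greedy idea in the abstract concrete, here is a minimal sketch of entropy-bounded truncation. It is not the authors' implementation: the function names, the `alpha` parameter, and the choice to bound the entropy of the kept set by `alpha` times the entropy of the full distribution are all assumptions made for illustration. The released repository is the place to check the exact constraint.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a normalized distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_h_truncate(probs, alpha=0.4):
    """Greedy sketch: add tokens in descending probability order while the
    renormalized kept distribution has entropy <= alpha * H(probs).

    `alpha` and this exact form of the bound are assumptions, not taken
    from the paper. Returns (kept indices, renormalized probabilities).
    """
    budget = alpha * entropy(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [order[0]]  # a single token has entropy 0, so it always fits
    for idx in order[1:]:
        candidate = kept + [idx]
        mass = sum(probs[i] for i in candidate)
        renorm = [probs[i] / mass for i in candidate]
        if entropy(renorm) > budget:
            break  # adding this token would exceed the entropy budget
        kept = candidate
    mass = sum(probs[i] for i in kept)
    return kept, [probs[i] / mass for i in kept]
```

Sampling would then draw from the renormalized probabilities of the kept tokens. A larger `alpha` admits more tokens, which is one way to trade coherence for diversity.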

[133] Comparative Study of Pre-Trained BERT and Large Language Models for Code-Mixed Named Entity Recognition

Mayur Shirke,Amey Shembade,Pavan Thorat,Madhushri Wagh,Raviraj Joshi

Main category: cs.CL

TL;DR: This study compares code-mixed and non-code-mixed models for Hinglish NER tasks and finds that code-mixed models like HingRoBERTa and HingBERT-based fine-tuned models perform best, while Google Gemini shows promising zero-shot results.

Details

Motivation: Named Entity Recognition (NER) in code-mixed text, particularly Hindi-English (Hinglish), presents unique challenges due to informal structure, transliteration, and frequent language switching. This study aims to evaluate and compare the performance of various models on code-mixed NER tasks to understand the effectiveness of specialized versus generalized models in this domain. Method: The authors conducted a comparative evaluation of code-mixed fine-tuned models (HingBERT, HingMBERT, HingRoBERTa) and non-code-mixed multilingual models (BERT Base Cased, IndicBERT, RoBERTa, MuRIL), along with the zero-shot generative LLM Google Gemini. They tested all models on a benchmark Hinglish NER dataset using Precision, Recall, and F1-score. Google Gemini was evaluated in a zero-shot setting using a modified version of the dataset with NER tags removed. Result: Code-mixed models, particularly HingRoBERTa and HingBERT-based fine-tuned models, outperformed other models, including closed-source LLMs like Google Gemini, due to domain-specific pretraining. Non-code-mixed models performed reasonably well but showed limited adaptability. Google Gemini demonstrated competitive zero-shot performance, underlining the generalization strength of modern LLMs. Conclusion: The study concludes that code-mixed models, specifically HingRoBERTa and HingBERT-based fine-tuned models, are more effective for code-mixed NER tasks compared to non-code-mixed multilingual models and even closed-source LLMs like Google Gemini. However, Google Gemini shows competitive zero-shot performance, highlighting the generalization strength of modern LLMs. Abstract: Named Entity Recognition (NER) in code-mixed text, particularly Hindi-English (Hinglish), presents unique challenges due to informal structure, transliteration, and frequent language switching. 
This study conducts a comparative evaluation of code-mixed fine-tuned models and non-code-mixed multilingual models, along with zero-shot generative large language models (LLMs). Specifically, we evaluate HingBERT, HingMBERT, and HingRoBERTa (trained on code-mixed data), and BERT Base Cased, IndicBERT, RoBERTa and MuRIL (trained on non-code-mixed multilingual data). We also assess the performance of Google Gemini in a zero-shot setting using a modified version of the dataset with NER tags removed. All models are tested on a benchmark Hinglish NER dataset using Precision, Recall, and F1-score. Results show that code-mixed models, particularly HingRoBERTa and HingBERT-based fine-tuned models, outperform others - including closed-source LLMs like Google Gemini - due to domain-specific pretraining. Non-code-mixed models perform reasonably but show limited adaptability. Notably, Google Gemini exhibits competitive zero-shot performance, underlining the generalization strength of modern LLMs. This study provides key insights into the effectiveness of specialized versus generalized models for code-mixed NER tasks.
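
The three metrics named in the abstract can be sketched as follows. This is a token-level simplification written for illustration: the function name and the convention of treating `"O"` as the non-entity tag are assumptions, and NER evaluations are often entity-level (span-based), which the abstract does not specify.

```python
def precision_recall_f1(gold, pred, outside="O"):
    """Token-level precision, recall, and F1 over entity tags.

    Assumes `gold` and `pred` are equal-length tag sequences and that
    `outside` marks non-entity tokens (an assumption for this sketch).
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if p != outside and p == g)
    fp = sum(1 for g, p in pairs if p != outside and p != g)
    fn = sum(1 for g, p in pairs if g != outside and p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```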

[134] Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Jiaming Li,Longze Chen,Ze Gong,Yukun Chen,Lu Wang,Wanwei He,Run Luo,Min Yang

Main category: cs.CL

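TL;DR: PACS is a new RLVR framework that implicitly couples the actor and critic through a supervised learning formulation, improving the performance of large language models on mathematical reasoning tasks.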

Details

Motivation: Existing RLVR methods suffer from sparse reward signals and unstable policy gradient updates, particularly RL-based approaches. Method: The RLVR problem is reformulated as a supervised learning task in which a score function parameterized by the policy model is optimized with cross-entropy loss, implicitly coupling the actor and critic. Result: On AIME 2025, PACS reaches 59.78% at pass@256, improvements of 13.32 and 14.36 points over PPO and GRPO, respectively. Conclusion: PACS offers a simple yet powerful framework for LLM post-training that outperforms existing RLVR baselines such as PPO and GRPO on mathematical reasoning tasks. Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address the challenges, we propose $\textbf{PACS}$, a novel RLVR framework that achieves im$\textbf{P}$licit $\textbf{A}$ctor $\textbf{C}$ritic coupling via a $\textbf{S}$upervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling actor and critic roles, yielding more stable and efficient training. Benchmarking on challenging mathematical reasoning tasks, PACS outperforms strong RLVR baselines, such as PPO and GRPO, achieving superior reasoning performance. For instance, PACS achieves 59.78\% at pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO. This simple yet powerful framework offers a promising avenue for LLMs post-training with verifiable rewards. 
Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
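
The core idea of treating the outcome reward as a label can be sketched with a plain binary cross-entropy loss. This is a rough illustration only: how PACS derives the score from the policy model is not described in the abstract, so the scalar `scores` below are hypothetical inputs, and the function names are assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def reward_cross_entropy(scores, rewards):
    """Mean binary cross-entropy between sigmoid(score) and a 0/1 reward.

    `scores` stands in for the policy-parameterized score function
    (hypothetical here); `rewards` are verifiable outcome labels.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for s, r in zip(scores, rewards):
        p = sigmoid(s)
        total -= r * math.log(p + eps) + (1 - r) * math.log(1 - p + eps)
    return total / len(scores)
```

Under this formulation, responses whose scores agree with the verifier's verdict incur a low loss, so minimizing it pushes the score for correct responses up and for incorrect responses down.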

[135] Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices

Evan King,Adam Sabra,Manjunath Kudlur,James Wang,Pete Warden

Main category: cs.CL

TL;DR: The paper introduces the Flavors of Moonshine, a suite of small ASR models for underrepresented languages, which outperform larger multilingual models in terms of accuracy.

Details

Motivation: The motivation is to challenge the prevailing belief that multilingual ASR models are superior, particularly for small models and underrepresented languages. Method: The authors developed specialized monolingual ASR models using a combination of high-quality human-labeled, pseudo-labeled, and synthetic data. Result: The Moonshine models achieved error rates 48% lower than the Whisper Tiny model, outperformed the Whisper Small model, and matched or exceeded the performance of the Whisper Medium model. Conclusion: The paper concludes that monolingual ASR models, when trained on a balanced mix of data types, can outperform multilingual models, particularly for underrepresented languages. Abstract: We present the Flavors of Moonshine, a suite of tiny automatic speech recognition (ASR) models specialized for a range of underrepresented languages. Prevailing wisdom suggests that multilingual ASR models outperform monolingual counterparts by exploiting cross-lingual phonetic similarities. We challenge this assumption, showing that for sufficiently small models (27M parameters), training monolingual systems on a carefully balanced mix of high-quality human-labeled, pseudo-labeled, and synthetic data yields substantially superior performance. On average, our models achieve error rates 48% lower than the comparably sized Whisper Tiny model, outperform the 9x larger Whisper Small model, and in most cases match or outperform the 28x larger Whisper Medium model. These results advance the state of the art for models of this size, enabling accurate on-device ASR for languages that previously had limited support. We release Arabic, Chinese, Japanese, Korean, Ukrainian, and Vietnamese Moonshine models under a permissive open-source license.
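
For readers unfamiliar with how "error rate" and "48% lower" are typically computed, the sketch below shows a word error rate based on edit distance and a relative reduction. The abstract does not say whether word or character error rate is used per language, so the word-level split is an assumption, as are the function names.

```python
def word_error_rate(reference, hypothesis):
    """Edit distance between word sequences divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_rate, new_rate):
    """Fractional reduction of an error rate relative to a baseline."""
    return (baseline_rate - new_rate) / baseline_rate
```

For languages without whitespace word boundaries, the same routine can be applied to character sequences to obtain a character error rate.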

[136] Jointly Reinforcing Diversity and Quality in Language Model Generations

Tianjian Li,Yiming Zhang,Ping Yu,Swarnadeep Saha,Daniel Khashabi,Jason Weston,Jack Lanchantin,Tianlu Wang

Main category: cs.CL

TL;DR: DARLING is a framework for training large language models that simultaneously improves response quality and semantic diversity, enhancing performance in creative and problem-solving tasks.

Details

Motivation: Post-training of LMs often prioritizes accuracy over diversity, limiting their usefulness in creative and exploratory tasks. This work aims to address that tension. Method: DARLING introduces a learned partition function to measure semantic diversity and combines it with a quality reward during online reinforcement learning. Result: DARLING outperforms quality-only RL baselines in generating higher-quality and more novel outputs for non-verifiable tasks, and achieves better solution quality and variety in verifiable tasks. Conclusion: DARLING effectively balances quality and diversity in post-training LMs, enhancing performance across both non-verifiable and verifiable tasks. Abstract: Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and reduces the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are simultaneously of higher quality and novelty. 
In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests itself as higher-quality responses.
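
The pass@k metric mentioned above is commonly estimated from n samples per problem, of which c are correct, using the combinatorial estimator 1 - C(n-c, k) / C(n, k). Whether this paper uses exactly that estimator is an assumption; the sketch is included only to show how pass@1 and pass@k relate.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn without
    replacement from n generations (c of them correct) is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimator reduces to c / n, the fraction of correct samples, while larger k rewards a model for having at least one correct solution among varied attempts.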

[137] PalmX 2025: The First Shared Task on Benchmarking LLMs on Arabic and Islamic Culture

Fakhraddin Alwajih,Abdellah El Mekki,Hamdy Mubarak,Majd Hawasly,Abubakr Mohamed,Muhammad Abdul-Mageed

Main category: cs.CL

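TL;DR: PalmX 2025 is a shared task that evaluates the cultural competence of large language models on Arabic and Islamic culture; results show that task-specific fine-tuning substantially improves performance.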

Details

Motivation: Because their training data skews toward Western languages and cultures, large language models (LLMs) fall short in understanding and reflecting Arabic and Islamic cultures, especially on under-represented topics. Method: PalmX 2025 comprises two subtasks (Arabic culture and Islamic culture) built from multiple-choice questions (MCQs) in Modern Standard Arabic (MSA), and the effect of task-specific fine-tuning and other techniques is assessed. Result: 26 teams registered for Subtask 1 and 19 for Subtask 2; task-specific fine-tuning substantially improved performance, with the best systems reaching 72.15% accuracy on cultural questions and 84.22% on Islamic knowledge. Conclusion: PalmX 2025 shows that task-specific fine-tuning substantially improves LLMs' understanding of Arabic and Islamic culture, that parameter-efficient fine-tuning is effective, and that the benefit of data augmentation is domain-dependent. Abstract: Large Language Models (LLMs) inherently reflect the vast data distributions they encounter during their pre-training phase. As this data is predominantly sourced from the web, there is a high chance it will be skewed towards high-resourced languages and cultures, such as those of the West. Consequently, LLMs often exhibit a diminished understanding of certain communities, a gap that is particularly evident in their knowledge of Arabic and Islamic cultures. This issue becomes even more pronounced with increasingly under-represented topics. To address this critical challenge, we introduce PalmX 2025, the first shared task designed to benchmark the cultural competence of LLMs in these specific domains. The task is composed of two subtasks featuring multiple-choice questions (MCQs) in Modern Standard Arabic (MSA): General Arabic Culture and General Islamic Culture. These subtasks cover a wide range of topics, including traditions, food, history, religious practices, and language expressions from across 22 Arab countries. The initiative drew considerable interest, with 26 teams registering for Subtask 1 and 19 for Subtask 2, culminating in nine and six valid submissions, respectively. Our findings reveal that task-specific fine-tuning substantially boosts performance over baseline models. The top-performing systems achieved an accuracy of 72.15% on cultural questions and 84.22% on Islamic knowledge. Parameter-efficient fine-tuning emerged as the predominant and most effective approach among participants, while the utility of data augmentation was found to be domain-dependent.

cs.CV [Back]

[138] Deep Learning-Driven Multimodal Detection and Movement Analysis of Objects in Culinary

Tahoshin Alam Ishat

Main category: cs.CV

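TL;DR: This paper uses YOLOv8, LSTM, and ASR models to extract data that an LLM uses to predict a recipe and generate a cooking guide, illustrating the broad applicability of computer vision to everyday tasks.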

Details

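Motivation: To explore the potential of existing models for extending computer vision to complex and challenging everyday settings such as kitchen work. Method: Existing models are fine-tuned and combined: a YOLOv8 segmentation model, an LSTM model, and an ASR model (whisper-base) extract information from data collected by the author, which an LLM (TinyLLaMa) uses to predict the recipe and generate a step-by-step guide. Result: A robust, task-specific system that operates effectively in complex environments and points to many further applications of such models in important everyday tasks. Conclusion: By combining a YOLOv8 segmentation model, an LSTM model, and an ASR model to supply an LLM with sufficient data, the work predicts recipes and generates step-by-step cooking guides, showing the potential of computer vision in everyday activities. Abstract: This research explores existing models and fine-tunes them to combine a YOLOv8 segmentation model, an LSTM model trained on hand-point motion sequences, and an ASR model (whisper-base) to extract enough data for an LLM (TinyLLaMa) to predict the recipe and generate a step-by-step guide to the cooking procedure. All data were gathered by the author to build a robust, task-specific system that performs well in complex and challenging environments, demonstrating the extension and broad application of computer vision to daily activities such as kitchen work. This work opens the field to many more crucial tasks of day-to-day life.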

[139] AMMKD: Adaptive Multimodal Multi-teacher Distillation for Lightweight Vision-Language Models

Yuqi Li,Chuanguang Yang,Junhao Dong,Zhengtao Yao,Haoyan Xu,Zeyu Dong,Hansheng Zeng,Zhulin An,Yingli Tian

Main category: cs.CV

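TL;DR: The authors propose Adaptive Multi-Modal Multi-Teacher Knowledge Distillation (AMMKD), a novel framework that combines multi-modal feature fusion, multi-teacher distillation, and adaptive optimization to produce lightweight yet effective image-text retrieval models.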

Details

Motivation: The size and computational complexity of large-scale visual language pretraining (VLP) models limit their deployment on mobile devices, so a lightweight and efficient alternative is needed. Method: A feature fusion network extracts and merges discriminative features from the image and text modalities; a multi-teacher knowledge distillation framework pre-trains two CLIP teacher models and uses KL divergence for probability distribution matching; finally, an adaptive dynamic weighting scheme treats multi-teacher distillation as a multi-objective optimization problem. Result: Extensive experiments on three benchmark datasets show that AMMKD achieves superior performance while significantly reducing model complexity, validating its effectiveness and flexibility. Conclusion: AMMKD addresses the challenge of deploying large VLP models on mobile devices and provides an efficient solution for multimodal retrieval tasks. Abstract: The success of large-scale visual language pretraining (VLP) models has driven widespread adoption of image-text retrieval tasks. However, their deployment on mobile devices remains limited due to large model sizes and computational complexity. We propose Adaptive Multi-Modal Multi-Teacher Knowledge Distillation (AMMKD), a novel framework that integrates multi-modal feature fusion, multi-teacher distillation, and adaptive optimization to deliver lightweight yet effective retrieval models. Specifically, our method begins with a feature fusion network that extracts and merges discriminative features from both the image and text modalities. To reduce model parameters and further improve performance, we design a multi-teacher knowledge distillation framework to pre-train two CLIP teacher models. We decouple modalities by pre-computing and storing text features as class vectors via the teacher text encoder to enhance efficiency. To better align teacher and student outputs, we apply KL divergence for probability distribution matching. Finally, we design an adaptive dynamic weighting scheme that treats multi-teacher distillation as a multi-objective optimization problem. By leveraging gradient space diversity, we dynamically adjust the influence of each teacher, reducing conflicts and guiding the student toward more optimal learning directions. Extensive experiments on three benchmark datasets demonstrate that AMMKD achieves superior performance while significantly reducing model complexity, validating its effectiveness and flexibility.
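
As a rough illustration of distribution matching with two teachers, the sketch below computes a weighted sum of KL terms between temperature-softened teacher and student distributions. The per-teacher `weights` are fixed inputs here, whereas the paper adjusts them adaptively; the temperature value and all function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights,
                          temperature=2.0):
    """Weighted sum of KL(teacher || student) over all teachers.

    `weights` are fixed in this sketch; the adaptive scheme described in
    the abstract is not reproduced here.
    """
    q = softmax(student_logits, temperature)
    loss = 0.0
    for weight, teacher_logits in zip(weights, teacher_logits_list):
        p = softmax(teacher_logits, temperature)
        loss += weight * kl_divergence(p, q)
    return loss
```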

[140] ARTPS: Depth-Enhanced Hybrid Anomaly Detection and Learnable Curiosity Score for Autonomous Rover Target Prioritization

Poyraz Baydemir

Main category: cs.CV

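TL;DR: ARTPS is a hybrid AI system that combines depth estimation, anomaly detection, and curiosity scoring, significantly improving the efficiency and accuracy of autonomous exploration in planetary missions.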

Details

Motivation: Improving the efficiency and accuracy of autonomous rover exploration in complex terrain requires a system that can integrate several kinds of information. Method: Monocular depth estimation, multi-component anomaly detection, and a weighted curiosity score are combined into a hybrid AI system. Result: On Mars rover datasets the system achieves an AUROC of 0.94, an AUPRC of 0.89, and an F1 score of 0.87, and reduces false positives by 23%. Conclusion: ARTPS combines several AI techniques to improve the accuracy of autonomous planetary surface exploration while substantially reducing false positives. Abstract: We present ARTPS (Autonomous Rover Target Prioritization System), a novel hybrid AI system that combines depth estimation, anomaly detection, and learnable curiosity scoring for autonomous exploration of planetary surfaces. Our approach integrates monocular depth estimation using Vision Transformers with multi-component anomaly detection and a weighted curiosity score that balances known value, anomaly signals, depth variance, and surface roughness. The system achieves state-of-the-art performance with AUROC of 0.94, AUPRC of 0.89, and F1-Score of 0.87 on Mars rover datasets. We demonstrate significant improvements in target prioritization accuracy through ablation studies and provide comprehensive analysis of component contributions. The hybrid fusion approach reduces false positives by 23% while maintaining high detection sensitivity across diverse terrain types.
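
The weighted curiosity score can be pictured as a weighted sum of the four signals the abstract lists. The sketch below is illustrative only: the weights are hypothetical fixed values (the paper describes them as learnable), the inputs are assumed to be normalized to [0, 1], and the function names are not from the paper.

```python
def curiosity_score(known_value, anomaly, depth_variance, roughness,
                    weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted sum of four signals, each assumed to lie in [0, 1].

    The default weights are placeholders for illustration, not values
    reported by the authors.
    """
    signals = (known_value, anomaly, depth_variance, roughness)
    return sum(w * s for w, s in zip(weights, signals))


def prioritize(targets):
    """Return target names sorted by descending curiosity score.

    `targets` maps a name to its (known_value, anomaly, depth_variance,
    roughness) tuple.
    """
    return sorted(targets, key=lambda name: curiosity_score(*targets[name]),
                  reverse=True)
```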

[141] Performance is not All You Need: Sustainability Considerations for Algorithms

Xiang Li,Chong Zhang,Hongpeng Wang,Shreyank Narayana Gowda,Yushi Li,Xiaobo Jin

Main category: cs.CV

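TL;DR: This study proposes a new sustainability evaluation system with two quantitative metrics (FMS and ASC) that jointly assess the training energy consumption and performance of deep learning models, promoting green AI.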

Details

Motivation: The high carbon emissions produced by training deep learning models need to be addressed, and balancing algorithm performance against energy consumption is the core challenge. Method: The study proposes two quantitative metrics, the sustainable harmonic mean (FMS) and the area under the sustainability curve (ASC), and builds benchmarks across a range of multimodal tasks to validate them. Result: Experiments show that the evaluation system reflects the energy-efficiency characteristics of algorithms across tasks in several modalities and offers practical support for green AI research. Conclusion: The study proposes a two-dimensional sustainability evaluation system that provides a quantitative basis for evaluating algorithms across tasks and helps move green AI research from theory to practice. Abstract: This work focuses on the high carbon emissions generated by deep learning model training, specifically addressing the core challenge of balancing algorithm performance and energy consumption. It proposes an innovative two-dimensional sustainability evaluation system. Unlike the traditional evaluation paradigm, which is oriented toward performance alone, this study introduces two quantitative indicators that integrate energy efficiency ratio and accuracy: the sustainable harmonic mean (FMS) integrates accumulated energy consumption and performance parameters through the harmonic mean to reveal the algorithm performance under unit energy consumption; the area under the sustainability curve (ASC) constructs a performance-power consumption curve to characterize the energy efficiency characteristics of the algorithm throughout the cycle. To verify the universality of the indicator system, the study constructed benchmarks in various multimodal tasks, including image classification, segmentation, pose estimation, and batch and online learning. Experiments demonstrate that the system can provide a quantitative basis for evaluating cross-task algorithms and promote the transition of green AI research from theory to practice. Our sustainability evaluation framework code can be found here, providing methodological support for the industry to establish algorithm energy efficiency standards.
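
The two indicators rest on two standard operations: a harmonic mean and an area under a curve. The sketch below shows only those building blocks. How the paper normalizes energy consumption into a score, and which axes the sustainability curve uses exactly, are not stated in the abstract, so the inputs here are assumed to be already normalized, and the function names are hypothetical.

```python
def harmonic_mean(performance, efficiency):
    """Harmonic mean of two non-negative scores.

    Both inputs are assumed to be normalized to [0, 1]; the mapping from
    raw energy use to `efficiency` is left out as it is not specified.
    """
    if performance + efficiency == 0:
        return 0.0
    return 2 * performance * efficiency / (performance + efficiency)


def area_under_curve(xs, ys):
    """Trapezoidal area under a curve sampled at increasing x values,
    e.g. performance (ys) against cumulative energy consumption (xs)."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))
```

The harmonic mean penalizes an imbalance between the two scores, so a model with high accuracy but poor energy efficiency scores lower than one that is moderately good at both.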

[142] MESTI-MEGANet: Micro-expression Spatio-Temporal Image and Micro-expression Gradient Attention Networks for Micro-expression Recognition

Luu Tu Nguyen,Vu Tram Anh Khuong,Thanh Ha Le,Thi Duyen Ngo

Main category: cs.CV

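TL;DR: This study proposes MESTI and MEGANet, a novel dynamic input modality and network architecture for micro-expression recognition; combined, they outperform existing methods.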

Details

Motivation: Because micro-expressions are subtle and fleeting, traditional input modalities such as Apex Frame, Optical Flow, and Dynamic Image often fail to capture these brief facial movements adequately, resulting in suboptimal performance. Method: The authors propose a new dynamic input modality, the Micro-expression Spatio-Temporal Image (MESTI), and the Micro-expression Gradient Attention Network (MEGANet), which includes a new Gradient Attention block to enhance extraction of fine-grained motion features from micro-expressions. Result: Experiments show that combining MESTI and MEGANet achieves state-of-the-art results on the CASMEII and SAMM datasets, and that replacing existing input modalities with MESTI yields consistent performance gains. Conclusion: The combination of MESTI and MEGANet achieves the highest accuracy reported to date for micro-expression recognition, laying groundwork for more effective MER systems. Abstract: Micro-expression recognition (MER) is a challenging task due to the subtle and fleeting nature of micro-expressions. Traditional input modalities, such as Apex Frame, Optical Flow, and Dynamic Image, often fail to adequately capture these brief facial movements, resulting in suboptimal performance. In this study, we introduce the Micro-expression Spatio-Temporal Image (MESTI), a novel dynamic input modality that transforms a video sequence into a single image while preserving the essential characteristics of micro-movements. Additionally, we present the Micro-expression Gradient Attention Network (MEGANet), which incorporates a novel Gradient Attention block to enhance the extraction of fine-grained motion features from micro-expressions. By combining MESTI and MEGANet, we aim to establish a more effective approach to MER. Extensive experiments were conducted to evaluate the effectiveness of MESTI, comparing it with existing input modalities across three CNN architectures (VGG19, ResNet50, and EfficientNetB0). Moreover, we demonstrate that replacing the input of previously published MER networks with MESTI leads to consistent performance improvements. The performance of MEGANet, both with MESTI and Dynamic Image, is also evaluated, showing that our proposed network achieves state-of-the-art results on the CASMEII and SAMM datasets. The combination of MEGANet and MESTI achieves the highest accuracy reported to date, setting a new benchmark for micro-expression recognition. 
These findings underscore the potential of MESTI as a superior input modality and MEGANet as an advanced recognition network, paving the way for more effective MER systems in a variety of applications.

[143] Scaffold Diffusion: Sparse Multi-Category Voxel Structure Generation with Discrete Diffusion

Justin Jung

Main category: cs.CV

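TL;DR: This paper introduces Scaffold Diffusion, which uses a discrete diffusion language model to generate sparse multi-category 3D voxel structures, addressing the limitations of prior methods on highly sparse data.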

Details

Motivation: Generating realistic sparse multi-category 3D voxel structures is difficult because of the cubic memory scaling of voxel structures and the significant class imbalance caused by sparsity. Method: Voxels are treated as tokens, and a discrete diffusion language model generates the 3D voxel structures. Result: Scaffold Diffusion performs well on a dataset of Minecraft house structures, producing realistic and coherent structures even when trained on data with over 98% sparsity. Conclusion: Scaffold Diffusion is a generative model for sparse multi-category 3D voxel structures, and the results indicate that discrete diffusion is a promising framework for 3D sparse voxel generative modeling. Abstract: Generating realistic sparse multi-category 3D voxel structures is difficult due to the cubic memory scaling of voxel structures and the significant class imbalance caused by sparsity. We introduce Scaffold Diffusion, a generative model designed for sparse multi-category 3D voxel structures. By treating voxels as tokens, Scaffold Diffusion uses a discrete diffusion language model to generate 3D voxel structures. We show that discrete diffusion language models can be extended beyond inherently sequential domains such as text to generate spatially coherent 3D structures. We evaluate on Minecraft house structures from the 3D-Craft dataset and demonstrate that, unlike prior baselines and an auto-regressive formulation, Scaffold Diffusion produces realistic and coherent structures even when trained on data with over 98% sparsity. We provide an interactive viewer where readers can visualize generated samples and the generation process. Our results highlight discrete diffusion as a promising framework for 3D sparse voxel generative modeling.
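
"Treating voxels as tokens" can be pictured as flattening a 3D grid of category ids into a sequence and back. The sketch below assumes a simple raster-order flattening and uses id 0 for empty space; both are assumptions for illustration, and the paper may order or select voxels differently.

```python
def voxels_to_tokens(grid):
    """Flatten grid[x][y][z] (nested lists of category ids) in raster order."""
    return [voxel for plane in grid for row in plane for voxel in row]


def tokens_to_voxels(tokens, shape):
    """Rebuild a nested grid of the given (x, y, z) shape from a token list."""
    x_dim, y_dim, z_dim = shape
    if len(tokens) != x_dim * y_dim * z_dim:
        raise ValueError("token count does not match the requested shape")
    stream = iter(tokens)
    return [[[next(stream) for _ in range(z_dim)]
             for _ in range(y_dim)]
            for _ in range(x_dim)]


def sparsity(tokens, empty=0):
    """Fraction of tokens equal to the (assumed) empty category id."""
    return tokens.count(empty) / len(tokens)
```

A sparsity above 0.98 means fewer than 2% of tokens carry a non-empty category, which is the class imbalance the abstract refers to.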

[144] Dual-Stage Global and Local Feature Framework for Image Dehazing

Anas M. Ali,Anis Koubaa,Bilel Benjdira

Main category: cs.CV

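TL;DR: This paper proposes SGLC, a novel framework for high-resolution image dehazing that combines global feature generation with local feature enhancement and markedly improves dehazing quality.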

Details

Motivation: Current dehazing models degrade noticeably on high-resolution images, mainly because it is difficult to combine global contextual information with local detail effectively. Method: The Streamlined Global and Local Features Combinator (SGLC) framework pairs a Global Features Generator (GFG) with a Local Features Enhancer (LFE), which handle global scene understanding and local detail enhancement, respectively. Result: Experiments on high-resolution datasets show a considerable improvement in peak signal-to-noise ratio (PSNR) when SGLC is used, indicating its effectiveness for dehazing large-scale imagery. Conclusion: SGLC is model-agnostic, so any dehazing network can be augmented with the proposed global-and-local feature fusion mechanism, significantly improving visual fidelity in high-resolution settings. Abstract: Addressing the challenge of removing atmospheric fog or haze from digital images, known as image dehazing, has recently gained significant traction in the computer vision community. Although contemporary dehazing models have demonstrated promising performance, few have thoroughly investigated high-resolution imagery. In such scenarios, practitioners often resort to downsampling the input image or processing it in smaller patches, which leads to a notable performance degradation. This drop is primarily linked to the difficulty of effectively combining global contextual information with localized, fine-grained details as the spatial resolution grows. In this chapter, we propose a novel framework, termed the Streamlined Global and Local Features Combinator (SGLC), to bridge this gap and enable robust dehazing for high-resolution inputs. Our approach is composed of two principal components: the Global Features Generator (GFG) and the Local Features Enhancer (LFE). The GFG produces an initial dehazed output by focusing on broad contextual understanding of the scene. Subsequently, the LFE refines this preliminary output by enhancing localized details and pixel-level features, thereby capturing the interplay between global appearance and local structure. To evaluate the effectiveness of SGLC, we integrated it with the Uformer architecture, a state-of-the-art dehazing model. Experimental results on high-resolution datasets reveal a considerable improvement in peak signal-to-noise ratio (PSNR) when employing SGLC, indicating its potency in addressing haze in large-scale imagery. 
Moreover, the SGLC design is model-agnostic, allowing any dehazing network to be augmented with the proposed global-and-local feature fusion mechanism. Through this strategy, practitioners can harness both scene-level cues and granular details, significantly improving visual fidelity in high-resolution environments.
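
Since the reported gains are stated in PSNR, a short reference sketch of that metric may help. It uses the standard definition, 10 * log10(MAX^2 / MSE), on flat lists of pixel intensities; the function name and the 8-bit default for `max_value` are choices made here, not details from the paper.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length
    flat sequences of pixel intensities."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

Higher values indicate that the dehazed image is closer to the haze-free reference.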

[145] Self-supervised large-scale kidney abnormality detection in drug safety assessment studies

Ivan Slootweg,Natalia P. García-De-La-Puente,Geert Litjens,Salma Dammak

Main category: cs.CV

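TL;DR: This paper presents a large-scale self-supervised abnormality detection model for kidney toxicologic pathology. Using features from the UNI foundation model with a self-supervised method, it achieves better-than-chance detection and could reduce the cost and time of drug development.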

Details

Motivation: Kidney abnormality detection is essential in preclinical drug development, but examining large numbers of whole-slide images is time-consuming and costly, so an efficient way to reduce the effort spent on normal samples is needed. Method: The study extracts features with the UNI foundation model and explores how well a self-supervised learning method detects abnormalities. Result: The self-supervised method performs better than chance, with an area under the curve of 0.62 and a negative predictive value of 89%. Conclusion: The model could be used to rule out normal slides in drug safety assessment studies, reducing the cost and time of drug development. Abstract: Kidney abnormality detection is required for all preclinical drug development. It involves a time-consuming and costly examination of hundreds to thousands of whole-slide images per drug safety study, most of which are normal, to detect any subtle changes indicating toxic effects. In this study, we present the first large-scale self-supervised abnormality detection model for kidney toxicologic pathology, spanning drug safety assessment studies from 158 compounds. We explore the complexity of kidney abnormality detection on this scale using features extracted from the UNI foundation model (FM) and show that a simple k-nearest neighbor classifier on these features performs at chance, demonstrating that the FM-generated features alone are insufficient for detecting abnormalities. We then demonstrate that a self-supervised method applied to the same features can achieve better-than-chance performance, with an area under the receiver operating characteristic curve of 0.62 and a negative predictive value of 89%. With further development, such a model can be used to rule out normal slides in drug safety assessment studies, reducing the costs and time associated with drug development.
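
The two reported figures, AUROC and negative predictive value, can be computed as sketched below. The code uses the standard definitions (pairwise ranking for AUROC, TN / (TN + FN) for NPV); the function names and the convention that label 1 means "abnormal" are assumptions for this illustration.

```python
def negative_predictive_value(labels, predictions):
    """TN / (TN + FN), with 1 = abnormal and 0 = normal (assumed)."""
    negatives = [(y, p) for y, p in zip(labels, predictions) if p == 0]
    if not negatives:
        raise ValueError("no negative predictions to evaluate")
    true_negatives = sum(1 for y, _ in negatives if y == 0)
    return true_negatives / len(negatives)


def auroc(labels, scores):
    """Probability that a random abnormal sample scores higher than a
    random normal one, counting ties as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

A high NPV matters for the rule-out use case described in the abstract: it is the fraction of slides flagged as normal that really are normal.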

[146] Waste-Bench: A Comprehensive Benchmark for Evaluating VLLMs in Cluttered Environments

Muhammad Ali,Salman Khan

Main category: cs.CV

TL;DR: This study introduces a novel dataset for waste classification in complex real-world environments and evaluates the robustness of Vision Large Language Models, highlighting the need for further improvements in model robustness.

Details

Motivation: While Large Language Models (LLMs) have shown impressive performance on standard natural images, their capabilities have not been thoroughly explored on cluttered datasets featuring complex environments and deformed objects. Method: The study introduces a novel dataset for waste classification in real-world scenarios and presents an in-depth evaluation approach to assess the robustness and accuracy of Vision Large Language Models (VLLMs). Result: The introduced dataset and evaluation approach provide valuable insights into the performance of VLLMs under challenging conditions. Conclusion: The study concludes that there is a critical need for further advances in VLLM robustness to improve performance in complex environments, and the dataset and code will be made publicly available. Abstract: Recent advancements in Large Language Models (LLMs) have paved the way for Vision Large Language Models (VLLMs) capable of performing a wide range of visual understanding tasks. While LLMs have demonstrated impressive performance on standard natural images, their capabilities have not been thoroughly explored on cluttered datasets featuring complex environments and deformed objects. In this work, we introduce a novel dataset specifically designed for waste classification in real-world scenarios, characterized by complex environments and deformed objects. Along with this dataset, we present an in-depth evaluation approach to rigorously assess the robustness and accuracy of VLLMs. The introduced dataset and comprehensive analysis provide valuable insights into the performance of VLLMs under challenging conditions. Our findings highlight the critical need for further advances in VLLM robustness so that these models perform better in complex environments. The dataset and code for our experiments will be made publicly available.

[147] Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders

Faizan Farooq Khan,Vladan Stojnić,Zakaria Laskar,Mohamed Elhoseiny,Giorgos Tolias

Main category: cs.CV

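TL;DR: This paper proposes a two-step method that combines a generative diffusion model with a vision model to improve text-to-image retrieval performance.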

Details

Motivation: Existing vision-language models such as CLIP map text and images to distant regions of the representation space, which limits text-to-image retrieval performance, so a method is needed to bridge this modality gap. Method: A generative diffusion model first transforms the text query into a visual query, and a vision model then estimates image-to-image similarity. An aggregation network combines multiple generated images and fuses the similarity scores of both query modalities. Result: In extensive evaluations, the method consistently outperforms retrieval methods that rely solely on text queries. Conclusion: The two-step approach, built on vision encoders, vision-language models, and text-to-image generation models, bridges the modality gap in cross-modal retrieval, and experiments show that it outperforms text-only retrieval. Abstract: This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir
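The two-step retrieval described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: mean pooling stands in for the learned aggregation network, a fixed weight `alpha` stands in for the learned score fusion, and all function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def aggregate(embeddings):
    """Assumption: mean-pool the generated-image embeddings into one visual query.
    The paper uses a learned aggregation network instead."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalize(mean)

def fused_scores(text_emb, generated_embs, gallery_vlm, gallery_vision, alpha=0.5):
    """Score each gallery image by mixing text-to-image and image-to-image similarity.
    gallery_vlm: gallery embeddings in the VLM space (compared with the text query).
    gallery_vision: gallery embeddings in the vision-model space (compared with the visual query).
    alpha is a hypothetical fixed fusion weight."""
    visual_query = aggregate(generated_embs)
    scores = []
    for g_vlm, g_vis in zip(gallery_vlm, gallery_vision):
        s_text = cosine(text_emb, g_vlm)
        s_image = cosine(visual_query, g_vis)
        scores.append(alpha * s_text + (1 - alpha) * s_image)
    return scores

def rank(scores):
    """Gallery indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this toy setup the visual query can break ties that the text query alone leaves unresolved, which is the intuition behind fusing the two modalities.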

[148] Safe-LLaVA: A Privacy-Preserving Vision-Language Dataset and Benchmark for Biometric Safety

Younggun Kim,Sirnam Swetha,Fazil Kagdi,Mubarak Shah

Main category: cs.CV

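TL;DR: This paper presents the PRISM benchmark and the Safe-LLaVA dataset, which aim to evaluate and reduce biometric information leakage in multimodal large language models and thereby improve privacy protection.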

Details

Motivation: Existing multimodal large language models (MLLMs) may leak sensitive biometric information, which raises privacy concerns in real-world applications. However, no public dataset or benchmark exists for evaluating and mitigating this leakage. Method: The authors develop a new benchmark, PRISM, that assesses how well MLLMs refuse biometric-related queries and avoid implicit biometric leakage, and they construct the privacy-preserving Safe-LLaVA dataset. Result: PRISM reveals leakage across different MLLMs for multiple biometric attributes, and fine-tuning on Safe-LLaVA substantially reduces this leakage. Conclusion: Together, Safe-LLaVA and PRISM set a new standard for privacy-aligned development and evaluation of MLLMs. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks. However, these models often infer and reveal sensitive biometric attributes - such as race, gender, age, body weight, and eye color - even when such information is not explicitly requested. This raises critical concerns, particularly in real-world applications and socially-sensitive domains. Despite increasing awareness, no publicly available dataset or benchmark exists to comprehensively evaluate or mitigate biometric leakage in MLLMs. To address this gap, we introduce PRISM (Privacy-aware Evaluation of Responses in Sensitive Modalities), a new benchmark designed to assess MLLMs on two fronts: (1) refusal of biometric-related queries and (2) implicit biometric leakage in general responses while maintaining semantic faithfulness. Further, we conduct a detailed audit of the widely used LLaVA datasets and uncover extensive biometric leakage across pretraining and instruction data. To address this, we present the Safe-LLaVA dataset, the first privacy-preserving MLLM training dataset constructed by systematically removing explicit and implicit biometric information from the LLaVA dataset. Our evaluations on PRISM reveal biometric leakages across MLLMs for different attributes, highlighting the detailed privacy violations. We also fine-tune a model on the Safe-LLaVA dataset and show that it substantially reduces the biometric leakages. Together, Safe-LLaVA & PRISM set a new standard for privacy-aligned development and evaluation of MLLMs.
The Safe-LLaVA dataset & PRISM benchmark are publicly available at https://huggingface.co/datasets/kyh9191/Safe-LLaVA, and the source code is available at https://github.com/Kimyounggun99/Safe-LLaVA.git.

[149] Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment

Jinzhou Tang,Jusheng zhang,Sidi Liu,Waikit Xiu,Qinhan Lv,Xiying Li

Main category: cs.CV

TL;DR: VEME is a new cross-modal alignment method for improving generalization in unseen scenes.

Details

Motivation: Although advanced vision-language models excel at static scene understanding, they still fall short in spatio-temporal reasoning and adaptation in dynamic environments. Method: VEME comprises three key components: a cross-modal alignment framework, a dynamic implicit cognitive map activated by world embedding, and an instruction-based navigation and reasoning framework. Result: Experiments show that VEME improves accuracy and exploration efficiency by 1%-3% over traditional approaches on VSI-Bench and VLN-CE. Conclusion: By embedding geometry-aware spatio-temporal experiences, VEME improves reasoning and planning in dynamic environments. Abstract: Achieving human-like reasoning in deep learning models for complex tasks in unknown environments remains a critical challenge in embodied intelligence. While advanced vision-language models (VLMs) excel in static scene understanding, their limitations in spatio-temporal reasoning and adaptation to dynamic, open-set tasks like task-oriented navigation and embodied question answering (EQA) persist due to inadequate modeling of fine-grained spatio-temporal cues and physical world comprehension. To address this, we propose VEME, a novel cross-modal alignment method that enhances generalization in unseen scenes by learning an ego-centric, experience-centered world model. Our framework integrates three key components: (1) a cross-modal alignment framework bridging objects, spatial representations, and visual semantics with spatio-temporal cues to enhance VLM in-context learning; (2) a dynamic, implicit cognitive map activated by world embedding to enable task-relevant geometric-semantic memory recall; and (3) an instruction-based navigation and reasoning framework leveraging embodied priors for long-term planning and efficient exploration. By embedding geometry-aware spatio-temporal episodic experiences, our method significantly improves reasoning and planning in dynamic environments. Experimental results on VSI-Bench and VLN-CE demonstrate 1%-3% accuracy and exploration efficiency improvement compared to traditional approaches.

[150] Multimodal Deep Learning for Phyllodes Tumor Classification from Ultrasound and Clinical Data

Farhan Fuad Abir,Abigail Elliott Daly,Kyle Anderman,Tolga Ozmen,Laura J. Brattain

Main category: cs.CV

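TL;DR: This study proposes a multimodal deep learning framework that combines breast ultrasound images with clinical data to improve preoperative classification of phyllodes tumors (PTs) and thereby reduce unnecessary surgical excisions.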

Details

Motivation: Phyllodes tumors (PTs) resemble benign fibroadenomas on imaging, which makes preoperative classification difficult and often leads to unnecessary surgical excisions. A more accurate preoperative classification method is therefore needed. Method: The authors develop a dual-branch neural network that extracts and fuses features from ultrasound images and patient metadata, and they apply class-aware sampling and subject-stratified 5-fold cross-validation to address class imbalance and prevent data leakage. Result: The multimodal method outperforms unimodal baselines in classifying benign versus borderline/malignant PTs. Among six image encoders, ConvNeXt and ResNet18 perform best, with AUC-ROC scores of 0.9427 and 0.9349 and F1-scores of 0.6720 and 0.7294, respectively. Conclusion: The study demonstrates the potential of multimodal AI as a non-invasive diagnostic tool that can reduce unnecessary biopsies and improve clinical decision-making in breast tumor management. Abstract: Phyllodes tumors (PTs) are rare fibroepithelial breast lesions that are difficult to classify preoperatively due to their radiological similarity to benign fibroadenomas. This often leads to unnecessary surgical excisions. To address this, we propose a multimodal deep learning framework that integrates breast ultrasound (BUS) images with structured clinical data to improve diagnostic accuracy. We developed a dual-branch neural network that extracts and fuses features from ultrasound images and patient metadata from 81 subjects with confirmed PTs. Class-aware sampling and subject-stratified 5-fold cross-validation were applied to address class imbalance and prevent data leakage. The results show that our proposed multimodal method outperforms unimodal baselines in classifying benign versus borderline/malignant PTs. Among six image encoders, ConvNeXt and ResNet18 achieved the best performance in the multimodal setting, with AUC-ROC scores of 0.9427 and 0.9349, and F1-scores of 0.6720 and 0.7294, respectively. This study demonstrates the potential of multimodal AI to serve as a non-invasive diagnostic tool, reducing unnecessary biopsies and improving clinical decision-making in breast tumor management.
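Subject-stratified cross-validation, as mentioned above, assigns whole subjects (not individual images) to folds while keeping class proportions similar across folds. The sketch below is one plausible way to do this with the standard library only; the function names and the round-robin assignment are assumptions for illustration, and the paper's exact splitting procedure is not specified in the abstract.

```python
import random
from collections import defaultdict

def subject_stratified_folds(subject_labels, n_splits=5, seed=0):
    """Map each subject to a fold index.
    subject_labels: dict of subject_id -> class label (one label per subject).
    Subjects of each class are shuffled and dealt round-robin across folds,
    so every fold receives a similar share of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for subject, label in subject_labels.items():
        by_class[label].append(subject)
    fold_of = {}
    offset = 0  # continue dealing where the previous class stopped, to balance fold sizes
    for label in sorted(by_class):
        subjects = sorted(by_class[label])
        rng.shuffle(subjects)
        for i, subject in enumerate(subjects):
            fold_of[subject] = (i + offset) % n_splits
        offset += len(subjects)
    return fold_of

def split_images(image_subjects, fold_of, test_fold):
    """Return (train_indices, test_indices) for images, given each image's subject.
    All images of a subject land on the same side, which prevents leakage."""
    train = [i for i, s in enumerate(image_subjects) if fold_of[s] != test_fold]
    test = [i for i, s in enumerate(image_subjects) if fold_of[s] == test_fold]
    return train, test
```

Because the split is made at the subject level, several ultrasound images of the same patient can never appear in both the training and the test fold.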

[151] GraViT: Transfer Learning with Vision Transformers and MLP-Mixer for Strong Gravitational Lens Discovery

René Parlange,Juan C. Cuevas-Tello,Octavio Valenzuela,Omar de J. Cabrera-Rosas,Tomás Verdugo,Anupreeta More,Anton T. Jaelani

Main category: cs.CV

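TL;DR: This paper presents GraViT, an automated gravitational lens detection pipeline based on transfer learning with Vision Transformer and MLP-Mixer models, intended to help identify gravitational lenses efficiently in the upcoming LSST survey.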

Details

Motivation: LSST is expected to find a large number of gravitational lenses (on the order of 10^5) over the next decade, which calls for automated classification tools that can process these data efficiently. Method: GraViT is a PyTorch pipeline that detects gravitational lenses using transfer learning together with several training strategies (augmentation, normalization, and optimization). The study assesses how data quality, model architecture, training strategies, and ensemble predictions affect classification performance. Result: The study fine-tunes ten architectures on datasets from HOLISMOKES VI and SuGOHI X, benchmarks them against convolutional baselines, reports complexity and inference-time analysis, and provides insights into the detectability of strong gravitational lenses. Conclusion: GraViT is a deep-learning-based automated lens detection pipeline that makes effective use of pretrained Vision Transformer and MLP-Mixer models to provide efficient lens classification for the upcoming LSST survey. Abstract: Gravitational lensing offers a powerful probe into the properties of dark matter and is crucial to infer cosmological parameters. The Legacy Survey of Space and Time (LSST) is predicted to find O(10^5) gravitational lenses over the next decade, demanding automated classifiers. In this work, we introduce GraViT, a PyTorch pipeline for gravitational lens detection that leverages extensive pretraining of state-of-the-art Vision Transformer (ViT) models and MLP-Mixer. We assess the impact of transfer learning on classification performance by examining data quality (source and sample size), model architecture (selection and fine-tuning), training strategies (augmentation, normalization, and optimization), and ensemble predictions. This study reproduces the experiments in a previous systematic comparison of neural networks and provides insights into the detectability of strong gravitational lenses on that common test sample. We fine-tune ten architectures using datasets from HOLISMOKES VI and SuGOHI X, and benchmark them against convolutional baselines, discussing complexity and inference-time analysis.

[152] A High-Accuracy Fast Hough Transform with Linear-Log-Cubed Computational Complexity for Arbitrary-Shaped Images

Danil Kazimirov,Dmitry Nikolaev

Main category: cs.CV

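TL;DR: This paper introduces FHT2SP, a new fast and highly accurate Hough transform algorithm that extends the superpixel concept and combines it with an existing algorithm to resolve the trade-off between computational complexity and accuracy.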

Details

Motivation: The Hough transform is widely used across many domains, but existing algorithms trade computational complexity against accuracy, so a new algorithm is needed to resolve this trade-off. Method: FHT2SP builds on Brady's superpixel concept, extends it to arbitrary shapes, and integrates it into the FHT2DT algorithm. With an appropriate superpixel size, it achieves near-optimal computational complexity and a controllable error bound. Result: FHT2SP achieves near-optimal computational complexity O(wh ln³w) while keeping the approximation error bounded by a constant that can be controlled via a meta-parameter. Conclusion: The paper proposes FHT2SP, a new Hough transform algorithm that strikes a good balance between computational complexity and accuracy and is therefore of practical value. Abstract: The Hough transform (HT) is a fundamental tool across various domains, from classical image analysis to neural networks and tomography. Two key aspects of the algorithms for computing the HT are their computational complexity and accuracy - the latter often defined as the error of approximation of continuous lines by discrete ones within the image region. The fast HT (FHT) algorithms with optimal linearithmic complexity - such as the Brady-Yong algorithm for power-of-two-sized images - are well established. Generalizations like $FHT2DT$ extend this efficiency to arbitrary image sizes, but with reduced accuracy that worsens with scale. Conversely, accurate HT algorithms achieve constant-bounded error but require near-cubic computational cost. This paper introduces the $FHT2SP$ algorithm - a fast and highly accurate HT algorithm. It builds on our development of Brady's superpixel concept, extending it to arbitrary shapes beyond the original power-of-two square constraint, and integrates it into the $FHT2DT$ algorithm. With an appropriate choice of the superpixel's size, for an image of shape $w \times h$, the $FHT2SP$ algorithm achieves near-optimal computational complexity $\mathcal{O}(wh \ln^3 w)$, while keeping the approximation error bounded by a constant independent of image size, and controllable via a meta-parameter. We provide theoretical and experimental analyses of the algorithm's complexity and accuracy.
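To get a feel for where the stated bound sits between the two extremes named in the abstract, the growth terms can be compared numerically. The sketch below ignores all constant factors, and modeling "near-cubic" as w·h·w is an assumption made only for illustration; the function names are hypothetical.

```python
import math

def linearithmic_ops(w, h):
    """Growth term of the fast, less accurate algorithms: w * h * ln(w)."""
    return w * h * math.log(w)

def fht2sp_ops(w, h):
    """Growth term stated for FHT2SP: w * h * ln(w)^3."""
    return w * h * math.log(w) ** 3

def near_cubic_ops(w, h):
    """Assumed stand-in for the 'near-cubic' cost of accurate algorithms: w * h * w."""
    return w * h * w

def compare(w, h):
    """Return the three growth terms as a dict for a w x h image."""
    return {
        "linearithmic": linearithmic_ops(w, h),
        "fht2sp": fht2sp_ops(w, h),
        "near_cubic": near_cubic_ops(w, h),
    }
```

Since ln³(w) grows far more slowly than w, the FHT2SP term falls increasingly below the cubic stand-in as images get larger, although for small images ln³(w) can exceed w and constants would matter in practice.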

[153] Generative AI for Industrial Contour Detection: A Language-Guided Vision System

Liang Gong,Tommy,Wang,Sara Chaker,Yanchen Dong,Fouad Bousetouane,Brenden Morton,Mark Mendez

Main category: cs.CV

TL;DR: This paper introduces a language-guided generative vision system for precise remnant contour detection in manufacturing, overcoming limitations of traditional computer vision methods.

Details

Motivation: Industrial computer vision systems often face challenges such as noise, material variability, and uncontrolled imaging conditions, which limit the effectiveness of traditional methods. This work aims to develop a more robust and precise system for contour detection in manufacturing. Method: The system uses a three-stage approach: data acquisition and preprocessing, contour generation with a conditional GAN, and multimodal contour refinement using vision-language modeling with human-in-the-loop standardized prompts. Result: The proposed system improved contour fidelity on the FabTrack datasets, enhancing edge continuity and geometric alignment while reducing the need for manual tracing. GPT-image-1 outperformed Gemini 2.0 Flash in structural accuracy and perceptual quality during the refinement stage. Conclusion: The study concludes that VLM-guided generative workflows have the potential to significantly enhance industrial computer vision by overcoming the limitations of classical pipelines, particularly in achieving CAD-level precision in remnant contour detection. Abstract: Industrial computer vision systems often struggle with noise, material variability, and uncontrolled imaging conditions, limiting the effectiveness of classical edge detectors and handcrafted pipelines. In this work, we present a language-guided generative vision system for remnant contour detection in manufacturing, designed to achieve CAD-level precision. The system is organized into three stages: data acquisition and preprocessing, contour generation using a conditional GAN, and multimodal contour refinement through vision-language modeling, where standardized prompts are crafted in a human-in-the-loop process and applied through image-text guided synthesis. On proprietary FabTrack datasets, the proposed system improved contour fidelity, enhancing edge continuity and geometric alignment while reducing manual tracing. 
For the refinement stage, we benchmarked several vision-language models, including Google's Gemini 2.0 Flash, OpenAI's GPT-image-1 integrated within a VLM-guided workflow, and open-source baselines. Under standardized conditions, GPT-image-1 consistently outperformed Gemini 2.0 Flash in both structural accuracy and perceptual quality. These findings demonstrate the promise of VLM-guided generative workflows for advancing industrial computer vision beyond the limitations of classical pipelines.

[154] Language-Aware Information Maximization for Transductive Few-Shot CLIP

Ghassen Baklouti,Maxime Zanella,Ismail Ben Ayed

Main category: cs.CV

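TL;DR: This paper proposes LIMO, which combines information-theoretic concepts with parameter-efficient fine-tuning and achieves substantial performance gains in transductive few-shot CLIP.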

Details

Motivation: Transductive few-shot learning for vision-language models (VLMs) is still at an early stage and existing methods are limited, so transductive few-shot methods tailored to VLMs are needed. Method: The authors propose a new Language-aware Information MaximizatiOn (LIMO) loss that combines mutual information, KL divergence, and cross-entropy terms, and they explore parameter-efficient fine-tuning (PEFT) strategies for transductive few-shot learning. Result: LIMO outperforms existing transductive few-shot CLIP methods by a large margin and yields significant gains over the best-performing inductive methods. Conclusion: LIMO performs strongly in transductive few-shot CLIP, showing that adapting a subset of the model's parameters is promising in this setting and that VLM-tailored methods are essential for transductive few-shot learning. Abstract: Transductive few-shot learning has triggered an abundant literature focusing on vision-only models, but is still at a nascent stage within the recent context of foundational vision-language models (VLMs). Only a few recent methods addressed the problem, pointing to the potential of transduction in VLMs and to the need for VLM-tailored methods. Building on this momentum, we leverage information-theoretic concepts and recent progress in parameter-efficient fine-tuning (PEFT), developing a highly competitive transductive few-shot CLIP method. Specifically, we introduce a novel Language-aware Information MaximizatiOn (LIMO) loss integrating three complementary terms: (i) the mutual information between the vision inputs and the textual class descriptions; (ii) a Kullback-Leibler (KL) divergence penalizing deviation of the network's probabilistic outputs from the text-driven zero-shot predictions; and (iii) a standard cross-entropy loss based on the labeled shots. Furthermore, we challenge the commonly followed fine-tuning practices in the context of transductive few-shot learning, and explore PEFT strategies, completely overlooked in this context. Surprisingly, we observe substantial boosts in performances, which points to the potential of adapting a subset of the model's parameters in the transductive few-shot setting. We report comprehensive evaluations, which show that LIMO outperforms the very recent transductive few-shot CLIP methods by a large margin and yields significant gains over the best-performing inductive methods. Our code is publicly available at https://github.com/ghassenbaklouti/LIMO
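A loss with the three terms listed above could look roughly like the sketch below. It is an illustration under explicit assumptions, not the LIMO formulation itself: mutual information is estimated as marginal entropy minus mean conditional entropy over the unlabeled query predictions, the KL term uses the direction KL(prediction || zero-shot), and the weights `lam_kl` and `lam_ce` are hypothetical.

```python
import math

EPS = 1e-12  # avoids log(0)

def entropy(p):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(x * math.log(x + EPS) for x in p)

def mutual_information(probs):
    """Empirical mutual information between inputs and predicted classes:
    entropy of the mean prediction minus the mean entropy of each prediction."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    return entropy(marginal) - sum(entropy(p) for p in probs) / n

def kl_divergence(p, q):
    """KL(p || q) for two probability vectors."""
    return sum(x * math.log((x + EPS) / (y + EPS)) for x, y in zip(p, q))

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return -sum(math.log(p[y] + EPS) for p, y in zip(probs, labels)) / len(labels)

def limo_style_loss(query_probs, zero_shot_probs, shot_probs, shot_labels,
                    lam_kl=1.0, lam_ce=1.0):
    """Assumed combination: maximize MI, stay close to zero-shot predictions,
    and fit the labeled shots."""
    mi = mutual_information(query_probs)
    kl = sum(kl_divergence(p, z) for p, z in zip(query_probs, zero_shot_probs)) / len(query_probs)
    ce = cross_entropy(shot_probs, shot_labels)
    return -mi + lam_kl * kl + lam_ce * ce
```

Under this reading, confident and class-balanced query predictions that agree with the zero-shot prior receive a lower loss than uniform predictions.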

[155] MorphGen: Morphology-Guided Representation Learning for Robust Single-Domain Generalization in Histopathological Cancer Classification

Hikmat Khan,Syed Farhan Alam Zaidi,Pir Masoom Shah,Kiruthika Balakrishnan,Rabia Khan,Muhammad Waqas,Jia Wu

Main category: cs.CV

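TL;DR: MorphGen is a supervised contrastive learning framework that incorporates nuclear morphology and spatial organization to improve domain generalization and robustness in computational histopathology.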

Details

Motivation: Pathologists rely on morphological features that remain diagnostic across diverse settings, whereas machine learning systems are hindered by heterogeneity in whole slide images (WSIs). Explicitly modeling biologically robust nuclear morphology and spatial organization should therefore allow a model to learn cancer representations that are resilient to domain shifts. Method: The proposed method, MorphGen, integrates histopathology images, augmentations, and nuclear segmentation masks, and aligns the latent representations of images and nuclear masks within a supervised contrastive learning framework. Stochastic weight averaging (SWA) is added to enhance out-of-distribution robustness. Result: Attention map analyses show that MorphGen relies primarily on nuclear morphology, cellular composition, and spatial cell organization within tumor or normal regions for its final classification. MorphGen also demonstrates resilience to image corruptions and adversarial attacks. Conclusion: MorphGen achieves domain generalization in computational histopathology by integrating histopathology images, augmentations, and nuclear segmentation masks in a supervised contrastive learning framework that prioritizes diagnostic features, which improves robustness to image corruptions and adversarial attacks. Abstract: Domain generalization in computational histopathology is hindered by heterogeneity in whole slide images (WSIs), caused by variations in tissue preparation, staining, and imaging conditions across institutions. Unlike machine learning systems, pathologists rely on domain-invariant morphological cues such as nuclear atypia (enlargement, irregular contours, hyperchromasia, chromatin texture, spatial disorganization), structural atypia (abnormal architecture and gland formation), and overall morphological atypia that remain diagnostic across diverse settings. Motivated by this, we hypothesize that explicitly modeling biologically robust nuclear morphology and spatial organization will enable the learning of cancer representations that are resilient to domain shifts. We propose MorphGen (Morphology-Guided Generalization), a method that integrates histopathology images, augmentations, and nuclear segmentation masks within a supervised contrastive learning framework. By aligning latent representations of images and nuclear masks, MorphGen prioritizes diagnostic features such as nuclear and morphological atypia and spatial organization over staining artifacts and domain-specific features. To further enhance out-of-distribution robustness, we incorporate stochastic weight averaging (SWA), steering optimization toward flatter minima. Attention map analyses revealed that MorphGen primarily relies on nuclear morphology, cellular composition, and spatial cell organization within tumors or normal regions for final classification.
Finally, we demonstrate resilience of the learned representations to image corruptions (such as staining artifacts) and adversarial attacks, showcasing not only OOD generalization but also addressing critical vulnerabilities in current deep learning systems for digital pathology. Code, datasets, and trained models are available at: https://github.com/hikmatkhan/MorphGen
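The stochastic weight averaging step mentioned in the abstract amounts to keeping a running mean of the weights visited late in training. A minimal sketch on a flat list of weights is shown below; the function names are hypothetical, and when averaging starts and how the learning rate is scheduled are left out because the abstract does not specify them.

```python
def swa_update(avg_weights, new_weights, n_averaged):
    """Fold one more weight snapshot into the running average.
    n_averaged is the number of snapshots already included in avg_weights."""
    return [a + (w - a) / (n_averaged + 1) for a, w in zip(avg_weights, new_weights)]

def swa_average(snapshots):
    """Average a sequence of weight snapshots incrementally."""
    avg = list(snapshots[0])
    for n, weights in enumerate(snapshots[1:], start=1):
        avg = swa_update(avg, weights, n)
    return avg
```

The incremental form avoids storing every snapshot, and the averaged weights are the ones used for evaluation.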

[156] Towards Adaptive Visual Token Pruning for Large Multimodal Models

Hao Zhang,Mengsi Lyu,Chenrui He,Yulong Ao,Yonghua Lin

Main category: cs.CV

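TL;DR: This paper proposes a mutual-information-based visual token pruning strategy that preserves cross-modal alignment and intra-modal information diversity, substantially reducing inference time and computational cost while maintaining strong model performance.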

Details

Motivation: In large multimodal models (LMMs), the growing number of visual and textual tokens raises computational and memory costs during inference, and existing token pruning methods tend to retain redundant tokens. Method: After analyzing the redundancy differences between visual and textual tokens, the authors prune visual tokens only. They first use mutual information to remove visual tokens that are semantically misaligned with the textual tokens, then remove redundant visual tokens by maximizing pairwise distances in the embedding space, which is solved efficiently with a greedy algorithm. Result: Experiments show that the method reduces the token count by 88.9% on models such as LLaVA-1.5-7B and LLaVA-NEXT-7B and improves inference speed by 56.7% while maintaining strong performance. Conclusion: Pruning visual tokens effectively lowers the computational overhead of multimodal models and offers a viable route to efficient multimodal inference. Abstract: Large Multimodal Models (LMMs) have achieved significant success across various tasks. These models usually encode visual inputs into dense token sequences, which are then concatenated with textual tokens and jointly processed by a language model. However, the increased token count substantially raises computational and memory costs during inference. Token pruning has emerged as a promising approach to address this issue. Existing token pruning methods often rely on costly calibration or suboptimal importance metrics, leading to redundant retained tokens. In this paper, we analyze the redundancy differences between visual and textual tokens and propose pruning exclusively on visual tokens. Based on this, we propose a visual token pruning strategy that explicitly preserves both cross-modal alignment and intra-modal informational diversity. We introduce a mutual information-based token pruning strategy that removes visual tokens semantically misaligned with textual tokens, effectively preserving the alignment between the visual and textual modalities. To further improve the representational quality of the retained tokens, we additionally prune redundant visual tokens by maximizing the expected pairwise distances in the embedding space, which is solved efficiently with a greedy algorithm. Extensive experiments demonstrate that our method maintains strong performance while reducing tokens by 88.9% on models such as LLaVA-1.5-7B and LLaVA-NEXT-7B, resulting in a 56.7% improvement in inference speed.
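The two pruning stages described above can be sketched as follows. This is a simplified illustration with assumptions stated up front: the per-token `alignment` scores are taken as precomputed (the paper derives them from mutual information with the text tokens), and the diversity stage uses a simple greedy rule that adds the token with the largest summed Euclidean distance to the tokens already kept. The function names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prune_tokens(tokens, alignment, keep_aligned, keep_final):
    """Return the sorted indices of the visual tokens to keep.
    tokens: list of embedding vectors.
    alignment: assumed precomputed text-alignment score per token.
    keep_aligned: number of tokens surviving the alignment stage.
    keep_final: number of tokens surviving the diversity stage."""
    # Stage 1: drop tokens that are poorly aligned with the text.
    candidates = sorted(range(len(tokens)), key=lambda i: alignment[i], reverse=True)
    candidates = candidates[:keep_aligned]
    # Stage 2: greedily grow a diverse subset, seeded with the best-aligned token.
    selected = [candidates[0]]
    remaining = candidates[1:]
    while remaining and len(selected) < keep_final:
        best = max(remaining, key=lambda i: sum(dist(tokens[i], tokens[j]) for j in selected))
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)
```

The greedy rule is a heuristic for the pairwise-distance objective rather than an exact solver, which is what keeps the selection cheap at inference time.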

[157] CryptoFace: End-to-End Encrypted Face Recognition

Wei Ao,Vishnu Naresh Boddeti

Main category: cs.CV

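TL;DR: CryptoFace is a face recognition system based on fully homomorphic encryption that protects privacy while improving inference speed and verification accuracy.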

Details

Motivation: Face recognition is central to many applications but carries privacy risks, particularly from unauthorized access to sensitive biometric data. Method: The authors introduce a mixture of shallow patch convolutional networks that supports higher-dimensional tensors through patch-based processing while reducing multiplicative depth and inference latency. Result: On standard face recognition benchmarks, CryptoFace significantly accelerates inference and increases verification accuracy. Conclusion: CryptoFace is an encrypted face recognition system that processes facial data across all stages without exposing raw images or features, while significantly improving inference speed and verification accuracy. Abstract: Face recognition is central to many authentication, security, and personalized applications. Yet, it suffers from significant privacy risks, particularly arising from unauthorized access to sensitive biometric data. This paper introduces CryptoFace, the first end-to-end encrypted face recognition system with fully homomorphic encryption (FHE). It enables secure processing of facial data across all stages of a face-recognition process--feature extraction, storage, and matching--without exposing raw images or features. We introduce a mixture of shallow patch convolutional networks to support higher-dimensional tensors via patch-based processing while reducing the multiplicative depth and, thus, inference latency. Parallel FHE evaluation of these networks ensures near-resolution-independent latency. On standard face recognition benchmarks, CryptoFace significantly accelerates inference and increases verification accuracy compared to the state-of-the-art FHE neural networks adapted for face recognition. CryptoFace will facilitate secure face recognition systems requiring robust and provable security. The code is available at https://github.com/human-analysis/CryptoFace.

[158] LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables

Xunpeng Yi,Yibing Zhang,Xinyu Xiang,Qinglong Yan,Han Xu,Jiayi Ma

Main category: cs.CV

TL;DR: This paper introduces LUT-Fuse, a fast and efficient method for infrared and visible image fusion using learnable lookup tables and distillation, enabling real-time performance on low-power devices.

Details

Motivation: Current research focuses on improving fusion performance while neglecting real-time applicability; this work aims to bridge that gap with a fast and practical fusion method. Method: The paper proposes a look-up table (LUT) structure using low-order approximation encoding and high-level joint contextual scene encoding, combined with a LUT distillation strategy to achieve fast and accurate multi-modal image fusion. Result: The proposed method achieves significant improvements in speed, requiring less than one-tenth of the time of current lightweight SOTA algorithms, while maintaining high fusion quality and stability. Conclusion: LUT-Fuse is a novel, efficient, and reliable approach for infrared and visible image fusion, particularly suitable for real-time and low-power devices. Abstract: Current advanced research on infrared and visible image fusion primarily focuses on improving fusion performance, often neglecting applicability on real-time fusion devices. In this paper, we propose a novel approach towards extremely fast fusion via distillation to learnable look-up tables specifically designed for image fusion, termed LUT-Fuse. First, we develop a look-up table structure that utilizes low-order approximation encoding and high-level joint contextual scene encoding, which is well-suited for multi-modal fusion. Moreover, given the lack of ground truth in multi-modal image fusion, we propose an efficient LUT distillation strategy instead of traditional quantization LUT methods. By integrating the performance of the multi-modal fusion network (MM-Net) into the MM-LUT model, our method achieves significant breakthroughs in efficiency and performance. It typically requires less than one-tenth of the time compared to the current lightweight SOTA fusion algorithms, ensuring high operational speed across various scenarios, even in low-power mobile devices.
Extensive experiments validate the superiority, reliability, and stability of our fusion approach. The code is available at https://github.com/zyb5/LUT-Fuse.

[159] Target-Oriented Single Domain Generalization

Marzi Heidari,Yuhong Guo

Main category: cs.CV

TL;DR: This paper introduces STAR, a method that uses textual descriptions of target environments to improve model generalization under distribution shifts, achieving superior performance without requiring target data.

Details

Motivation: Existing SDG methods overlook the potential of textual descriptions of target environments, focusing instead on source data augmentation or invariant feature learning. This work aims to utilize readily available textual metadata to guide model generalization. Method: STAR introduces a lightweight module that uses text embeddings from visual-language models (e.g., CLIP) to align source features toward the target domain. It employs spectral projection and vision-language distillation, along with feature-space Mixup, to improve generalization. Result: STAR outperforms existing methods on various image classification and object detection benchmarks under distribution shifts, demonstrating the value of incorporating textual descriptions into the generalization process. Conclusion: The proposed STAR method effectively enhances model generalization for deployment in unseen target domains by leveraging minimal textual metadata, opening new possibilities for robust model deployment. Abstract: Deep models trained on a single source domain often fail catastrophically under distribution shifts, a critical challenge in Single Domain Generalization (SDG). While existing methods focus on augmenting source data or learning invariant features, they neglect a readily available resource: textual descriptions of the target deployment environment. We propose Target-Oriented Single Domain Generalization (TO-SDG), a novel problem setup that leverages the textual description of the target domain, without requiring any target data, to guide model generalization. To address TO-SDG, we introduce Spectral TARget Alignment (STAR), a lightweight module that injects target semantics into source features by exploiting visual-language models (VLMs) such as CLIP. 
STAR uses a target-anchored subspace derived from the text embedding of the target description to recenter image features toward the deployment domain, then utilizes spectral projection to retain directions aligned with target cues while discarding source-specific noise. Moreover, we use a vision-language distillation to align backbone features with VLM's semantic geometry. STAR further employs feature-space Mixup to ensure smooth transitions between source and target-oriented representations. Experiments across various image classification and object detection benchmarks demonstrate STAR's superiority. This work establishes that minimal textual metadata, which is a practical and often overlooked resource, significantly enhances generalization under severe data constraints, opening new avenues for deploying robust models in target environments with unseen data.

[160] AQFusionNet: Multimodal Deep Learning for Air Quality Index Prediction with Imagery and Sensor Data

Koushik Ahmed Kushal,Abdullah Al Mamun

Main category: cs.CV

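TL;DR: This study proposes AQFusionNet, a multimodal deep learning framework for air quality monitoring that fuses imagery and sensor data to deliver efficient and accurate AQI prediction, and is particularly suited to resource-constrained regions.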

Details

Motivation: Air quality monitoring in resource-constrained regions remains challenging because of sparse sensor deployment and limited infrastructure. Method: The study proposes AQFusionNet, a multimodal deep learning framework that combines ground-level atmospheric imagery with pollutant concentration data, processes the images with lightweight CNN backbones (MobileNetV2, ResNet18, EfficientNet-B0), and fuses visual and sensor features through semantically aligned embedding spaces. Result: Experiments on more than 8,000 samples from India and Nepal show that AQFusionNet consistently outperforms unimodal baselines, reaching up to 92.02% classification accuracy and an RMSE of 7.70 with the EfficientNet-B0 backbone, an 18.5% improvement over single-modality approaches at low computational overhead. Conclusion: AQFusionNet offers a scalable and practical AQI monitoring solution for infrastructure-limited environments and provides robust predictions even when sensors are only partially available. Abstract: Air pollution monitoring in resource-constrained regions remains challenging due to sparse sensor deployment and limited infrastructure. This work introduces AQFusionNet, a multimodal deep learning framework for robust Air Quality Index (AQI) prediction. The framework integrates ground-level atmospheric imagery with pollutant concentration data using lightweight CNN backbones (MobileNetV2, ResNet18, EfficientNet-B0). Visual and sensor features are combined through semantically aligned embedding spaces, enabling accurate and efficient prediction. Experiments on more than 8,000 samples from India and Nepal demonstrate that AQFusionNet consistently outperforms unimodal baselines, achieving up to 92.02% classification accuracy and an RMSE of 7.70 with the EfficientNet-B0 backbone. The model delivers an 18.5% improvement over single-modality approaches while maintaining low computational overhead, making it suitable for deployment on edge devices. AQFusionNet provides a scalable and practical solution for AQI monitoring in infrastructure-limited environments, offering robust predictive capability even under partial sensor availability.

[161] Iterative Low-rank Network for Hyperspectral Image Denoising

Jin Ye,Fengchao Xiong,Jun Zhou,Yuntao Qian

Main category: cs.CV

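TL;DR: This paper proposes a novel iterative low-rank network (ILRNet), based on low-rank and sparse representation, for hyperspectral image (HSI) denoising, and achieves state-of-the-art performance on both synthetic and real-world noise removal tasks.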

Details

Motivation: The challenge is to make full use of the physical properties of HSIs for effective denoising while preserving image details. Method: The authors propose a new iterative low-rank network (ILRNet) that combines model-driven and data-driven approaches by embedding a rank minimization module (RMM) within a U-Net architecture. Result: Experiments show that ILRNet achieves state-of-the-art performance in both synthetic and real-world noise removal tasks. Conclusion: ILRNet achieves state-of-the-art denoising performance on both synthetic and real-world noise. Abstract: Hyperspectral image (HSI) denoising is a crucial preprocessing step for subsequent tasks. The clean HSI usually resides in a low-dimensional subspace, which can be captured by low-rank and sparse representation, known as the physical prior of HSI. It is generally challenging to adequately use such physical properties for effective denoising while preserving image details. This paper introduces a novel iterative low-rank network (ILRNet) to address these challenges. ILRNet integrates the strengths of model-driven and data-driven approaches by embedding a rank minimization module (RMM) within a U-Net architecture. This module transforms feature maps into the wavelet domain and applies singular value thresholding (SVT) to the low-frequency components during the forward pass, leveraging the spectral low-rankness of HSIs in the feature domain. The parameter, closely related to the hyperparameter of the singular value thresholding algorithm, is adaptively learned from the data, allowing for flexible and effective capture of low-rankness across different scenarios. Additionally, ILRNet features an iterative refinement process that adaptively combines intermediate denoised HSIs with noisy inputs. This ensures progressive enhancement and superior preservation of image details. Experimental results demonstrate that ILRNet achieves state-of-the-art performance in both synthetic and real-world noise removal tasks.
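Singular value thresholding, which the rank minimization module relies on, soft-shrinks the singular values of a matrix and rebuilds it from the shrunken spectrum. The sketch below shows only that step and assumes the SVD factors `U`, `s`, `Vt` have already been computed by some linear algebra routine. The fixed threshold `tau` is a stand-in for the parameter ILRNet learns from data, the wavelet transform is omitted, and the function names are hypothetical.

```python
def shrink(singular_values, tau):
    """Soft-threshold singular values: subtract tau and clip at zero."""
    return [max(s - tau, 0.0) for s in singular_values]

def svt_from_factors(U, s, Vt, tau):
    """Rebuild U * diag(shrink(s, tau)) * Vt from precomputed SVD factors.
    U: m x r matrix, s: r singular values, Vt: r x n matrix (lists of lists)."""
    s_shrunk = shrink(s, tau)
    rows, cols, rank = len(U), len(Vt[0]), len(s_shrunk)
    return [
        [sum(U[i][k] * s_shrunk[k] * Vt[k][j] for k in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]
```

Singular values below `tau` are zeroed out, so the reconstruction has lower rank than the input, which is how the operator suppresses noise that is spread across many small singular values.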

[162] SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding

Zhen Chen,Xingjian Luo,Kun Yuan,Jinlin Wu,Danny T. M. Chan,Nassir Navab,Hongbin Liu,Zhen Lei,Jiebo Luo

Main category: cs.CV

TL;DR: SurgLLM is a large multimodal model designed to improve surgical video understanding by enhancing spatial focus and temporal awareness, leading to better performance on various tasks compared to existing methods.

Details

Motivation: The motivation is to overcome the limitations of inadequate visual content perception and insufficient temporal awareness in current surgical video understanding systems, which hinder the development of effective Computer-Assisted Surgery solutions. Method: The SurgLLM framework uses Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for spatial focus through instrument-centric Masked Video Reconstruction and multimodal alignment. Temporal-aware Multimodal Tuning (TM-Tuning) enhances temporal reasoning with interleaved multimodal embeddings. A Surgical Task Dynamic Ensemble is used to handle multiple tasks efficiently. Result: Experiments on diverse surgical video understanding tasks show significant improvements over state-of-the-art approaches, validating the effectiveness of SurgLLM. Conclusion: The SurgLLM framework significantly improves surgical video understanding tasks, demonstrating the importance of enhanced spatial focus and temporal awareness in developing versatile CAS solutions. Abstract: Surgical video understanding is crucial for facilitating Computer-Assisted Surgery (CAS) systems. Despite significant progress in existing studies, two major limitations persist, including inadequate visual content perception and insufficient temporal awareness in surgical videos, and hinder the development of versatile CAS solutions. In this work, we propose the SurgLLM framework, an effective large multimodal model tailored for versatile surgical video understanding tasks with enhanced spatial focus and temporal awareness. Specifically, to empower the spatial focus of surgical videos, we first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM, by performing instrument-centric Masked Video Reconstruction (MV-Recon) and subsequent multimodal alignment. 
To incorporate surgical temporal knowledge into SurgLLM, we further propose Temporal-aware Multimodal Tuning (TM-Tuning) to enhance temporal reasoning with interleaved multimodal embeddings. Moreover, to accommodate various understanding tasks of surgical videos without conflicts, we devise a Surgical Task Dynamic Ensemble to efficiently triage a query with optimal learnable parameters in our SurgLLM. Extensive experiments performed on diverse surgical video understanding tasks, including captioning, general VQA, and temporal VQA, demonstrate significant improvements over the state-of-the-art approaches, validating the effectiveness of our SurgLLM in versatile surgical video understanding. The source code is available at https://github.com/franciszchen/SurgLLM.

[163] A Multimodal Head and Neck Cancer Dataset for AI-Driven Precision Oncology

Numan Saeed,Salma Hassan,Shahad Hardan,Ahmed Aly,Darya Taratynova,Umair Nawaz,Ufaq Khan,Muhammad Ridzuan,Thomas Eugene,Raphaël Metz,Mélanie Dore,Gregory Delpon,Vijay Ram Kumar Papineni,Kareem Wahid,Cem Dede,Alaa Mohamed Shawky Ali,Carlos Sjogreen,Mohamed Naser,Clifton D. Fuller,Valentin Oreiller,Mario Jreige,John O. Prior,Catherine Cheze Le Rest,Olena Tankyevych,Pierre Decazes,Su Ruan,Stephanie Tanadini-Lang,Martin Vallières,Hesham Elhalawani,Ronan Abgral,Romain Floch,Kevin Kerleguer,Ulrike Schick,Maelle Mauguen,Vincent Andrearczyk,Adrien Depeursinge,Mathieu Hatt,Arman Rahmim,Mohammad Yaqub

Main category: cs.CV

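TL;DR: This paper presents a publicly available head and neck cancer PET/CT imaging dataset of 1123 studies that supports tumor segmentation, survival prediction, and HPV classification.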

Details

Motivation: To advance head and neck cancer research by providing a public, expert-annotated multimodal imaging dataset that supports the development of automated and predictive models. Method: The dataset comprises 1123 FDG-PET/CT studies, all manually segmented by experienced radiation oncologists and radiologists, and provides segmentation masks, radiotherapy dose distributions (for a subset of patients), and clinical metadata. Result: The dataset reflects real-world clinical diversity, and benchmark results are provided using state-of-the-art deep learning models such as UNet, SegResNet, and multimodal prognostic frameworks. Conclusion: The paper provides a public, multi-center head and neck cancer PET/CT imaging dataset that can be used for key clinical tasks including automated tumor segmentation, recurrence-free survival prediction, and HPV status classification. Abstract: We describe a publicly available multimodal dataset of annotated Positron Emission Tomography/Computed Tomography (PET/CT) studies for head and neck cancer research. The dataset includes 1123 FDG-PET/CT studies from patients with histologically confirmed head and neck cancer, acquired from 10 international medical centers. All examinations consisted of co-registered PET/CT scans with varying acquisition protocols, reflecting real-world clinical diversity across institutions. Primary gross tumor volumes (GTVp) and involved lymph nodes (GTVn) were manually segmented by experienced radiation oncologists and radiologists following standardized guidelines and quality control measures. We provide anonymized NIfTI files of all studies, along with expert-annotated segmentation masks, radiotherapy dose distribution for a subset of patients, and comprehensive clinical metadata. This metadata includes TNM staging, HPV status, demographics (age and gender), long-term follow-up outcomes, survival times, censoring indicators, and treatment information. We demonstrate how this dataset can be used for three key clinical tasks: automated tumor segmentation, recurrence-free survival prediction, and HPV status classification, providing benchmark results using state-of-the-art deep learning models, including UNet, SegResNet, and multimodal prognostic frameworks.

[164] Two Causes, Not One: Rethinking Omission and Fabrication Hallucinations in MLLMs

Guangzong Si,Hao Yin,Xianfei Li,Qing Ding,Wenlong Liao,Tao He,Pai Peng

Main category: cs.CV

TL;DR: This paper challenges the common assumption behind hallucinations in MLLMs and proposes a new framework and calibration method (VPFC) to effectively mitigate both omission and fabrication hallucinations.

Details

Motivation: The motivation is to address persistent object hallucination issues in Multimodal Large Language Models (MLLMs), particularly the flawed assumption that omission and fabrication hallucinations have a common cause. Method: The authors conducted visual attention intervention experiments and introduced the Visual-Semantic Attention Potential Field framework, leading to the development of the Visual Potential Field Calibration (VPFC) method for hallucination mitigation. Result: The proposed Visual Potential Field Calibration (VPFC) method successfully reduces omission hallucinations without increasing fabrication hallucinations, offering a more balanced hallucination mitigation strategy. Conclusion: The study concludes that omission hallucinations in MLLMs are due to insufficient confidence in mapping visual features to linguistic expressions, while fabrication hallucinations stem from spurious associations in the cross-modal representation space. The proposed VPFC method effectively mitigates these issues. Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive advances, yet object hallucination remains a persistent challenge. Existing methods, based on the flawed assumption that omission and fabrication hallucinations share a common cause, often reduce omissions only to trigger more fabrications. In this work, we overturn this view by demonstrating that omission hallucinations arise from insufficient confidence when mapping perceived visual features to linguistic expressions, whereas fabrication hallucinations result from spurious associations within the cross-modal representation space due to statistical biases in the training corpus. Building on findings from visual attention intervention experiments, we propose the Visual-Semantic Attention Potential Field, a conceptual framework that reveals how the model constructs visual evidence to infer the presence or absence of objects. 
Leveraging this insight, we introduce Visual Potential Field Calibration (VPFC), a plug-and-play hallucination mitigation method that effectively reduces omission hallucinations without introducing additional fabrication hallucinations. Our findings reveal a critical oversight in current object hallucination research and chart new directions for developing more robust and balanced hallucination mitigation strategies.

[165] Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models

Sihao Wu,Gaojie Jin,Wei Huang,Jianhong Wang,Xiaowei Huang

Main category: cs.CV

TL;DR: This paper proposes SPO-VLM, a two-stage defense framework for Vision Language Models (VLMs), which enhances robustness against adversarial attacks while preserving visual understanding. Stage I uses activation-level intervention with adaptive steering vectors, while Stage II employs sequence-level preference optimization with toxicity and visual-consistency assessments.

Details

Motivation: The motivation is to address the vulnerability of Vision Language Models (VLMs) to adversarial attacks and overcome the limitations of existing activation steering approaches, which often rely on task-specific contrastive prompts that degrade visual grounding performance. Method: The paper introduces a two-stage defense framework called Sequence-Level Preference Optimization for VLM (SPO-VLM). Stage I computes adaptive layer-specific steering vectors to suppress harmful behaviors, while Stage II optimizes these vectors using sequence-level preference, incorporating toxicity assessment and visual-consistency rewards. Result: The experiments show that SPO-VLM improves safety against adversarial attacks through activation steering and preference optimization, while maintaining strong performance on benign tasks and visual understanding. Conclusion: SPO-VLM successfully enhances the robustness and safety of Vision Language Models (VLMs) against adversarial attacks while maintaining their visual understanding capabilities. Abstract: Vision Language Models (VLMs) have demonstrated impressive capabilities in integrating visual and textual information for understanding and reasoning, but remain highly vulnerable to adversarial attacks. While activation steering has emerged as a promising defense, existing approaches often rely on task-specific contrastive prompts to extract harmful directions, which exhibit suboptimal performance and can degrade visual grounding performance. To address these limitations, we propose Sequence-Level Preference Optimization for VLM (SPO-VLM), a novel two-stage defense framework that combines activation-level intervention with policy-level optimization to enhance model robustness. In Stage I, we compute adaptive layer-specific steering vectors from diverse data sources, enabling generalized suppression of harmful behaviors during inference. 
In Stage II, we refine these steering vectors through a sequence-level preference optimization process. This stage integrates automated toxicity assessment, as well as visual-consistency rewards based on caption-image alignment, to achieve safe and semantically grounded text generation. The two-stage structure of SPO-VLM balances efficiency and effectiveness by combining a lightweight mitigation foundation in Stage I with deeper policy refinement in Stage II. Extensive experiments show that SPO-VLM enhances safety against attacks via activation steering and preference optimization, while maintaining strong performance on benign tasks without compromising visual understanding capabilities. We will release our code, model weights, and evaluation toolkit to support reproducibility and future research. Warning: This paper may contain examples of offensive or harmful text and images.
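
For readers unfamiliar with activation steering, the general idea behind Stage I can be sketched as follows. This is not the SPO-VLM procedure: the abstract does not say how the adaptive, layer-specific vectors are computed, so the sketch assumes the common difference-of-means construction and a fixed steering strength. All function names and numbers are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def steering_vector(harmful_acts, benign_acts):
    """Assumed construction: mean harmful activation minus mean benign activation."""
    harmful_mean = mean_vector(harmful_acts)
    benign_mean = mean_vector(benign_acts)
    return [h - b for h, b in zip(harmful_mean, benign_mean)]

def apply_steering(hidden, direction, strength):
    """Shift a hidden state away from the (assumed) harmful direction."""
    return [h - strength * d for h, d in zip(hidden, direction)]

# Toy two-dimensional activations for a single layer (hypothetical values).
harmful = [[1.0, 0.0], [3.0, 0.0]]
benign = [[0.0, 0.0], [0.0, 0.0]]
direction = steering_vector(harmful, benign)
steered = apply_steering([5.0, 1.0], direction, strength=0.5)
```

In a real model the same shift would be applied to the hidden states of selected layers at inference time. Stage II, as described in the abstract, would then refine the vectors with preference optimization instead of leaving them fixed.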

[166] Adaptive Point-Prompt Tuning: Fine-Tuning Heterogeneous Foundation Models for 3D Point Cloud Analysis

Mengke Li,Lihao Chen,Peng Zhang,Yiu-ming Cheung,Hui Huang

Main category: cs.CV

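TL;DR: This paper proposes APPT, a method that fine-tunes pre-trained models with a small number of parameters and processes point clouds directly, offering an efficient and general solution for 3D point cloud analysis.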

Details

Motivation: Point cloud data is scarce, which makes pre-training large 3D models challenging. Existing methods that apply pre-trained visual models to 3D domains often lose spatial geometric information and lack a general adaptation framework. Method: Raw point clouds are converted into point embeddings, a permutation-invariant feature captures the relative positions of the point embeddings, and a prompt generator that shares weights with the point embedding module dynamically produces point-prompts that are concatenated into the frozen foundation model. Result: The method processes point clouds directly without heterogeneous mappings and supplies rich global structural information that compensates for the missing structural context in heterogeneous data. Conclusion: The paper proposes Adaptive Point-Prompt Tuning (APPT), which fine-tunes pre-trained models with few parameters and processes point clouds directly, providing a new way to calibrate heterogeneous foundation models for 3D point cloud analysis. Abstract: Parameter-efficient fine-tuning strategies for foundation models in 1D textual and 2D visual analysis have demonstrated remarkable efficacy. However, due to the scarcity of point cloud data, pre-training large 3D models remains a challenging task. While many efforts have been made to apply pre-trained visual models to 3D domains through "high-to-low" mapping, these approaches often lead to the loss of spatial geometries and lack a generalizable framework for adapting any modality to 3D. This paper, therefore, attempts to directly leverage point features to calibrate the heterogeneous foundation model of any modality for 3D point cloud analysis. Specifically, we propose the Adaptive Point-Prompt Tuning (APPT) method, which fine-tunes pre-trained models with a modest number of parameters, enabling direct point cloud processing without heterogeneous mappings. We convert raw point clouds into point embeddings by aggregating local geometry to capture spatial features followed by linear layers to ensure seamless utilization of frozen pre-trained models. Given the inherent disorder of point clouds, in contrast to the structured nature of images and language, we employ a permutation-invariant feature to capture the relative positions of point embeddings, thereby obtaining point tokens enriched with location information to optimize self-attention mechanisms. To calibrate self-attention across source domains of any modality to 3D and reduce computational overhead, we introduce a prompt generator that shares weights with the point embedding module, dynamically producing point-prompts without adding additional parameters. 
These prompts are then concatenated into a frozen foundation model, providing rich global structural information and compensating for the lack of structural context in the heterogeneous data.
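
The abstract does not specify which permutation-invariant feature APPT uses. As a minimal illustration of the property itself, the sketch below assumes a channel-wise max pooling over per-point features, a common symmetric function in point cloud networks, and shows that reordering the points leaves the pooled feature unchanged. The function name and the toy features are hypothetical.

```python
def max_pool_points(point_features):
    """Channel-wise max over a list of equal-length per-point feature vectors."""
    return [max(column) for column in zip(*point_features)]

# Three points with three feature channels each (hypothetical values).
points = [[0.1, 2.0, -1.0], [0.7, 0.5, 3.0], [-0.2, 1.5, 0.0]]
shuffled = [points[2], points[0], points[1]]  # same points, different order

pooled = max_pool_points(points)
pooled_shuffled = max_pool_points(shuffled)
```

Because `max` ignores input order, both calls return the same vector. That is the behavior an unordered point set requires of any global feature.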

[167] NoiseCutMix: A Novel Data Augmentation Approach by Mixing Estimated Noise in Diffusion Models

Shumpei Takezaki,Ryoma Bise,Shinnosuke Matsuo

Main category: cs.CV

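TL;DR: This study develops NoiseCutMix, a data augmentation method that brings the CutMix concept into diffusion models to generate natural, high-resolution images that fuse the characteristics of two classes.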

Details

Motivation: CutMix and similar techniques often produce unnatural boundaries between the two images when combining images from multiple classes, and this study aims to address that problem. Method: The paper proposes NoiseCutMix, a data augmentation method that introduces the CutMix concept into the generation process of diffusion models. Result: Classification experiments verify the method's effectiveness against conventional augmentation techniques that combine multiple classes, random image generation with Stable Diffusion, and combinations of these methods. Conclusion: By partially combining the estimated noise corresponding to two different classes in a diffusion model, NoiseCutMix generates natural, high-resolution images with the fused characteristics of both classes. Abstract: In this study, we propose a novel data augmentation method that introduces the concept of CutMix into the generation process of diffusion models, thereby exploiting both the ability of diffusion models to generate natural and high-resolution images and the characteristic of CutMix, which combines features from two classes to create diverse augmented data. Representative data augmentation methods for combining images from multiple classes include CutMix and MixUp. However, techniques like CutMix often result in unnatural boundaries between the two images due to contextual differences. Therefore, in this study, we propose a method, called NoiseCutMix, to achieve natural, high-resolution image generation featuring the fused characteristics of two classes by partially combining the estimated noise corresponding to two different classes in a diffusion model. In the classification experiments, we verified the effectiveness of the proposed method by comparing it with conventional data augmentation techniques that combine multiple classes, random image generation using Stable Diffusion, and combinations of these methods. Our code is available at: https://github.com/shumpei-takezaki/NoiseCutMix
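
The core operation described above, partially combining the noise estimated for two classes, can be sketched with a CutMix-style binary mask. This is an illustration under assumptions, not the released implementation: the rectangular mask, the array shapes, and the function names are hypothetical, and the noise arrays stand in for the outputs of a class-conditional noise predictor at one denoising step.

```python
import numpy as np

def box_mask(height, width, top, left, box_h, box_w):
    """Binary mask that is 1 inside a rectangle and 0 elsewhere."""
    mask = np.zeros((height, width), dtype=np.float32)
    mask[top:top + box_h, left:left + box_w] = 1.0
    return mask

def mix_estimated_noise(eps_a, eps_b, mask):
    """Use class A's noise where mask == 1 and class B's noise elsewhere.

    eps_a, eps_b: arrays of shape (channels, height, width).
    mask: array of shape (height, width), broadcast over channels.
    """
    return mask[None] * eps_a + (1.0 - mask[None]) * eps_b

# Placeholder noise estimates for two class conditions (hypothetical values).
eps_a = np.full((3, 4, 4), 1.0, dtype=np.float32)
eps_b = np.full((3, 4, 4), -1.0, dtype=np.float32)
mask = box_mask(4, 4, top=0, left=0, box_h=2, box_w=2)
mixed = mix_estimated_noise(eps_a, eps_b, mask)
```

Mixing in noise space rather than pixel space lets the remaining denoising steps blend the two regions, which is the intuition the abstract gives for avoiding the hard seams of pixel-level CutMix.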

[168] Domain Adaptation-Based Crossmodal Knowledge Distillation for 3D Semantic Segmentation

Jialiang Kang,Jiawen Wang,Dingsheng Luo

Main category: cs.CV

TL;DR: This paper proposes two crossmodal knowledge distillation methods, UDAKD and FSKD, to reduce reliance on costly 3D LiDAR annotations by leveraging 2D image models and synchronized camera-LiDAR data in autonomous driving scenarios.

Details

Motivation: Semantic segmentation of 3D LiDAR data is crucial for autonomous driving but is costly and time-consuming due to reliance on annotated data. In contrast, 2D image datasets are abundant and scalable, prompting the need for methods that reduce dependency on 3D annotations. Method: Two crossmodal knowledge distillation methods, UDAKD and FSKD, were proposed. These methods align the outputs of 2D and 3D networks using spatio-temporally synchronized data, leveraging self-calibrated convolution for domain adaptation. Result: The proposed methods were validated through rigorous experimentation, showing consistent improvements over state-of-the-art approaches in semantic segmentation of 3D LiDAR data. Conclusion: The proposed crossmodal knowledge distillation methods, UDAKD and FSKD, effectively reduce the reliance on 3D LiDAR annotations by leveraging 2D image models and synchronized camera-LiDAR data, outperforming state-of-the-art approaches. Abstract: Semantic segmentation of 3D LiDAR data plays a pivotal role in autonomous driving. Traditional approaches rely on extensive annotated data for point cloud analysis, incurring high costs and time investments. In contrast, real-world image datasets offer abundant availability and substantial scale. To mitigate the burden of annotating 3D LiDAR point clouds, we propose two crossmodal knowledge distillation methods: Unsupervised Domain Adaptation Knowledge Distillation (UDAKD) and Feature and Semantic-based Knowledge Distillation (FSKD). Leveraging readily available spatio-temporally synchronized data from cameras and LiDARs in autonomous driving scenarios, we directly apply a pretrained 2D image model to unlabeled 2D data. Through crossmodal knowledge distillation with known 2D-3D correspondence, we actively align the output of the 3D network with the corresponding points of the 2D network, thereby obviating the necessity for 3D annotations. 
Our focus is on preserving modality-general information while filtering out modality-specific details during crossmodal distillation. To achieve this, we deploy self-calibrated convolution on 3D point clouds as the foundation of our domain adaptation module. Rigorous experimentation validates the effectiveness of our proposed methods, consistently surpassing the performance of state-of-the-art approaches in the field.
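
The alignment step described in the abstract, matching the 3D network's output to the 2D network's output at corresponding points, can be sketched as a distillation loss. The abstract does not state which loss UDAKD or FSKD uses, so the KL divergence between softmax outputs below is an assumption, as are the function names and the representation of the 2D-3D correspondence as index pairs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits_2d, student_logits_3d, pairs):
    """Mean KL between 2D teacher and 3D student over matched (pixel, point) pairs."""
    if not pairs:
        return 0.0
    total = 0.0
    for pixel_idx, point_idx in pairs:
        teacher = softmax(teacher_logits_2d[pixel_idx])
        student = softmax(student_logits_3d[point_idx])
        total += kl_divergence(teacher, student)
    return total / len(pairs)
```

For example, `distillation_loss([[2.0, 0.0]], [[2.0, 0.0]], [(0, 0)])` is zero because the student already matches the teacher, and any disagreement makes the loss positive. No 3D labels appear anywhere in the computation, which is the point of the approach.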

[169] Visually Grounded Narratives: Reducing Cognitive Burden in Researcher-Participant Interaction

Runtong Wu,Jiayao Song,Fei Teng,Xianhao Ren,Yuyan Gao,Kailun Yang

Main category: cs.CV

TL;DR: This study introduces NAME, a novel paradigm for narrative inquiry that transforms research documents into story images, significantly reducing cognitive burdens and improving narrative quality and efficiency.

Details

Motivation: The immense burden of transforming various data forms into coherent narratives and engaging in member checking motivated the need for a more efficient and participant-friendly approach to narrative making and representation. Method: The study introduces NAME, a new paradigm for narrative inquiry, which includes an actor location and shape module for plausible image generation. It also proposes robust evaluation metrics to measure the perceptual quality and narrative consistency of generated characters. Performance is evaluated across different data partitioning schemes using metrics like FID. Result: The NAME paradigm reduces the FID score from 195 to 152 using only 0.96% of the available data, compared to the baseline that relies on 100%. It also achieves significant improvements in FID scores under different data splits and scores higher on the newly introduced evaluation metric (3.62 vs. 2.66). Conclusion: The proposed NAME paradigm significantly reduces the cognitive burden associated with narrative inquiry by transforming research documents into coherent story images, demonstrating superior performance in both efficiency and narrative quality. Abstract: Narrative inquiry has been one of the prominent application domains for the analysis of human experience, aiming to know more about the complexity of human society. However, researchers are often required to transform various forms of data into coherent hand-drafted narratives in storied form throughout narrative analysis, which brings an immense burden of data analysis. Participants, too, are expected to engage in member checking and presentation of these narrative products, which involves reviewing and responding to large volumes of documents. Given the dual burden and the need for more efficient and participant-friendly approaches to narrative making and representation, we made a first attempt: (i) a new paradigm is proposed, NAME, as the initial attempt to push the field of narrative inquiry. 
NAME is able to transform research documents into coherent story images, alleviating the cognitive burden of interpreting extensive text-based materials during member checking for both researchers and participants. (ii) We develop an actor location and shape module to facilitate plausible image generation. (iii) We have designed a set of robust evaluation metrics comprising three key dimensions to objectively measure the perceptual quality and narrative consistency of generated characters. Our approach consistently demonstrates state-of-the-art performance across different data partitioning schemes. Remarkably, while the baseline relies on the full 100% of the available data, our method requires only 0.96% yet still reduces the FID score from 195 to 152. Under identical data volumes, our method delivers substantial improvements: for the 70:30 split, the FID score decreases from 175 to 152, and for the 95:5 split, it is nearly halved from 96 to 49. Furthermore, the proposed model achieves a score of 3.62 on the newly introduced metric, surpassing the baseline score of 2.66.

[170] HERO-VQL: Hierarchical, Egocentric and Robust Visual Query Localization

Joohyun Chang,Soyeon Hong,Hyogun Lee,Seong Jong Ha,Dongho Lee,Seong Tae Kim,Jinwoo Choi

Main category: cs.CV

TL;DR: This paper introduces HERO-VQL, a novel method for visual query localization in egocentric videos, leveraging attention guidance and augmentation-based consistency training to achieve robust performance despite challenging conditions.

Details

Motivation: Egocentric videos involve frequent and abrupt viewpoint changes that cause object appearance variations and partial occlusions, which existing methods struggle to handle effectively. Method: The paper proposes HERO-VQL, which includes Top-down Attention Guidance (TAG) to refine attention using class tokens and score maps, and Egocentric Augmentation based Consistency Training (EgoACT) to enhance query diversity and simulate extreme viewpoints. Result: Extensive experiments on the VQ2D dataset demonstrate that HERO-VQL significantly outperforms baselines in handling egocentric challenges. Conclusion: HERO-VQL effectively addresses the challenges of object appearance variations and partial occlusions in egocentric videos, outperforming existing methods on the VQ2D dataset. Abstract: In this work, we tackle the egocentric visual query localization (VQL), where a model should localize the query object in a long-form egocentric video. Frequent and abrupt viewpoint changes in egocentric videos cause significant object appearance variations and partial occlusions, making it difficult for existing methods to achieve accurate localization. To tackle these challenges, we introduce Hierarchical, Egocentric and RObust Visual Query Localization (HERO-VQL), a novel method inspired by human cognitive process in object recognition. We propose i) Top-down Attention Guidance (TAG) and ii) Egocentric Augmentation based Consistency Training (EgoACT). Top-down Attention Guidance refines the attention mechanism by leveraging the class token for high-level context and principal component score maps for fine-grained localization. To enhance learning in diverse and challenging matching scenarios, EgoAug enhances query diversity by replacing the query with a randomly selected corresponding object from groundtruth annotations and simulates extreme viewpoint changes by reordering video frames. 
Additionally, the CT loss enforces stable object localization across different augmentation scenarios. Extensive experiments on the VQ2D dataset validate that HERO-VQL effectively handles egocentric challenges, significantly outperforming baselines.

[171] Double-Constraint Diffusion Model with Nuclear Regularization for Ultra-low-dose PET Reconstruction

Mengxiao Geng,Ran Hong,Bingxuan Li,Qiegen Liu

Main category: cs.CV

TL;DR: This paper introduces DCDM, a novel method for ultra-low-dose PET reconstruction that reduces radiation exposure and improves flexibility without compromising image quality.

Details

Motivation: To reduce patient radiation exposure and shorten examination times while maintaining PET image quality. Method: Double-Constraint Diffusion Model (DCDM) with Nuclear Transformer Constraint (NTC) and Encoding Nexus Constraint (ENC). Result: DCDM outperforms state-of-the-art methods on known and unknown dose reduction factors, even at ultra-low dose levels. Conclusion: DCDM is a promising approach for ultra-low-dose PET reconstruction, offering superior performance and flexibility without full retraining. Abstract: Ultra-low-dose positron emission tomography (PET) reconstruction holds significant potential for reducing patient radiation exposure and shortening examination times. However, it may also lead to increased noise and reduced imaging detail, which could decrease the image quality. In this study, we present a Double-Constraint Diffusion Model (DCDM), which freezes the weights of a pre-trained diffusion model and injects a trainable double-constraint controller into the encoding architecture, greatly reducing the number of trainable parameters for ultra-low-dose PET reconstruction. Unlike full fine-tuning models, DCDM can adapt to different dose levels without retraining all model parameters, thereby improving reconstruction flexibility. Specifically, the two constraint modules, named the Nuclear Transformer Constraint (NTC) and the Encoding Nexus Constraint (ENC), serve to refine the pre-trained diffusion model. The NTC leverages the nuclear norm as an approximation for matrix rank minimization, integrates the low-rank property into the Transformer architecture, and enables efficient information extraction from low-dose images and conversion into compressed feature representations in the latent space. Subsequently, the ENC utilizes these compressed feature representations to encode and control the pre-trained diffusion model, ultimately obtaining reconstructed PET images in the pixel space. 
In clinical reconstruction, the compressed feature representations from NTC help select the most suitable ENC for efficient unknown low-dose PET reconstruction. Experiments conducted on the UDPET public dataset and the Clinical dataset demonstrated that DCDM outperforms state-of-the-art methods on known dose reduction factors (DRF) and generalizes well to unknown DRF scenarios, proving valuable even at ultra-low dose levels, such as 1% of the full dose.

[172] DAOVI: Distortion-Aware Omnidirectional Video Inpainting

Ryosuke Seshimo,Mariko Isogawa

Main category: cs.CV

TL;DR: This paper proposes DAOVI, a deep learning model for omnidirectional video inpainting that effectively tackles geometric distortion, outperforming current methods.

Details

Motivation: Omnidirectional videos often contain unwanted objects due to their wide field of view, and existing video inpainting methods do not account for the geometric distortion inherent in equirectangular projections of such videos. Method: The paper introduces a deep learning model named Distortion-Aware Omnidirectional Video Inpainting (DAOVI), which incorporates a module for evaluating temporal motion information based on geodesic distance and a depth-aware feature propagation module. Result: Experimental results show that DAOVI outperforms existing methods both quantitatively and qualitatively in addressing geometric distortion in omnidirectional video inpainting. Conclusion: The proposed DAOVI method effectively addresses the issue of geometric distortion in omnidirectional video inpainting and outperforms existing methods. Abstract: Omnidirectional videos that capture the entire surroundings are employed in a variety of fields such as VR applications and remote sensing. However, their wide field of view often causes unwanted objects to appear in the videos. This problem can be addressed by video inpainting, which enables the natural removal of such objects while preserving both spatial and temporal consistency. Nevertheless, most existing methods assume processing ordinary videos with a narrow field of view and do not tackle the distortion in equirectangular projection of omnidirectional videos. To address this issue, this paper proposes a novel deep learning model for omnidirectional video inpainting, called Distortion-Aware Omnidirectional Video Inpainting (DAOVI). DAOVI introduces a module that evaluates temporal motion information in the image space considering geodesic distance, as well as a depth-aware feature propagation module in the feature space that is designed to address the geometric distortion inherent to omnidirectional videos. 
The experimental results demonstrate that our proposed method outperforms existing methods both quantitatively and qualitatively.
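
To see why geodesic distance matters for equirectangular frames, the sketch below maps pixel coordinates to longitude and latitude on the unit sphere and measures the great-circle distance with the haversine formula. It is a geometric illustration only, not DAOVI's motion module. The pixel-to-angle convention (x spans 360 degrees of longitude, y spans 180 degrees of latitude from top to bottom) and the function names are assumptions.

```python
import math

def pixel_to_sphere(x, y, width, height):
    """Map an equirectangular pixel to (longitude, latitude) in radians."""
    lon = (x / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (y / height) * math.pi
    return lon, lat

def geodesic_distance(p, q, width, height):
    """Great-circle distance on the unit sphere between two pixels (haversine)."""
    lon1, lat1 = pixel_to_sphere(p[0], p[1], width, height)
    lon2, lat2 = pixel_to_sphere(q[0], q[1], width, height)
    a = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * math.asin(min(1.0, math.sqrt(a)))

# The same 90-pixel horizontal offset at the equator and near the top of the frame.
WIDTH, HEIGHT = 360, 180
d_equator = geodesic_distance((0, 90), (90, 90), WIDTH, HEIGHT)
d_near_pole = geodesic_distance((0, 10), (90, 10), WIDTH, HEIGHT)
d_wrap = geodesic_distance((0, 90), (359, 90), WIDTH, HEIGHT)
```

An identical pixel offset corresponds to a much shorter distance on the sphere near the poles than at the equator, and pixels at opposite image edges are actually neighbors. Motion measured directly in pixel space ignores both effects.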

[173] DevilSight: Augmenting Monocular Human Avatar Reconstruction through a Virtual Perspective

Yushuo Chen,Ruizhi Shao,Youxin Pang,Hongwen Zhang,Xinyi Wu,Rihui Wu,Yebin Liu

Main category: cs.CV

TL;DR: This paper proposes a novel framework using Human4DiT and additional strategies to improve the reconstruction of human avatars from monocular videos, outperforming existing methods.

Details

Motivation: The motivation is to overcome the limitations in capturing fine-grained dynamic details and generating plausible details at novel viewpoints in existing approaches due to limited representational capacity and insufficient data. Method: The method involves using the Human4DiT model to generate human motions from alternative perspectives as supervision signals, injecting physical identity through video fine-tuning for motion consistency, and employing a patch-based denoising algorithm for higher-resolution outputs. Result: The experimental results show that the proposed method outperforms state-of-the-art approaches, effectively enriches details in unseen regions, and mitigates artifacts in avatar representation. Conclusion: The proposed framework successfully reconstructs human avatars from monocular videos by leveraging the video generative model Human4DiT and introduces strategies to enhance video generation, outperforming recent state-of-the-art approaches. Abstract: We present a novel framework to reconstruct human avatars from monocular videos. Recent approaches have struggled either to capture the fine-grained dynamic details from the input or to generate plausible details at novel viewpoints, which mainly stem from the limited representational capacity of the avatar model and insufficient observational data. To overcome these challenges, we propose to leverage the advanced video generative model, Human4DiT, to generate the human motions from alternative perspectives as an additional supervision signal. This approach not only enriches the details in previously unseen regions but also effectively regularizes the avatar representation to mitigate artifacts. Furthermore, we introduce two complementary strategies to enhance video generation: To ensure consistent reproduction of human motion, we inject the physical identity into the model through video fine-tuning. 
For higher-resolution outputs with finer details, a patch-based denoising algorithm is employed. Experimental results demonstrate that our method outperforms recent state-of-the-art approaches and validate the effectiveness of our proposed strategies.

[174] LightVLM: Accelerating Large Multimodal Models with Pyramid Token Merging and KV Cache Compression

Lianyu Hu,Fanhua Shang,Wei Feng,Liang Wan

Main category: cs.CV

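TL;DR: LightVLM accelerates inference for vision-language models by using pyramid token merging to reduce the token count during encoding and KV Cache compression during decoding, improving model efficiency and network throughput.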

Details

Motivation: Existing vision-language models (VLMs) are slow at inference, which limits their deployment in real-world applications. Method: The inference process of VLMs is divided into two stages, encoding and decoding. During encoding, pyramid token merging reduces the number of tokens at different layers; during decoding, KV Cache compression removes unnecessary caches. Result: LightVLM retains 100% performance while keeping only 35% of image tokens, and about 98% performance while keeping only 3%. It increases network throughput by 2.02$\times$ and reduces prefilling time by 3.65$\times$. When generating long text sequences, inference time is reduced by 3.21$\times$. Conclusion: LightVLM is an effective method for accelerating VLM inference that significantly improves model efficiency and may facilitate the real-world deployment of large VLMs. Abstract: In this paper, we introduce LightVLM, a simple but effective method that can be seamlessly deployed upon existing Vision-Language Models (VLMs) to greatly accelerate the inference process in a training-free manner. We divide the inference procedure of VLMs into two stages, i.e., encoding and decoding, and propose to simultaneously accelerate VLMs in both stages to largely improve model efficiency. During encoding, we propose pyramid token merging to reduce tokens of different LLM layers in a hierarchical manner by finally only keeping a few dominant tokens to achieve high efficiency. During decoding, aimed at reducing the high latency of outputting long sequences, we propose KV Cache compression to remove unnecessary caches to increase the network throughput. Experimental results show that LightVLM successfully retains 100% performance when only preserving 35% image tokens, and maintains around 98% performance when keeping only 3% image tokens. LightVLM could increase the network throughput by 2.02$\times$ and reduce the prefilling time by 3.65$\times$. LightVLM also makes large VLMs faster again by enabling a heavy model (e.g., InternVL2.5 26B) to infer faster than significantly smaller models (e.g., InternVL2.5 8B), hopefully facilitating the real-world deployment. When generating long text sequences (e.g., 4096 tokens), LightVLM could reduce the inference time by 3.21$\times$, largely outperforming existing methods.
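The idea of merging tokens hierarchically across layers can be illustrated with a toy sketch. This is not LightVLM's algorithm: the greedy adjacent-pair rule, the plain averaging, the function names, and the keep ratios below are all assumptions chosen only to show how a shrinking per-layer token budget might be applied.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(tokens, keep):
    """Greedily average the most similar adjacent pair until `keep` tokens remain.

    Hypothetical merge rule: a plain mean of the two vectors, ignoring how many
    original tokens each merged token already represents.
    """
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(keep, 1):
        i = max(range(len(tokens) - 1),
                key=lambda k: cosine(tokens[k], tokens[k + 1]))
        merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]
    return tokens

def pyramid_merge(tokens, keep_ratios):
    """Apply one merge step per 'layer'; each ratio is relative to the original count."""
    n0 = len(tokens)
    counts = []
    for ratio in keep_ratios:
        tokens = merge_tokens(tokens, max(1, round(n0 * ratio)))
        counts.append(len(tokens))
    return tokens, counts

if __name__ == "__main__":
    toy_tokens = [[float(i), 1.0] for i in range(20)]
    _, per_layer = pyramid_merge(toy_tokens, [0.7, 0.35, 0.1])
    print(per_layer)  # token count remaining after each layer
```

A real system would merge inside the attention layers of the model and would likely weight merged tokens by size; the sketch only shows the shrinking-budget schedule.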

[175] Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation

Xuechao Zou,Shun Zhang,Xing Fu,Yue Li,Kai Li,Yushe Cao,Congyan Lang,Pin Tao,Junliang Xing

Main category: cs.CV

TL;DR: This paper proposes Face-MoGLE, a novel framework for controllable face generation that balances semantic controllability and photorealism, offering high-quality results and robust zero-shot generalization.

Details

Motivation: Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. Existing approaches struggle with disentangling semantic controls from generation pipelines. Method: Face-MoGLE introduces three features: semantic-decoupled latent modeling through mask-conditioned space factorization, a mixture of global and local experts, and a dynamic gating network producing time-dependent coefficients. Result: Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings and its robust zero-shot generalization capability. Conclusion: Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Abstract: Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. While existing approaches struggle with disentangling semantic controls from generation pipelines, we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) Semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) A mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; (3) A dynamic gating network producing time-dependent coefficients that evolve with diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. 
Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings and its robust zero-shot generalization capability. Project page is available at https://github.com/XavierJiezou/Face-MoGLE.

[176] SemaMIL: Semantic Reordering with Retrieval-Guided State Space Modeling for Whole Slide Image Classification

Lubin Gan,Xiaoman Wu,Jing Zhang,Zhifeng Wang,Linhao Qu,Siying Wu,Xiaoyan Sun

Main category: cs.CV

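TL;DR: This paper proposes SemaMIL, a new method for extracting features from whole slide images in computational pathology. It combines Semantic Reordering with a Semantic-guided Retrieval State Space Module to address the limitations of existing methods, and performs well in accuracy, computational cost, and parameter count.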

Details

Motivation: Attention-based MIL methods tend to overlook contextual relationships; Transformer models incur quadratic computational cost and are prone to overfitting; and state space models (SSMs), despite their linear complexity, lose histological meaning and interpretability when patch order is shuffled. Method: SemaMIL combines Semantic Reordering (SR) with a Semantic-guided Retrieval State Space Module (SRSM) to strengthen global modeling. Result: Evaluation on four WSI subtype datasets shows that SemaMIL outperforms strong baselines. Conclusion: SemaMIL outperforms existing methods in computational pathology, achieving state-of-the-art accuracy with fewer FLOPs and parameters. Abstract: Multiple instance learning (MIL) has become the leading approach for extracting discriminative features from whole slide images (WSIs) in computational pathology. Attention-based MIL methods can identify key patches but tend to overlook contextual relationships. Transformer models are able to model interactions but require quadratic computational cost and are prone to overfitting. State space models (SSMs) offer linear complexity, yet shuffling patch order disrupts histological meaning and reduces interpretability. In this work, we introduce SemaMIL, which integrates Semantic Reordering (SR), an adaptive method that clusters and arranges semantically similar patches in sequence through a reversible permutation, with a Semantic-guided Retrieval State Space Module (SRSM) that chooses a representative subset of queries to adjust state space parameters for improved global modeling. Evaluation on four WSI subtype datasets shows that, compared to strong baselines, SemaMIL achieves state-of-the-art accuracy with fewer FLOPs and parameters.

[177] Stage-wise Adaptive Label Distribution for Facial Age Estimation

Bo Wu,Zhiqi Ai,Jun Jiang,Congcong Zhu,Shugong Xu

Main category: cs.CV

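TL;DR: This paper proposes the SA-LDL algorithm, which improves the accuracy and robustness of age estimation by accounting for the stage-wise patterns of age label ambiguity.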

Details

Motivation: Existing age estimation methods overlook the fact that the degree of label ambiguity varies across age stages; the complex structure of label ambiguity needs to be captured more precisely. Method: The Stage-wise Adaptive Label Distribution Learning (SA-LDL) algorithm combines stage-wise adaptive variance modeling with a weighted loss function. Result: SA-LDL achieves MAEs of 1.74 and 2.15 on the MORPH-II and FG-NET datasets, respectively, showing competitive performance. Conclusion: Through stage-wise adaptive label distribution learning, SA-LDL addresses label ambiguity in age estimation and yields more accurate and robust estimates. Abstract: Label ambiguity poses a significant challenge in age estimation tasks. Most existing methods address this issue by modeling correlations between adjacent age groups through label distribution learning. However, they often overlook the varying degrees of ambiguity present across different age stages. In this paper, we propose a Stage-wise Adaptive Label Distribution Learning (SA-LDL) algorithm, which leverages the observation -- revealed through our analysis of embedding similarities between an anchor and all other ages -- that label ambiguity exhibits clear stage-wise patterns. By jointly employing stage-wise adaptive variance modeling and a weighted loss function, SA-LDL effectively captures the complex and structured nature of label ambiguity, leading to more accurate and robust age estimation. Extensive experiments demonstrate that SA-LDL achieves competitive performance, with MAEs of 1.74 and 2.15 on the MORPH-II and FG-NET datasets, respectively.
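Label distribution learning with a stage-dependent variance can be sketched as follows. This is a minimal illustration under stated assumptions, not the SA-LDL implementation: the Gaussian form, the stage boundaries, and the sigma values in `STAGES` are hypothetical, and the paper's learned variance modeling and weighted loss are not reproduced.

```python
import math

# Hypothetical stages: (upper age bound, standard deviation of the label distribution).
# Narrower distributions for younger ages are an assumption made for illustration.
STAGES = [(18, 1.0), (50, 2.0), (100, 3.0)]

def stage_sigma(age, stages=STAGES):
    """Return the standard deviation assigned to the stage containing `age`."""
    for upper, sigma in stages:
        if age <= upper:
            return sigma
    return stages[-1][1]

def label_distribution(age, stages=STAGES, max_age=100):
    """Turn a scalar age label into a discrete Gaussian over ages 0..max_age."""
    sigma = stage_sigma(age, stages)
    weights = [math.exp(-((k - age) ** 2) / (2 * sigma ** 2))
               for k in range(max_age + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_age(dist):
    """Point estimate: the expectation of the age distribution."""
    return sum(k * p for k, p in enumerate(dist))

if __name__ == "__main__":
    child = label_distribution(10)
    adult = label_distribution(30)
    print(max(child), max(adult))  # the child label is more sharply peaked
    print(expected_age(adult))
```

A model trained against such targets would typically minimize a divergence between its predicted distribution and the target; how SA-LDL weights that loss per stage is described only at a high level in the abstract.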

[178] Encoder-Only Image Registration

Xiang Chen,Renjiu Hu,Jinwei Zhang,Yuxi Zhang,Xinyao Yue,Min Liu,Yaonan Wang,Hang Zhang

Main category: cs.CV

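TL;DR: This paper proposes EOIR, a new image registration framework that separates feature learning from flow estimation to achieve a better accuracy-efficiency trade-off.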

Details

Motivation: Learning-based deformable image registration still faces high computational complexity and difficulty handling large deformations; the paper analyzes the role that convolutional neural networks play in registration in order to address these problems. Method: The Encoder-Only Image Registration (EOIR) framework separates feature learning from flow estimation, using a 3-layer ConvNet for feature extraction and a set of 3-layer flow estimators to construct a Laplacian feature pyramid. Result: Experiments show that EOIR achieves better trade-offs among accuracy, efficiency, and smoothness than existing methods; the source code is to be released publicly. Conclusion: EOIR achieves better accuracy-efficiency and accuracy-smoothness trade-offs on five datasets covering different modalities and anatomical regions. Abstract: Learning-based techniques have significantly improved the accuracy and speed of deformable image registration. However, challenges such as reducing computational complexity and handling large deformations persist. To address these challenges, we analyze how convolutional neural networks (ConvNets) influence registration performance using the Horn-Schunck optical flow equation. Supported by prior studies and our empirical experiments, we observe that ConvNets play two key roles in registration: linearizing local intensities and harmonizing global contrast variations. Based on these insights, we propose the Encoder-Only Image Registration (EOIR) framework, designed to achieve a better accuracy-efficiency trade-off. EOIR separates feature learning from flow estimation, employing only a 3-layer ConvNet for feature extraction and a set of 3-layer flow estimators to construct a Laplacian feature pyramid, progressively composing diffeomorphic deformations under a large-deformation model. Results on five datasets across different modalities and anatomical regions demonstrate EOIR's effectiveness, achieving superior accuracy-efficiency and accuracy-smoothness trade-offs. With comparable accuracy, EOIR provides better efficiency and smoothness, and vice versa. The source code of EOIR will be publicly available at https://github.com/XiangChen1994/EOIR.

[179] Exploring Decision-Making Capabilities of LLM Agents: An Experimental Study on Jump-Jump Game

Juwu Li

Main category: cs.CV

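TL;DR: The Jump-Jump game is used to test LLM capabilities in complex decision-making, including spatial reasoning, physical modeling, and strategic planning.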

Details

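Motivation: The Jump-Jump game is simple yet challenging, which makes it an effective test of LLM decision-making across multiple tasks. Method: By analyzing the gameplay mechanics of Jump-Jump, the paper examines LLM performance in spatial reasoning, physical modeling, and strategic planning. Result: The study indicates that Jump-Jump involves multiple cognitive abilities and is a suitable environment for evaluating LLM performance. Conclusion: The Jump-Jump game provides a suitable testing environment for studying the decision-making capabilities of large language models (LLMs) and highlights their potential in complex tasks. Abstract: The Jump-Jump game, as a simple yet challenging casual game, provides an ideal testing environment for studying LLM decision-making capabilities. The game requires players to precisely control jumping force based on current position and target platform distance, involving multiple cognitive aspects including spatial reasoning, physical modeling, and strategic planning. It illustrates the basic gameplay mechanics of the Jump-Jump game, where the player character (red circle) must jump across platforms with appropriate force to maximize score.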

[180] VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding

Zhihong Zhang,Xiaojian Huang,Jin Xu,Zhuodong Luo,Xinzhi Wang,Jiansheng Wei,Xuejin Chen

Main category: cs.CV

TL;DR: This paper introduces VideoRewardBench, a new comprehensive benchmark for evaluating multimodal reward models (MRMs) in video understanding. It presents a large-scale dataset and evaluates 28 MRMs, revealing key insights about their performance and generalization capabilities.

Details

Motivation: Existing benchmarks for evaluating multimodal reward models (MRMs) in the video domain have limitations in question diversity, evaluation dimensions, and model assessment. This work aims to address these gaps with a more robust and comprehensive benchmark. Method: The authors introduce VideoRewardBench, a comprehensive benchmark for video understanding, and curate a high-quality preference dataset using an AI-assisted data pipeline. They evaluate 28 multimodal reward models across three categories: generative, discriminative, and semi-scalar. Result: Even the top-performing model, GPT-4o, achieved only 57.0% overall accuracy, while the state-of-the-art open-source model Qwen2.5-VL-72B achieved merely 53.3%. The study revealed insights into the performance of MRMs, including the effects of reinforcement learning, inference-time scaling, and input video frame variations. Conclusion: VideoRewardBench offers a challenging and valuable benchmark for advancing the evaluation and development of MRMs in the video domain. Abstract: Multimodal reward models (MRMs) play a crucial role in the training, inference, and evaluation of Large Vision Language Models (LVLMs) by assessing response quality. However, existing benchmarks for evaluating MRMs in the video domain suffer from a limited number and diversity of questions, a lack of comprehensive evaluation dimensions, and inadequate evaluation of diverse types of MRMs. To address these gaps, we introduce VideoRewardBench, the first comprehensive benchmark covering four core aspects of video understanding: perception, knowledge, reasoning, and safety. Through our AI-assisted data pipeline, we curate a high-quality preference dataset of 1,563 annotated samples, including 1,482 unique videos and 1,559 distinct questions--15 times the number found in the most question-rich prior benchmark. Each sample is a triplet consisting of a video-text prompt, a chosen response, and a rejected response. 
We also conduct a comprehensive evaluation across 28 multimodal reward models spanning three categories: generative, discriminative, and semi-scalar. Results show that even the top-performing model GPT-4o achieves only 57.0% overall accuracy, and the state-of-the-art open-source model Qwen2.5-VL-72B reaches merely 53.3%. Our analysis further reveals three key insights: (i) MRMs trained with reinforcement learning (RL) do not necessarily exhibit stronger cross-modal generalization than those trained without RL; (ii) except for discriminative MRMs, other types of MRMs across varying model capacities can benefit from inference-time scaling; and (iii) variations in input video frame count have different effects on different types of MRMs. We believe VideoRewardBench offers a challenging and valuable benchmark for advancing the evaluation and development of MRMs in the video domain.
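Accuracy on preference triplets of this kind is commonly computed as the fraction of samples where the reward model ranks the chosen response above the rejected one. The sketch below assumes a scalar-scoring reward model; the `score_fn` interface, the dictionary keys, and the treatment of ties as errors are illustrative assumptions, not the benchmark's official protocol.

```python
def preference_accuracy(samples, score_fn):
    """Fraction of triplets where the chosen response outscores the rejected one.

    `samples` is a list of dicts with hypothetical keys "prompt", "chosen", and
    "rejected"; `score_fn(prompt, response)` returns a scalar reward.
    Ties are counted as errors here, which is an assumption.
    """
    if not samples:
        raise ValueError("no samples to evaluate")
    correct = sum(
        1 for s in samples
        if score_fn(s["prompt"], s["chosen"]) > score_fn(s["prompt"], s["rejected"])
    )
    return correct / len(samples)

if __name__ == "__main__":
    # Toy data and a toy reward (response length) used only to exercise the function.
    toy = [
        {"prompt": "p1", "chosen": "a detailed answer", "rejected": "no"},
        {"prompt": "p2", "chosen": "ok", "rejected": "a long but wrong answer"},
    ]
    print(preference_accuracy(toy, lambda prompt, response: len(response)))
```

Generative reward models that output a verdict instead of a score would need a different comparison step, such as parsing which response the model names as better.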

[181] Multi-Focused Video Group Activities Hashing

Zhongmiao Qi,Yan Jiang,Bolin Zhang,Lijun Guo,Chong Wang,Qiangbo Qian

Main category: cs.CV

TL;DR: This paper proposes two novel video hashing techniques, STVH and M-STVH, to improve the efficiency of retrieving group activities from complex video data.

Details

Motivation: With the explosive growth of video data, quickly retrieving group activities has become urgent, but many tasks can only retrieve videos at the video level, not the activity level. Method: The paper proposes a unified framework called STVH to model individual object dynamics and group interactions, and an enhanced version M-STVH to handle multi-focused feature retrieval tasks. Result: The proposed methods (STVH and M-STVH) achieve excellent results on publicly available datasets. Conclusion: STVH and M-STVH methods can achieve excellent results in video retrieval tasks, proving their effectiveness in handling complex scenarios. Abstract: With the explosive growth of video data in various complex scenarios, quickly retrieving group activities has become an urgent problem. However, many tasks can only retrieve videos focusing on an entire video, not the activity granularity. To solve this problem, we propose a new STVH (spatiotemporal interleaved video hashing) technique for the first time. Through a unified framework, the STVH simultaneously models individual object dynamics and group interactions, capturing the spatiotemporal evolution on both group visual features and positional features. Moreover, in real-life video retrieval scenarios, it may sometimes require activity features, while at other times, it may require visual features of objects. We then further propose a novel M-STVH (multi-focused spatiotemporal video hashing) as an enhanced version to handle this difficult task. The advanced method incorporates hierarchical feature integration through multi-focused representation learning, allowing the model to jointly focus on activity semantics features and object visual features. We conducted comparative experiments on publicly available datasets, and both STVH and M-STVH can achieve excellent results.
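Hashing-based retrieval in general works by mapping each item to a short binary code and ranking database items by Hamming distance to the query code. The sketch below shows only that generic retrieval step; the sign-threshold binarization and all names are assumptions, and it does not model how STVH or M-STVH learn their codes.

```python
def binarize(features):
    """Sign-threshold a real-valued feature vector into an integer bit code."""
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")

def retrieve(query_code, database, top_k=3):
    """Return the names of the `top_k` database entries closest in Hamming distance."""
    ranked = sorted(database.items(), key=lambda item: hamming(query_code, item[1]))
    return [name for name, _ in ranked[:top_k]]

if __name__ == "__main__":
    # Hypothetical 4-bit codes for three video clips.
    db = {"clip_a": 0b1010, "clip_b": 0b0101, "clip_c": 0b1000}
    query = binarize([0.5, -1.0, 2.0, -0.1])
    print(retrieve(query, db, top_k=2))
```

Because comparing codes is a single XOR and bit count, such retrieval stays cheap even for large video collections, which is the usual motivation for hashing.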

[182] TRUST: Token-dRiven Ultrasound Style Transfer for Cross-Device Adaptation

Nhat-Tuong Do-Tran,Ngoc-Hoang-Lam Le,Ian Chiu,Po-Tsun Paul Kuo,Ching-Chun Huang

Main category: cs.CV

TL;DR: TRUST is a novel framework for unpaired image-to-image translation that improves ultrasound image adaptation for downstream tasks by preserving content and transferring relevant style features effectively.

Details

Motivation: Ultrasound images from different devices show diverse styles, which degrade the performance of downstream tasks. Existing UI2I methods lack explicit filtering of relevant style features, leading to misalignment with task needs. Method: TRUST utilizes a token-driven dual-stream framework with a Token-dRiven (TR) module and auxiliary prompts to filter relevant style features while preserving source content. Result: Experimental results show that TRUST surpasses existing UI2I methods in both visual quality and downstream task performance on ultrasound datasets. Conclusion: TRUST effectively bridges the style gap in ultrasound images from different devices, outperforming existing methods in visual quality and downstream task performance. Abstract: Ultrasound images acquired from different devices exhibit diverse styles, resulting in decreased performance of downstream tasks. To mitigate the style gap, unpaired image-to-image (UI2I) translation methods aim to transfer images from a source domain, corresponding to new device acquisitions, to a target domain where a frozen task model has been trained for downstream applications. However, existing UI2I methods have not explicitly considered filtering the most relevant style features, which may result in translated images misaligned with the needs of downstream tasks. In this work, we propose TRUST, a token-driven dual-stream framework that preserves source content while transferring the common style of the target domain, ensuring that content and style remain unblended. Given multiple styles in the target domain, we introduce a Token-dRiven (TR) module that operates from two perspectives: (1) a data view--selecting "suitable" target tokens corresponding to each source token, and (2) a model view--identifying "optimal" target tokens for the downstream model, guided by a behavior mirror loss.
Additionally, we inject auxiliary prompts into the source encoder to match content representation with downstream behavior. Experimental results on ultrasound datasets demonstrate that TRUST outperforms existing UI2I methods in both visual quality and downstream task performance.

[183] Make me an Expert: Distilling from Generalist Black-Box Models into Specialized Models for Semantic Segmentation

Yasser Benigmim,Subhankar Roy,Khalid Oublal,Imad Eddine Marouf,Slim Essid,Vicky Kalogeiton,Stéphane Lathuilière

Main category: cs.CV

TL;DR: This paper introduces ATGC, a method for adapting local models using black-box APIs by dynamically selecting optimal input resolutions through attention-guided scaling, achieving strong performance with minimal API outputs.

Details

Motivation: The paper addresses the challenge of training local models using black-box AI models that only provide one-hot predictions, without access to weights, training data, or logits. This is a practical constraint in domain adaptation scenarios, especially with the increasing use of AI as a Service (AIaaS). Method: The proposed method, ATtention-Guided sCaler (ATGC), uses DINOv2 attention maps and scores them with entropy to identify informative scales for pseudo-labelling, which helps in dynamically selecting optimal scales for black-box model inference. Result: Experiments show that ATGC achieves substantial improvements in black-box supervision across multiple datasets while only requiring one-hot API predictions, demonstrating its effectiveness in overcoming the 'curse of resolution'. Conclusion: ATGC effectively addresses the 'curse of resolution' in the Black-Box Distillation (B2D) setting by leveraging DINOv2 attention maps to dynamically select optimal scales for pseudo-labelling, enabling effective distillation with only one-hot API predictions. Abstract: The rise of Artificial Intelligence as a Service (AIaaS) democratizes access to pre-trained models via Application Programming Interfaces (APIs), but also raises a fundamental question: how can local models be effectively trained using black-box models that do not expose their weights, training data, or logits, a constraint in which current domain adaptation paradigms are impractical? To address this challenge, we introduce the Black-Box Distillation (B2D) setting, which enables local model adaptation under realistic constraints: (1) the API model is open-vocabulary and trained on large-scale general-purpose data, and (2) access is limited to one-hot predictions only. We identify that open-vocabulary models exhibit significant sensitivity to input resolution, with different object classes being segmented optimally at different scales, a limitation termed the "curse of resolution".
Our method, ATtention-Guided sCaler (ATGC), addresses this challenge by leveraging DINOv2 attention maps to dynamically select optimal scales for black-box model inference. ATGC scores the attention maps with entropy to identify informative scales for pseudo-labelling, enabling effective distillation. Experiments demonstrate substantial improvements under black-box supervision across multiple datasets while requiring only one-hot API predictions. Our code is available at https://github.com/yasserben/ATGC.
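Scoring attention maps with entropy to pick a scale can be sketched as below. The abstract does not say how the entropy is normalized or whether low or high entropy marks an informative scale, so this sketch assumes that a lower normalized entropy (a more focused map) is preferred; the function names and the toy maps are hypothetical, and no DINOv2 call is involved.

```python
import math

def normalized_entropy(attn):
    """Shannon entropy of a non-negative 2D attention map, scaled to [0, 1].

    Dividing by log(number of cells) makes maps of different resolutions
    comparable; this normalization is an assumption, not taken from the paper.
    """
    flat = [v for row in attn for v in row]
    total = sum(flat)
    if total <= 0 or len(flat) < 2:
        return 0.0
    probs = [v / total for v in flat if v > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(flat))

def select_scale(attn_by_scale):
    """Pick the scale whose attention map has the lowest normalized entropy.

    Assumption: a more concentrated map signals a more informative scale.
    """
    return min(attn_by_scale, key=lambda s: normalized_entropy(attn_by_scale[s]))

if __name__ == "__main__":
    maps = {
        0.5: [[0.25, 0.25], [0.25, 0.25]],   # diffuse attention
        1.0: [[0.97, 0.01], [0.01, 0.01]],   # focused attention
    }
    print(select_scale(maps))
```

In a full pipeline the selected scale would determine the resolution at which an image region is sent to the black-box API for pseudo-labelling.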

[184] Learning Yourself: Class-Incremental Semantic Segmentation with Language-Inspired Bootstrapped Disentanglement

Ruitao Wu,Yifan Zhao,Jia Li

Main category: cs.CV

TL;DR: The paper introduces a new framework called LBD to solve the catastrophic semantic entanglement problem in incremental semantic segmentation by leveraging language models, achieving excellent results on benchmark datasets.

Details

Motivation: The motivation is to overcome the challenge of catastrophic semantic entanglement in Class-Incremental Semantic Segmentation, where existing methods suffer from noise and errors due to semantic misalignment and dynamic data evolution. Method: The method involves a Language-inspired Bootstrapped Disentanglement framework that uses Language-guided Prototypical Disentanglement and Manifold Mutual Background Disentanglement, along with soft prompt tuning and encoder adaptation to improve performance. Result: The framework achieves state-of-the-art performance on Pascal VOC and ADE20k datasets, particularly in multi-step scenarios. Conclusion: The proposed LBD framework effectively addresses the catastrophic semantic entanglement in CISS by leveraging pre-trained visual-language models and achieves state-of-the-art performance. Abstract: Class-Incremental Semantic Segmentation (CISS) requires continuous learning of newly introduced classes while retaining knowledge of past classes. By abstracting mainstream methods into two stages (visual feature extraction and prototype-feature matching), we identify a more fundamental challenge termed catastrophic semantic entanglement. This phenomenon involves Prototype-Feature Entanglement caused by semantic misalignment during the incremental process, and Background-Increment Entanglement due to dynamic data evolution. Existing techniques, which rely on visual feature learning without sufficient cues to distinguish targets, introduce significant noise and errors. To address these issues, we introduce a Language-inspired Bootstrapped Disentanglement framework (LBD). We leverage the prior class semantics of pre-trained visual-language models (e.g., CLIP) to guide the model in autonomously disentangling features through Language-guided Prototypical Disentanglement and Manifold Mutual Background Disentanglement. 
The former guides the disentangling of new prototypes by treating hand-crafted text features as topological templates, while the latter employs multiple learnable prototypes and mask-pooling-based supervision for background-incremental class disentanglement. By incorporating soft prompt tuning and encoder adaptation modifications, we further bridge the capability gap of CLIP between dense and sparse tasks, achieving state-of-the-art performance on both Pascal VOC and ADE20k, particularly in multi-step scenarios.

[185] A Modality-agnostic Multi-task Foundation Model for Human Brain Imaging

Peirong Liu,Oula Puonti,Xiaoling Hu,Karthik Gopinath,Annabel Sorby-Adams,Daniel C. Alexander,W. Taylor Kimberly,Juan E. Iglesias

Main category: cs.CV

TL;DR: BrainFM is a versatile and robust vision foundation model for brain imaging that addresses the limitations of previous approaches in handling uncalibrated modalities.

Details

Motivation: Recent learning-based approaches struggle to generalize in uncalibrated modalities like magnetic resonance (MR) imaging, which limits their applicability to diverse clinical protocols. Method: BrainFM utilizes a 'mild-to-severe' intra-subject generation and 'real-synth' mix-up training strategy to be resilient to variations in image appearance. Result: BrainFM can be directly applied to five fundamental brain imaging tasks and shows robustness and effectiveness across all tasks and input modalities, evaluated on eleven public datasets. Conclusion: BrainFM is a modality-agnostic, multi-task vision foundation model for human brain imaging that demonstrates robustness and effectiveness across various tasks and input modalities. Abstract: Recent learning-based approaches have made astonishing advances in calibrated medical imaging like computerized tomography (CT), yet they struggle to generalize in uncalibrated modalities -- notably magnetic resonance (MR) imaging, where performance is highly sensitive to the differences in MR contrast, resolution, and orientation. This prevents broad applicability to diverse real-world clinical protocols. Here we introduce BrainFM, a modality-agnostic, multi-task vision foundation model for human brain imaging. With the proposed "mild-to-severe" intra-subject generation and "real-synth" mix-up training strategy, BrainFM is resilient to the appearance of acquired images (e.g., modality, contrast, deformation, resolution, artifacts), and can be directly applied to five fundamental brain imaging tasks, including image synthesis for CT and T1w/T2w/FLAIR MRI, anatomy segmentation, scalp-to-cortical distance, bias field estimation, and registration. We evaluate the efficacy of BrainFM on eleven public datasets, and demonstrate its robustness and effectiveness across all tasks and input modalities. Code is available at https://github.com/jhuldr/BrainFM.

[186] C-DiffDet+: Fusing Global Scene Context with Generative Denoising for High-Fidelity Object Detection

Abdellah Zakaria Sellam,Ilyes Benaissa,Salah Eddine Bekhouche,Abdenour Hadid,Vito Renó,Cosimo Distante

Main category: cs.CV

TL;DR: This paper proposes Context-Aware Fusion (CAF) to enhance fine-grained object detection by integrating global scene context with local features, achieving superior performance on the CarDD benchmark for vehicle damage assessment.

Details

Motivation: The motivation behind this research is the challenge of fine-grained object detection in complex visual domains like vehicle damage assessment, where existing methods like DiffusionDet are limited by local feature conditioning. Method: The paper introduces Context-Aware Fusion (CAF), which utilizes cross-attention mechanisms to integrate global scene context with local proposal features. This is achieved through a dedicated encoder that captures comprehensive environmental information, allowing each object proposal to attend to scene-level understanding. Result: Experimental results show that the proposed framework improves performance over state-of-the-art models on the CarDD benchmark, setting new performance standards for context-aware object detection in fine-grained domains. Conclusion: The paper concludes that their proposed Context-Aware Fusion (CAF) approach significantly enhances the generative detection paradigm for fine-grained object detection in challenging visual domains, particularly demonstrating improved performance on the CarDD benchmark for vehicle damage assessment. Abstract: Fine-grained object detection in challenging visual domains, such as vehicle damage assessment, presents a formidable challenge even for human experts to resolve reliably. While DiffusionDet has advanced the state-of-the-art through conditional denoising diffusion, its performance remains limited by local feature conditioning in context-dependent scenarios. We address this fundamental limitation by introducing Context-Aware Fusion (CAF), which leverages cross-attention mechanisms to integrate global scene context with local proposal features directly. The global context is generated using a separate dedicated encoder that captures comprehensive environmental information, enabling each object proposal to attend to scene-level understanding. 
Our framework significantly enhances the generative detection paradigm by enabling each object proposal to attend to comprehensive environmental information. Experimental results demonstrate an improvement over state-of-the-art models on the CarDD benchmark, establishing new performance benchmarks for context-aware object detection in fine-grained domains.

[187] DGL-RSIS: Decoupling Global Spatial Context and Local Class Semantics for Training-Free Remote Sensing Image Segmentation

Boyi Li,Ce Zhang,Richard M. Timmerman,Wenxuan Bao

Main category: cs.CV

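TL;DR: This paper proposes DGL-RSIS, a training-free framework that decouples visual and textual inputs and performs visual-language alignment at both the local semantic and global contextual levels, improving remote sensing image segmentation.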

Details

Motivation: Transferring vision language models (VLMs) from the natural image domain to remote sensing segmentation remains challenging because of the limited category diversity in remote sensing datasets and the domain gap between natural and remote sensing imagery. Method: DGL-RSIS comprises three main steps: (1) text inputs are split into local class nouns and global modifiers using natural language processing techniques, and image inputs are partitioned into class-agnostic mask proposals by unsupervised mask proposal networks; (2) a context-aware cropping strategy extracts image patches with proper boundaries, and remote-sensing-specific knowledge is introduced to enrich the text features; (3) a Cross-Scale Grad-CAM module refines Grad-CAM maps using contextual information from the global modifiers, and a mask selection module integrates pixel-level activations into the mask-level segmentation output. Result: DGL-RSIS achieves accurate and interpretable visual-language alignment across global and local dimensions on open-vocabulary semantic segmentation and referring expression segmentation tasks. Conclusion: DGL-RSIS achieves accurate and interpretable visual-language alignment for remote sensing image segmentation and supports both open-vocabulary semantic segmentation and referring expression segmentation. Abstract: The emergence of vision language models (VLMs) has bridged vision and language, enabling joint multimodal understanding beyond traditional visual-only deep learning models. However, transferring VLMs from the natural image domain to remote sensing (RS) segmentation remains challenging due to the limited category diversity in RS datasets and the domain gap between natural and RS imagery. Here, we propose a training-free framework, DGL-RSIS, that decouples visual and textual inputs, performing visual-language alignment at both the local semantic and global contextual levels through tailored strategies. Specifically, we first introduce a global-local decoupling (GLD) module, where text inputs are divided into local class nouns and global modifiers using natural language processing (NLP) techniques; image inputs are partitioned into a set of class-agnostic mask proposals via unsupervised mask proposal networks. Second, visual and textual features are aligned at the local scale through a novel context-aware cropping strategy for extracting image patches with proper boundaries and introducing RS-specific knowledge to enrich the text inputs. By matching the enhanced text features with mask-guided visual features, we enable the mask classification, supporting open-vocabulary semantic segmentation (OVSS). Third, at the global scale, we propose a Cross-Scale Grad-CAM module to refine Grad-CAM maps using contextual information from global modifiers.
A subsequent mask selection module integrates pixel-level Grad-CAM activations into the mask-level segmentation output, such that accurate and interpretable alignment can be realized across global and local dimensions for referring expression segmentation (RES).

[188] Towards Methane Detection Onboard Satellites

Maggie Chen,Hala Lambdouar,Luca Marini,Laura Martínez-Ferrer,Chris Bridges,Giacomo Acciarini

Main category: cs.CV

TL;DR: This paper introduces UnorthoDOS, a novel approach for methane detection using ML on unorthorectified satellite data, achieving comparable performance to models using traditional preprocessing methods.

Details

Motivation: Methane is a potent greenhouse gas, and its timely detection is crucial for climate change mitigation. Machine learning onboard satellites can enable rapid detection while reducing costs and supporting faster response systems. Method: The study introduces a new approach called UnorthoDOS that uses unorthorectified data to bypass conventional preprocessing steps. ML models are trained on both orthorectified and unorthorectified datasets and their performance is compared. Result: ML models trained on unorthorectified data show comparable performance to those trained on orthorectified data. Models trained on orthorectified data outperform the matched filter baseline (mag1c). Conclusion: ML models trained on unorthorectified data can achieve performance comparable to those trained on orthorectified data, and models trained on orthorectified data can outperform the matched filter baseline. Abstract: Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using unorthorectified data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. We also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). 
We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS , along with code at https://github.com/spaceml-org/plume-hunter.
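
For readers unfamiliar with the matched filter baseline mentioned above, the following is a minimal sketch of the classical per-pixel matched filter. It is simplified by assuming a diagonal background covariance so that it needs only the standard library. It is not the mag1c implementation, which adds refinements not shown here. The function name and all spectra are illustrative.

```python
def matched_filter_score(pixel, mean, variance, target):
    """Classical matched filter with a diagonal background covariance.

    score = sum((x - mu) * t / var) / sum(t * t / var)
    pixel, mean, variance and target are per-band lists of equal length.
    The diagonal-covariance simplification is an assumption of this sketch.
    """
    numerator = sum((x - m) * t / v
                    for x, m, v, t in zip(pixel, mean, variance, target))
    denominator = sum(t * t / v for t, v in zip(target, variance))
    return numerator / denominator


background_mean = [1.0, 2.0, 3.0]
background_var = [0.1, 0.2, 0.1]
target_signature = [0.0, -1.0, -2.0]   # hypothetical absorption signature
# A pixel equal to the background plus half of the target signature.
pixel = [1.0, 1.5, 2.0]
score = matched_filter_score(pixel, background_mean, background_var,
                             target_signature)
```

With this normalization, a pixel equal to the background mean plus a multiple of the target signature returns that multiple, which is why the output can be read as an enhancement strength.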

[189] MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation

Aviral Chharia,Wenbo Gou,Haoye Dong

Main category: cs.CV

TL;DR: This paper introduces MV-SSM, a novel framework for multi-view 3D human pose estimation that improves spatial modeling and generalization performance across new camera configurations.

Details

Motivation: Existing attention-based transformers struggle with spatial modeling of keypoints and overfitting to specific camera setups, leading to poor generalization in new environments. Method: The paper proposes a Multi-View State Space Modeling framework named MV-SSM, which includes a Projective State Space (PSS) block and a Grid Token-guided Bidirectional Scanning (GTBS) mechanism to model joint spatial arrangements at both feature and keypoint levels. Result: MV-SSM achieved significant improvements: +10.8 on AP25 (+24%) in a three-camera setting on CMU Panoptic, +7.0 on AP25 (+13%) across varying camera arrangements, and +15.3 PCP (+38%) on Campus A1 in cross-dataset evaluations. Conclusion: MV-SSM is a robust framework for multi-view 3D human pose estimation that generalizes well to new camera configurations and outperforms state-of-the-art methods. Abstract: While significant progress has been made in single-view 3D human pose estimation, multi-view 3D human pose estimation remains challenging, particularly in terms of generalizing to new camera configurations. Existing attention-based transformers often struggle to accurately model the spatial arrangement of keypoints, especially in occluded scenarios. Additionally, they tend to overfit specific camera arrangements and visual scenes from training data, resulting in substantial performance drops in new settings. In this study, we introduce a novel Multi-View State Space Modeling framework, named MV-SSM, for robustly estimating 3D human keypoints. We explicitly model the joint spatial sequence at two distinct levels: the feature level from multi-view images and the person keypoint level. We propose a Projective State Space (PSS) block to learn a generalized representation of joint spatial arrangements using state space modeling. Moreover, we modify Mamba's traditional scanning into an effective Grid Token-guided Bidirectional Scanning (GTBS), which is integral to the PSS block. 
Multiple experiments demonstrate that MV-SSM achieves strong generalization, outperforming state-of-the-art methods: +10.8 on AP25 (+24%) on the challenging three-camera setting in CMU Panoptic, +7.0 on AP25 (+13%) on varying camera arrangements, and +15.3 PCP (+38%) on Campus A1 in cross-dataset evaluations. Project Website: https://aviralchharia.github.io/MV-SSM

[190] Face4FairShifts: A Large Image Benchmark for Fairness and Robust Learning across Visual Domains

Yumeng Lin,Dong Li,Xintao Wu,Minglai Shao,Xujiang Zhao,Zhong Chen,Chen Zhao

Main category: cs.CV

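TL;DR: This paper introduces Face4FairShifts, a large-scale facial image benchmark designed to systematically evaluate fairness-aware learning and domain generalization.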

Details

Motivation: Ensuring fairness and robustness in machine learning models remains a challenge, particularly under domain shifts. Method: The paper presents Face4FairShifts, a large-scale facial image benchmark of 100,000 images spanning four visually distinct domains, with 39 annotations within 14 attributes. Result: Extensive experiments analyze model performance under distribution shifts and identify significant gaps. Conclusion: Face4FairShifts provides a comprehensive testbed for advancing equitable and reliable AI systems. Abstract: Ensuring fairness and robustness in machine learning models remains a challenge, particularly under domain shifts. We present Face4FairShifts, a large-scale facial image benchmark designed to systematically evaluate fairness-aware learning and domain generalization. The dataset includes 100,000 images across four visually distinct domains with 39 annotations within 14 attributes covering demographic and facial features. Through extensive experiments, we analyze model performance under distribution shifts and identify significant gaps. Our findings emphasize the limitations of existing related datasets and the need for more effective fairness-aware domain adaptation techniques. Face4FairShifts provides a comprehensive testbed for advancing equitable and reliable AI systems. The dataset is available online at https://meviuslab.github.io/Face4FairShifts/.

[191] Automatic Identification and Description of Jewelry Through Computer Vision and Neural Networks for Translators and Interpreters

Jose Manuel Alcalde-Llergo,Aurora Ruiz-Mezcua,Rocio Avila-Ramirez,Andrea Zingoni,Juri Taborri,Enrique Yeguas-Bolivar

Main category: cs.CV

TL;DR: This study presents an automated jewelry identification and description system using neural networks, achieving high accuracy and assisting non-experts like translators in understanding jewelry.

Details

Motivation: The motivation stems from the challenge of precisely describing jewelry due to diverse styles, which is typically limited to experts, while translators and interpreters also require such knowledge. Method: The research employed neural networks, specifically image captioning architectures, including encoder-decoder models, to identify and describe jewelry at three hierarchical levels of detail. Result: The model effectively recognizes various types of jewelry and generates natural language descriptions with high accuracy, aiding non-experts in understanding jewelry pieces. Conclusion: The study successfully developed a neural network model that accurately identifies and describes jewelry pieces, achieving over 90% captioning accuracy. Abstract: Identifying jewelry pieces presents a significant challenge due to the wide range of styles and designs. Currently, precise descriptions are typically limited to industry experts. However, translators and interpreters often require a comprehensive understanding of these items. In this study, we introduce an innovative approach to automatically identify and describe jewelry using neural networks. This method enables translators and interpreters to quickly access accurate information, aiding in resolving queries and gaining essential knowledge about jewelry. Our model operates at three distinct levels of description, employing computer vision techniques and image captioning to emulate expert analysis of accessories. The key innovation involves generating natural language descriptions of jewelry across three hierarchical levels, capturing nuanced details of each piece. Different image captioning architectures are utilized to detect jewels in images and generate descriptions with varying levels of detail. To demonstrate the effectiveness of our approach in recognizing diverse types of jewelry, we assembled a comprehensive database of accessory images. 
The evaluation process involved comparing various image captioning architectures, focusing particularly on the encoder-decoder model, which is crucial for generating descriptive captions. After thorough evaluation, our final model achieved a captioning accuracy exceeding 90 per cent.

[192] Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model

Yifei She,Huangxuan Wu

Main category: cs.CV

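TL;DR: This paper proposes FtZ, a new vision tower framework that composes two different types of vision encoders to address the visual perception bottleneck in current multimodal large language models, and shows better performance than existing methods on several challenging benchmarks.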

Details

Motivation: Existing multimodal large language models excel at complex semantic understanding but often fail at basic visual tasks that require precise detail perception. This is because they commonly rely on a single vision encoder optimized for high-level semantic alignment, which sacrifices the ability to capture fine-grained visual information. Method: FtZ uses a lightweight Multi-Head Cross-Attention mechanism to compose two different types of vision encoders. Result: Experiments show that FtZ significantly outperforms baselines that use only a single encoder or existing feature fusion methods on several challenging benchmarks requiring fine-grained visual understanding. Conclusion: The paper proposes FtZ, a new vision tower framework that addresses the visual perception bottleneck in current multimodal large language models by composing a semantically powerful anchor encoder with a perception-rich augmenting encoder. Abstract: Multimodal Large Language Models (MLLMs) have made significant progress in bridging visual perception with high-level textual reasoning. However, they face a fundamental contradiction: while excelling at complex semantic understanding, these models often fail at basic visual tasks that require precise detail perception. This deficiency primarily stems from the prevalent architectural reliance on a single vision encoder optimized for high-level semantic alignment, which inherently sacrifices the ability to capture fine-grained visual information. To address this issue, we introduce Fusion to Enhance (FtZ), a novel vision tower framework. FtZ moves beyond the single-encoder design by innovatively composing a semantically powerful anchor encoder with a perception-rich augmenting encoder via a lightweight Multi-Head Cross-Attention mechanism. Experimental results demonstrate that on several challenging benchmarks demanding fine-grained visual understanding, such as TextVQA, POPE, MMMU, MME and MM-Vet, our FtZ model significantly outperforms baselines that use only a single encoder or existing feature fusion methods. This work proves that composing heterogeneous expert encoders is an efficient and effective path to overcoming the visual perception bottleneck in current MLLMs, offering a new design paradigm for building next-generation AI systems with stronger perceptual capabilities.
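
To make the fusion idea concrete, here is a minimal sketch of cross-attention between two token sets. It makes several assumptions the abstract does not confirm: a single head instead of multiple heads, identity projections instead of learned query/key/value weights, anchor tokens acting as queries over the augmenting tokens, and a residual connection. The name `cross_attention_fuse` is illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(anchor_tokens, augment_tokens):
    """Single-head cross-attention with identity projections (a simplification).

    Each anchor token queries all augmenting tokens; the attended result is
    added back to the anchor token, so the output keeps the anchor's shape.
    """
    dim = len(anchor_tokens[0])
    fused = []
    for query in anchor_tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in augment_tokens]
        weights = softmax(scores)
        attended = [sum(w * value[i] for w, value in zip(weights, augment_tokens))
                    for i in range(dim)]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


anchor = [[1.0, 0.0], [0.0, 2.0]]     # tokens from the anchor encoder
augment = [[0.5, 0.5]]                # tokens from the augmenting encoder
fused = cross_attention_fuse(anchor, augment)
```

Because the output has one row per anchor token, the fused sequence can replace the anchor encoder's output without changing the number of visual tokens passed downstream.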

[193] ER-LoRA: Effective-Rank Guided Adaptation for Weather-Generalized Depth Estimation

Weilong Yan,Xin Zhang,Robby T. Tan

Main category: cs.CV

TL;DR: This paper proposes STM, a Parameter-Efficient Fine-Tuning strategy for Vision Foundation Models, achieving superior performance in monocular depth estimation under adverse weather conditions.

Details

Motivation: Monocular depth estimation in adverse weather is challenging due to unreliable ground truth and domain gaps in synthetic data. Method: The STM strategy decomposes pretrained weights using entropy-rank and stable-rank to balance adaptation and knowledge preservation. Result: STM outperforms PEFT methods, full fine-tuning, and synthetic data-trained methods on four real-world benchmarks. Conclusion: The proposed Selecting-Tuning-Maintaining (STM) strategy effectively adapts Vision Foundation Models (VFMs) for weather-generalized depth estimation, outperforming existing methods. Abstract: Monocular depth estimation under adverse weather conditions (e.g. rain, fog, snow, and nighttime) remains highly challenging due to the lack of reliable ground truth and the difficulty of learning from unlabeled real-world data. Existing methods often rely on synthetic adverse data with pseudo-labels, which suffer from domain gaps, or employ self-supervised learning, which violates photometric assumptions in adverse scenarios. In this work, we propose to achieve weather-generalized depth estimation by Parameter-Efficient Fine-Tuning (PEFT) of Vision Foundation Models (VFMs), using only a small amount of high-visibility (normal) data. While PEFT has shown strong performance in semantic tasks such as segmentation, it remains underexplored for geometry-centric tasks like depth estimation, especially in terms of balancing effective adaptation with the preservation of pretrained knowledge. To this end, we introduce the Selecting-Tuning-Maintaining (STM) strategy, which structurally decomposes the pretrained weights of VFMs based on two kinds of effective ranks (entropy-rank and stable-rank). 
In the tuning phase, we adaptively select the proper rank number as well as the task-aware singular directions for initialization, based on the entropy-rank and full-tuned weight; while in the maintaining stage, we enforce a principal direction regularization based on the stable-rank. This design guarantees flexible task adaptation while preserving the strong generalization capability of the pretrained VFM. Extensive experiments on four real-world benchmarks across diverse weather conditions demonstrate that STM not only outperforms existing PEFT methods and full fine-tuning but also surpasses methods trained with adverse synthetic data, and even the depth foundation model
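
The abstract names two effective ranks without defining them. The sketch below uses the common textbook definitions, which are assumed here and may differ from the paper's: the stable rank is the squared Frobenius norm divided by the squared spectral norm, and the entropy-based effective rank is the exponential of the Shannon entropy of the normalized singular values. Both functions take the singular values of a weight matrix as input.

```python
import math


def stable_rank(singular_values):
    """Squared Frobenius norm over squared spectral norm."""
    squares = [s * s for s in singular_values]
    return sum(squares) / max(squares)


def entropy_rank(singular_values):
    """exp of the Shannon entropy of the normalized singular values."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


flat_spectrum = [1.0, 1.0, 1.0, 1.0]    # all directions equally important
peaked_spectrum = [1.0, 0.0, 0.0]       # a single dominant direction
```

Both measures equal the matrix dimension for a flat spectrum and fall to 1 when a single direction dominates, which is what makes them usable as a guide for choosing a low-rank adapter size.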

[194] LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Xiyao Wang,Chunyuan Li,Jianwei Yang,Kai Zhang,Bo Liu,Tianyi Xiong,Furong Huang

Main category: cs.CV

TL;DR: This study challenges the separation of critic and policy models in vision-language modeling by training a multimodal critic to optimize preference judgments while retaining full generation ability, leading to a unified model that excels both in evaluation and generation tasks.

Details

Motivation: The motivation behind this study is to challenge the convention in vision-language modeling where critic models are separate from policy models, aiming to unify these roles for improved performance. Method: The researchers reorganized preference-labeled critic datasets into verifiable training signals and conducted reinforcement learning directly on a base generative model to produce LLaVA-Critic-R1, which was further extended to LLaVA-Critic-R1+. Result: LLaVA-Critic-R1 emerged not only as a top-performing critic but also as a competitive policy model, achieving significant gains across multiple benchmarks. The extended model, LLaVA-Critic-R1+, achieved SoTA performance on MMMU at the 7B scale, and self-critique during inference improved performance on reasoning tasks. Conclusion: The study concludes that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems. Abstract: In vision-language modeling, critic models are typically trained to evaluate outputs -- assigning scalar scores or pairwise preferences -- rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model -- matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). 
Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.
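
One way to read "verifiable training signals" is a binary reward that checks whether the model's stated preference matches the human label. The sketch below assumes the model ends its response with a line such as `Final choice: A`. That output format, the function name, and the 0/1 reward are hypothetical, not taken from the paper.

```python
import re


def preference_reward(model_output, preferred):
    """Return 1.0 if the stated choice matches the labeled preference, else 0.0.

    Assumes a hypothetical output format containing 'Final choice: A' or
    'Final choice: B'; unparseable outputs receive zero reward.
    """
    match = re.search(r"Final choice:\s*([AB])\b", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == preferred else 0.0


response = "Response A grounds its claims in the image. Final choice: A"
reward = preference_reward(response, preferred="A")
```

A reward of this kind can be checked automatically against the dataset label, which is the property reinforcement learning with verifiable rewards depends on.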

[195] CSFMamba: Cross State Fusion Mamba Operator for Multimodal Remote Sensing Image Classification

Qingyu Wang,Xue Jiang,Guozheng Xu

Main category: cs.CV

TL;DR: This paper proposes CSFMamba, a remote sensing image classification method that combines CNN and Mamba to achieve efficient multimodal fusion.

Details

Motivation: Prior methods based on CNN and Transformer suffer from high computational complexity when modeling long-range dependencies in spatial-spectral features. Mamba offers lower computational burden but lacks direct feature fusion capability. Method: The authors design a preprocessing module for Mamba, integrate it with CNN for multi-layer feature extraction, and propose a cross-state fusion module based on the Mamba operator. Result: Experiments on MUUFL and Houston2018 datasets show that the proposed method outperforms Transformer-based approaches with reduced training burden. Conclusion: The proposed CSFMamba Network combines the advantages of Mamba and CNN to achieve efficient and effective multimodal fusion for remote sensing image classification. Abstract: Multimodal fusion has made great progress in the field of remote sensing image classification due to its ability to exploit complementary spatial-spectral information. Deep learning methods such as CNN and Transformer have been widely used in this domain. Recent work on State Space Models (SSMs) has highlighted that these prior methods suffer from quadratic computational complexity. As a result, modeling longer-range dependencies of spatial-spectral features imposes an overwhelming burden on the network. Mamba solves this problem by incorporating time-varying parameters into the ordinary SSM and performing hardware optimization, but it cannot perform feature fusion directly. In order to make full use of Mamba's low computational burden and explore the potential of its internal structure in multimodal feature fusion, we propose the Cross State Fusion Mamba (CSFMamba) Network. Specifically, we first design a preprocessing module for remote sensing image information that suits the needs of the Mamba structure, and combine it with CNN to extract multi-layer features. Secondly, a cross-state module based on the Mamba operator is designed to fully fuse the features of the two modalities. 
The advantages of Mamba and CNN are combined by designing a more powerful backbone. We capture the fusion relationship between HSI and LiDAR modalities with stronger full-image understanding. Experimental results on two datasets, MUUFL and Houston2018, show that the proposed method outperforms Transformer-based results while reducing the network training burden.

[196] CascadeFormer: A Family of Two-stage Cascading Transformers for Skeleton-based Human Action Recognition

Yusen Peng,Alper Yilmaz

Main category: cs.CV

TL;DR: This paper proposes CascadeFormer, a two-stage transformer-based model for skeleton-based human action recognition that performs well on multiple datasets.

Details

Motivation: Although Graph Convolutional Networks (GCNs) dominate skeleton data processing, recent advances in transformer models and masked pretraining frameworks open new avenues for representation learning. Method: The paper proposes CascadeFormer, a two-stage cascading transformer model consisting of a masked pretraining stage and a cascading fine-tuning stage. Result: Evaluated on three benchmark datasets (Penn Action, N-UCLA, and NTU RGB+D 60), the model achieves competitive results on all of them. Conclusion: CascadeFormer performs skeleton-based human action recognition with a two-stage cascading transformer model and achieves competitive performance on three benchmark datasets. Abstract: Skeleton-based human action recognition leverages sequences of human joint coordinates to identify actions performed in videos. Owing to the intrinsic spatiotemporal structure of skeleton data, Graph Convolutional Networks (GCNs) have been the dominant architecture in this field. However, recent advances in transformer models and masked pretraining frameworks open new avenues for representation learning. In this work, we propose CascadeFormer, a family of two-stage cascading transformers for skeleton-based human action recognition. Our framework consists of a masked pretraining stage to learn generalizable skeleton representations, followed by a cascading fine-tuning stage tailored for discriminative action classification. We evaluate CascadeFormer across three benchmark datasets (Penn Action, N-UCLA, and NTU RGB+D 60), achieving competitive performance on all tasks. To promote reproducibility, we release our code and model checkpoints.

[197] Prompt the Unseen: Evaluating Visual-Language Alignment Beyond Supervision

Raehyuk Jung,Seungjun Yu,Hyunjung Shim

Main category: cs.CV

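TL;DR: This paper introduces a new evaluation framework for testing the generalization ability of the projection layer in vision-language models, and finds that the projection layer shows non-trivial generalization when handling unseen concepts.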

Details

Motivation: Although the projection layer in vision-language models is essential for mapping visual features into the language model's embedding space, its ability to generalize to unseen visual concepts has not been systematically evaluated. Method: The paper proposes a benchmark for evaluating projection-layer generalization, adapting object detection datasets into a prompting format and designing train/test splits with disjoint label sets. Result: Experiments show that, even without alignment supervision, the projection layer maintains a high level of performance on unseen classes across various settings. Conclusion: The study finds that the projection layer retains about 79 to 88 percent of its performance on unseen classes, indicating a degree of generalization, and that the projection layer behaves like a key-value memory, processing seen and unseen tokens in similar ways. Abstract: Vision-Language Models (VLMs) combine a vision encoder and a large language model (LLM) through alignment training, showing strong performance on multimodal tasks. A central component in this architecture is the projection layer, which maps visual features into the LLM's embedding space. Despite its importance, its ability to generalize to unseen visual concepts has not been systematically evaluated. To address this, we propose a benchmark for evaluating projection-layer generalization. We adapt object detection datasets (rich in fine-grained annotations) into a prompting format and design train/test splits with disjoint label sets, enabling precise control over seen and unseen concept separation. Experimental results show that the projection layer retains about 79 to 88 percent of the performance on unseen classes compared to seen ones across various settings, suggesting a non-trivial level of generalization even without explicit alignment supervision on those concepts. We further analyze this behavior through a mechanistic interpretability lens. Our findings indicate that the feed-forward network in the projection layer functions like a key-value memory, processing seen and unseen tokens in similar ways. This study introduces a new evaluation framework for alignment generalization and highlights the potential for efficient VLM training with limited aligned data.
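
The split design and the "retains about 79 to 88 percent" figure can be illustrated with a small sketch. It assumes each sample carries a single label (real detection images often carry several) and that retained performance is simply the unseen-class score divided by the seen-class score. The function names, field names, and scores are illustrative.

```python
def split_by_label(samples, unseen_labels):
    """Partition samples so that seen and unseen label sets are disjoint."""
    unseen = set(unseen_labels)
    seen_split = [s for s in samples if s["label"] not in unseen]
    unseen_split = [s for s in samples if s["label"] in unseen]
    return seen_split, unseen_split


def retained_performance(seen_score, unseen_score):
    """Fraction of seen-class performance kept on unseen classes."""
    return unseen_score / seen_score


samples = [
    {"image": "img_0", "label": "dog"},
    {"image": "img_1", "label": "zebra"},
    {"image": "img_2", "label": "cat"},
    {"image": "img_3", "label": "zebra"},
]
seen_split, unseen_split = split_by_label(samples, unseen_labels=["zebra"])
ratio = retained_performance(seen_score=50.0, unseen_score=39.5)  # made-up scores
```

Alignment training would use only `seen_split`, so any score on `unseen_split` reflects generalization of the projection layer and not memorized label pairs.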

[198] Enhancing Fairness in Skin Lesion Classification for Medical Diagnosis Using Prune Learning

Kuniko Paxton,Koorosh Aslansefat,Dhavalkumar Thakker,Yiannis Papadopoulos,Tanaya Maslekar

Main category: cs.CV

TL;DR: To address diagnostic fairness across different skin tones, the authors propose a new fairness algorithm.

Details

Motivation: Recent advances in deep learning have significantly improved the accuracy of skin lesion classification models, but concerns remain about potential biases related to skin color, which can affect diagnostic outcomes. Method: By calculating the skewness of feature maps in the convolutional layers of the VGG (Visual Geometry Group) network and in the patches and heads of the Vision Transformer, the method reduces unnecessary channels related to skin tone and focuses instead on the lesion area. Result: The approach lowers computational costs and mitigates bias without relying on conventional statistical methods. Conclusion: The paper proposes a fairness algorithm for skin lesion classification that calculates the skewness of feature maps in the convolutional layers of the VGG network and in the patches and heads of the Vision Transformer, reducing unnecessary channels related to skin tone and focusing instead on the lesion area. Abstract: Recent advances in deep learning have significantly improved the accuracy of skin lesion classification models, supporting medical diagnoses and promoting equitable healthcare. However, concerns remain about potential biases related to skin color, which can impact diagnostic outcomes. Ensuring fairness is challenging due to difficulties in classifying skin tones, high computational demands, and the complexity of objectively verifying fairness. To address these challenges, we propose a fairness algorithm for skin lesion classification that overcomes the challenges associated with achieving diagnostic fairness across varying skin tones. By calculating the skewness of the feature map in the convolution layer of the VGG (Visual Geometry Group) network and the patches and the heads of the Vision Transformer, our method reduces unnecessary channels related to skin tone, focusing instead on the lesion area. This approach lowers computational costs and mitigates bias without relying on conventional statistical methods. It potentially reduces model size while maintaining fairness, making it more practical for real-world applications.
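
The abstract does not state how skewness is turned into a pruning decision. As a minimal sketch, the code below computes the standard sample skewness of each channel's activations and ranks channels by its absolute value. The ranking step and the function names are assumptions for illustration; which end of the ranking would be pruned is left open.

```python
def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    second = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    if second == 0:
        return 0.0  # constant channel: define skewness as zero
    return third / second ** 1.5


def rank_channels_by_skewness(feature_maps):
    """Return channel indices sorted by |skewness|, most skewed first.

    feature_maps is a list of 2D lists, one per channel. Ranking by absolute
    skewness is an illustrative choice, not the paper's stated criterion.
    """
    scores = []
    for index, channel in enumerate(feature_maps):
        flat = [value for row in channel for value in row]
        scores.append((abs(skewness(flat)), index))
    return [index for _, index in sorted(scores, key=lambda p: (-p[0], p[1]))]


feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],    # symmetric activations, skewness 0
    [[0.0, 0.0], [0.0, 10.0]],   # one strong localized response
]
order = rank_channels_by_skewness(feature_maps)
```

A channel that responds strongly in a small region has a highly skewed activation distribution, while a channel responding evenly across the image has skewness near zero.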

[199] Causal Interpretation of Sparse Autoencoder Features in Vision

Sangyu Han,Yearim Kim,Nojun Kwak

Main category: cs.CV

TL;DR: The paper proposes CaFE, a method for more accurate and semantically meaningful interpretation of SAE features in vision transformers by identifying causal image regions through Effective Receptive Fields, overcoming limitations of traditional activation-based approaches.

Details

Motivation: The motivation is to address the limitation of traditional methods that inspect highly activated patches to understand SAE features in vision transformers, as self-attention mixes information across the entire image, making activated patches not necessarily causal. Method: The authors propose Causal Feature Explanation (CaFE), which uses Effective Receptive Field (ERF) to identify image patches that causally drive SAE feature activations. They compare ERF maps with naive activation maps and validate their approach through patch insertion tests. Result: ERF maps often differ from naive activation maps, revealing hidden context dependencies in features (e.g., a 'roaring face' feature requiring eyes and nose rather than just an open mouth). Patch insertion tests confirm that CaFE outperforms activation-ranked patches in recovering or suppressing feature activations. Conclusion: CaFE provides more accurate and meaningful explanations of vision-SAE features by identifying causal image patches through ERF, highlighting the limitations of relying solely on activation locations. Abstract: Understanding what sparse auto-encoder (SAE) features in vision transformers truly represent is usually done by inspecting the patches where a feature's activation is highest. However, self-attention mixes information across the entire image, so an activated patch often co-occurs with, but does not cause, the feature's firing. We propose Causal Feature Explanation (CaFE), which leverages Effective Receptive Field (ERF). We consider each activation of an SAE feature to be a target and apply input-attribution methods to identify the image patches that causally drive that activation. Across CLIP-ViT features, ERF maps frequently diverge from naive activation maps, revealing hidden context dependencies (e.g., a "roaring face" feature that requires the co-occurrence of eyes and nose, rather than merely an open mouth). 
Patch insertion tests confirm that CaFE more effectively recovers or suppresses feature activations than activation-ranked patches. Our results show that CaFE yields more faithful and semantically precise explanations of vision-SAE features, highlighting the risk of misinterpretation when relying solely on activation location.

[200] EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions

Dinh-Khoi Vo,Van-Loc Nguyen,Minh-Triet Tran,Trung-Nghia Le

Main category: cs.CV

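TL;DR: This paper proposes a multi-stage retrieval framework that combines dense article retrieval, event-aware language model reranking, and efficient image collection for event-based image retrieval from free-form captions.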

Details

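Motivation: Event-based image retrieval from free-form captions is a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, or temporal context, or contain long, complex narratives. Method: The paper introduces a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. It uses Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further improve performance and robustness, outputs from multiple configurations are fused using Reciprocal Rank Fusion (RRF). Result: The system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. Conclusion: Combining language-based reasoning and multimodal retrieval is effective for complex, real-world image understanding. Abstract: Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.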
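
Reciprocal Rank Fusion itself is simple enough to sketch. Each item receives a score of 1 / (k + rank) from every ranked list it appears in, and the scores are summed. The constant k = 60 is the value commonly used in the literature; the abstract does not say which value or which configurations the authors used, so the inputs below are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists into one using Reciprocal Rank Fusion.

    rankings -- list of lists, each ordered from best to worst
    k        -- smoothing constant; 60 is a conventional default (assumed here)
    Ties are broken alphabetically so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical outputs of three retrieval configurations for one caption.
runs = [
    ["img_a", "img_b", "img_c"],
    ["img_b", "img_a", "img_c"],
    ["img_b", "img_c", "img_a"],
]
fused_ranking = reciprocal_rank_fusion(runs)
```

RRF uses only rank positions, so runs whose raw scores live on different scales (for example a reranker score and a vision-language model score) can be combined without calibration.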

[201] Multi-Level CLS Token Fusion for Contrastive Learning in Endoscopy Image Classification

Y Hop Nguyen,Doan Anh Phan Huu,Trung Thai Tran,Nhat Nam Mai,Van Toi Giap,Thao Thi Phuong Dao,Trung-Nghia Le

Main category: cs.CV

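TL;DR: This paper proposes a unified vision-language framework for ENT endoscopy image analysis that handles three tasks at once: image classification, image-to-image retrieval, and text-to-image retrieval.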

Details

Motivation: Conventional CNN-based methods struggle to capture cross-modal semantics, and medical data is typically limited, so an efficient method with strong multimodal representation ability is needed. Method: The method builds on the CLIP ViT-B/16 backbone and enhances it with Low-Rank Adaptation, multi-level CLS token aggregation, and spherical feature interpolation. It also introduces class-specific natural language prompts that guide the image encoder through a joint training objective combining supervised classification with contrastive learning. Result: Validated in the ACM MM'25 ENTRep Grand Challenge, the framework achieves 95% accuracy and F1-score in image classification, Recall@1 of 0.93 and 0.92 for image-to-image and text-to-image retrieval respectively, and MRR scores of 0.97 and 0.96. Ablation studies confirm the contribution of each component. Conclusion: The proposed framework achieves strong multimodal medical understanding in resource-limited clinical settings, with high performance and good generalization. Abstract: We present a unified vision-language framework tailored for ENT endoscopy image analysis that simultaneously tackles three clinically-relevant tasks: image classification, image-to-image retrieval, and text-to-image retrieval. Unlike conventional CNN-based pipelines that struggle to capture cross-modal semantics, our approach leverages the CLIP ViT-B/16 backbone and enhances it through Low-Rank Adaptation, multi-level CLS token aggregation, and spherical feature interpolation. These components collectively enable efficient fine-tuning on limited medical data while improving representation diversity and semantic alignment across modalities. To bridge the gap between visual inputs and textual diagnostic context, we introduce class-specific natural language prompts that guide the image encoder through a joint training objective combining supervised classification with contrastive learning. We validated our framework through participation in the ACM MM'25 ENTRep Grand Challenge, achieving 95% accuracy and F1-score in classification, Recall@1 of 0.93 and 0.92 for image-to-image and text-to-image retrieval respectively, and MRR scores of 0.97 and 0.96. Ablation studies demonstrated the incremental benefits of each architectural component, validating the effectiveness of our design for robust multimodal medical understanding in low-resource clinical settings.
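
The abstract mentions spherical feature interpolation without detail. A minimal sketch of standard spherical linear interpolation (slerp) between two feature vectors is shown below. How the authors apply it (which features are interpolated and with what weights) is not stated, so this only illustrates the operation itself.

```python
import math


def slerp(u, v, t):
    """Spherical linear interpolation between the directions of u and v.

    Inputs are L2-normalized first, so the result lies on the unit sphere.
    When the vectors are (anti)parallel the great-circle path is undefined
    or degenerate, and the normalized u is returned as a fallback.
    """
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    u = [x / norm_u for x in u]
    v = [x / norm_v for x in v]
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)            # angle between the two directions
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        return list(u)
    weight_u = math.sin((1.0 - t) * omega) / sin_omega
    weight_v = math.sin(t * omega) / sin_omega
    return [weight_u * a + weight_v * b for a, b in zip(u, v)]


midpoint = slerp([2.0, 0.0], [0.0, 3.0], 0.5)
```

Unlike a plain weighted average, slerp keeps the interpolated vector at unit length, which matters for CLIP-style embeddings that are compared by cosine similarity.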

[202] MarkSplatter: Generalizable Watermarking for 3D Gaussian Splatting Model via Splatter Image Structure

Xiufeng Huang,Ziyuan Luo,Qi Song,Ruofei Wang,Renjie Wan

Main category: cs.CV

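TL;DR: This paper proposes a generalizable watermarking framework, built around a module named GaussianBridge, that protects Splatter Image-based 3D Gaussian Splatting models with a single forward pass.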

Details

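Motivation: Current 3DGS watermarking methods rely on computationally expensive fine-tuning, and each predefined message must be processed separately, so a more efficient copyright protection method is needed. Method: The paper proposes GaussianBridge, which transforms unstructured 3D Gaussian data into the Splatter Image format so that arbitrary messages can be embedded through direct neural processing. It also designs a Gaussian-Uncertainty-Perceptual heatmap prediction strategy to preserve visual quality, and a dense segmentation-based extraction mechanism to ensure reliable recovery of the watermark message. Result: The method achieves efficient watermark embedding and extraction, and maintains reliable recovery even when watermarked objects occupy minimal regions in rendered views. Conclusion: GaussianBridge provides an efficient, generalizable, and visually imperceptible watermarking solution for 3D Gaussian Splatting models, with promising application prospects. Abstract: The growing popularity of 3D Gaussian Splatting (3DGS) has intensified the need for effective copyright protection. Current 3DGS watermarking methods rely on computationally expensive fine-tuning procedures for each predefined message. We propose the first generalizable watermarking framework that enables efficient protection of Splatter Image-based 3DGS models through a single forward pass. We introduce GaussianBridge that transforms unstructured 3D Gaussians into Splatter Image format, enabling direct neural processing for arbitrary message embedding. To ensure imperceptibility, we design a Gaussian-Uncertainty-Perceptual heatmap prediction strategy for preserving visual quality. For robust message recovery, we develop a dense segmentation-based extraction mechanism that maintains reliable extraction even when watermarked objects occupy minimal regions in rendered views. Project page: https://kevinhuangxf.github.io/marksplatter.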

[203] No More Sibling Rivalry: Debiasing Human-Object Interaction Detection

Bin Yang,Yulin Zhang,Hong-Yu Zhou,Sibei Yang

Main category: cs.CV

TL;DR: This paper introduces a solution to the 'Toxic Siblings' bias in detection transformers for human-object interaction detection, achieving significant improvements in performance.

Details

Motivation: The study aims to address the 'Toxic Siblings' bias in HOI detection, which hinders the interaction decoder's learning due to interference among similar yet distinct HOI triplets. Method: Two debiasing objectives, 'contrastive-then-calibration' and 'merge-then-split', are proposed to address the input and output perspectives, respectively, by managing sibling-like incorrect HOI triplets and refining intra-group differentiation. Result: The proposed method outperforms the baseline by +9.18% mAP on HICO-Det and the state-of-the-art by +3.59% mAP across various settings. Conclusion: The study successfully addresses the 'Toxic Siblings' bias in detection transformers for HOI detection by introducing two debiasing learning objectives, significantly outperforming the baseline and state-of-the-art methods. Abstract: Detection transformers have been applied to human-object interaction (HOI) detection, enhancing the localization and recognition of human-action-object triplets in images. Despite remarkable progress, this study identifies a critical issue, the "Toxic Siblings" bias, which hinders the interaction decoder's learning, as numerous similar yet distinct HOI triplets interfere with and even compete against each other on both the input side and the output side of the interaction decoder. This bias arises from high confusion among sibling triplets/categories, where increased similarity paradoxically reduces precision, as one's gain comes at the expense of its toxic sibling's decline. To address this, we propose two novel debiasing learning objectives, "contrastive-then-calibration" and "merge-then-split", targeting the input and output perspectives, respectively. The former samples sibling-like incorrect HOI triplets and reconstructs them into correct ones, guided by strong positional priors. The latter first learns shared features among sibling categories to distinguish them from other groups, then explicitly refines intra-group differentiation to preserve uniqueness. Experiments show that we significantly outperform both the baseline (+9.18% mAP on HICO-Det) and the state-of-the-art (+3.59% mAP) across various settings.

[204] InterPose: Learning to Generate Human-Object Interactions from Large-Scale Web Videos

Yangsong Zhang,Abdul Ahad Butt,Gül Varol,Ivan Laptev

Main category: cs.CV

TL;DR: This paper proposes the InterPose dataset, generated through an automatic motion extraction pipeline, to address the lack of diverse datasets for human-object interactions, achieving significant improvements in motion generation and enabling zero-shot animation with an LLM-based agent.

Details

Motivation: Synthesizing realistic human-object interactions in complex 3D scenes remains a critical challenge due to the lack of large-scale datasets with diverse object manipulations. Method: An automatic motion extraction pipeline was proposed to collect interaction-rich human motions, resulting in the creation of the InterPose dataset containing 73.8K sequences of 3D human motions and text captions. Result: The experiments demonstrated that InterPose brings significant improvements to state-of-the-art methods for human motion generation. Conclusion: InterPose enables significant improvements in human motion generation and supports the development of an LLM-based agent for zero-shot animation involving diverse objects and scenes. Abstract: Human motion generation has shown great advances thanks to the recent diffusion models trained on large-scale motion capture data. Most existing works, however, currently target the animation of isolated people in empty scenes. Meanwhile, synthesizing realistic human-object interactions in complex 3D scenes remains a critical challenge in computer graphics and robotics. One obstacle towards generating versatile high-fidelity human-object interactions is the lack of large-scale datasets with diverse object manipulations. Indeed, existing motion capture data is typically restricted to single people and manipulations of limited sets of objects. To address this issue, we propose an automatic motion extraction pipeline and use it to collect interaction-rich human motions. Our new dataset InterPose contains 73.8K sequences of 3D human motions and corresponding text captions automatically obtained from 45.8K videos with human-object interactions. We perform extensive experiments and demonstrate that InterPose brings significant improvements to state-of-the-art methods for human motion generation. Moreover, using InterPose we develop an LLM-based agent enabling zero-shot animation of people interacting with diverse objects and scenes.

[205] Secure and Scalable Face Retrieval via Cancelable Product Quantization

Haomiao Tang,Wenjie Li,Yixiang Qiu,Genping Wang,Shu-Tao Xia

Main category: cs.CV

TL;DR: The paper introduces Cancelable Product Quantization, a secure and efficient face retrieval framework that addresses privacy concerns while maintaining real-world applicability.

Details

Motivation: The paper aims to address the privacy risks that arise when the retrieval stage of a face retrieval system is outsourced, leveraging homomorphic encryption while overcoming its computational inefficiency. Method: The paper proposes a hierarchical two-stage framework, including a cancelable PQ indexing module for fast filtering and a cipher-space retrieval module for precise ranking, combined with a tailored protection mechanism. Result: Experiments show that the proposed method achieves a good balance between retrieval effectiveness, computational efficiency, and security on benchmark datasets. Conclusion: Cancelable Product Quantization is a promising solution for secure and efficient face retrieval, achieving a decent trade-off between accuracy, efficiency, and privacy protection. Abstract: Despite the ubiquity of modern face retrieval systems, their retrieval stage is often outsourced to third-party entities, posing significant risks to user portrait privacy. Although homomorphic encryption (HE) offers strong security guarantees by enabling arithmetic computations in the cipher space, its computational inefficiency makes it unsuitable for real-time, real-world applications. To address this issue, we propose Cancelable Product Quantization, a highly efficient framework for secure face representation retrieval. Our hierarchical two-stage framework comprises: (i) a high-throughput cancelable PQ indexing module for fast candidate filtering, and (ii) a fine-grained cipher-space retrieval module for final precise face ranking. A tailored protection mechanism is designed to secure the indexing module for cancelable biometric authentication while ensuring efficiency. Experiments on benchmark datasets demonstrate that our method achieves a decent balance between effectiveness, efficiency and security.
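
The two-stage structure described above (cheap filtering on product-quantization codes, then precise ranking of a shortlist) can be illustrated with a toy sketch. It makes several assumptions that are not from the paper: the codebooks are given rather than learned, the second stage uses a plain squared distance as a stand-in for cipher-space retrieval, and the cancelable protection mechanism is omitted entirely. All function names are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def pq_encode(vec, codebooks):
    """Encode vec as one nearest-centroid index per sub-vector."""
    d = len(vec) // len(codebooks)
    codes = []
    for i, book in enumerate(codebooks):
        sub = vec[i * d:(i + 1) * d]
        codes.append(min(range(len(book)), key=lambda k: sq_dist(sub, book[k])))
    return codes

def pq_distance(query, codes, codebooks):
    """Asymmetric distance: exact query sub-vectors vs. stored centroids."""
    d = len(query) // len(codebooks)
    return sum(
        sq_dist(query[i * d:(i + 1) * d], codebooks[i][c])
        for i, c in enumerate(codes)
    )

def two_stage_search(query, gallery, codebooks, n_candidates=2):
    # Stage 1: fast candidate filtering on the compact PQ codes.
    coded = [pq_encode(v, codebooks) for v in gallery]
    order = sorted(range(len(gallery)),
                   key=lambda j: pq_distance(query, coded[j], codebooks))
    candidates = order[:n_candidates]
    # Stage 2: precise re-ranking of the shortlist only (plain distance here,
    # standing in for the encrypted comparison described in the abstract).
    return sorted(candidates, key=lambda j: sq_dist(query, gallery[j]))

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # centroids for dimensions 0-1
    [[0.0, 0.0], [1.0, 1.0]],   # centroids for dimensions 2-3
]
gallery = [
    [0.1, 0.0, 0.9, 1.0],
    [1.0, 0.9, 0.1, 0.0],
    [0.0, 0.2, 1.0, 0.7],
    [0.9, 1.0, 1.0, 1.0],
]
ranking = two_stage_search([0.0, 0.1, 1.0, 0.9], gallery, codebooks)
```

The point of the split is that the expensive comparison runs only on `n_candidates` items instead of the whole gallery.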

[206] Aligned Anchor Groups Guided Line Segment Detector

Zeyu Li,Annan Shu

Main category: cs.CV

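TL;DR: This paper proposes AAGLSD, a novel line segment detector that uses a hierarchical approach to extract candidate pixels at different saliency levels and relies on aligned anchor groups to guide detection, achieving high precision and completeness.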

Details

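Motivation: Existing line segment detection methods may fail to extract line segments completely and accurately in complex scenes, so a more efficient and precise detection algorithm is needed. Method: AAGLSD uses a hierarchical approach to extract candidate pixels, including regular anchors and aligned anchor groups. Starting from these anchor groups, the algorithm sequentially links anchors while simultaneously updating the currently predicted line segment. Final predictions are obtained through simple validation and merging of adjacent line segments, avoiding complex refinement strategies. Result: Quantitative experiments on multiple datasets show that AAGLSD extracts complete line segments more effectively than other advanced line segment detectors while maintaining high precision. Conclusion: AAGLSD is an efficient and accurate line segment detection method suited to extracting line segments from complex images, with practical application value. Abstract: This paper introduces a novel line segment detector, the Aligned Anchor Groups guided Line Segment Detector (AAGLSD), designed to detect line segments from images with high precision and completeness. The algorithm employs a hierarchical approach to extract candidate pixels with different saliency levels, including regular anchors and aligned anchor groups. AAGLSD initiates from these aligned anchor groups, sequentially linking anchors and updating the currently predicted line segment simultaneously. The final predictions are derived through straightforward validation and merging of adjacent line segments, avoiding complex refinement strategies. AAGLSD is evaluated on various datasets and quantitative experiments demonstrate that the proposed method can effectively extract complete line segments from input images compared to other advanced line segment detectors. The implementation is available at https://github.com/LLiDaBao/AAGLSD.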

[207] Diffusion-Based Image-to-Brain Signal Generation with Cross-Attention Mechanisms for Visual Prostheses

Ganxi Xu,Jinyi Long,Jia Zhang

Main category: cs.CV

TL;DR: This paper proposes a novel image-to-brain framework using diffusion models with cross-attention to generate realistic brain signals for visual prostheses.

Details

Motivation: Visual prostheses require brain signals with high biological similarity for effective vision restoration, but existing approaches lack supervised signals from real brain responses to validate the plausibility of predicted stimuli. Method: The framework combines a pre-trained CLIP visual encoder for extracting semantic representations from images and a cross-attention enhanced U-Net diffusion model that reconstructs brain signals through iterative denoising. Cross-attention modules enable dynamic interaction between visual features and brain signal representations, allowing for fine-grained alignment. Result: The proposed framework successfully generates biologically plausible brain signals and demonstrates intra-subject and inter-subject variations through visualization of M/EEG topographies on both THINGS-EEG2 and THINGS-MEG datasets. Conclusion: The proposed image-to-brain framework based on DDPMs with cross-attention mechanisms demonstrates effectiveness in generating biologically plausible brain signals, as validated on two multimodal datasets (THINGS-EEG2 and THINGS-MEG). Abstract: Visual prostheses have shown great potential in restoring vision for blind individuals. On the one hand, researchers have been continuously improving the brain decoding framework of visual prostheses by leveraging the powerful image generation capabilities of diffusion models. On the other hand, the brain encoding stage of visual prostheses struggles to generate brain signals with sufficient biological similarity. Although existing works have recognized this problem, the quality of predicted stimuli still remains a critical issue, as existing approaches typically lack supervised signals from real brain responses to validate the biological plausibility of predicted stimuli. To address this issue, we propose a novel image-to-brain framework based on denoising diffusion probabilistic models (DDPMs) enhanced with cross-attention mechanisms. Our framework consists of two key architectural components: a pre-trained CLIP visual encoder that extracts rich semantic representations from input images, and a cross-attention enhanced U-Net diffusion model that learns to reconstruct biologically plausible brain signals through iterative denoising. Unlike conventional generative models that rely on simple concatenation for conditioning, our cross-attention modules enable dynamic interaction between visual features and brain signal representations, facilitating fine-grained alignment during the generation process. We evaluate our framework on two multimodal datasets (THINGS-EEG2 and THINGS-MEG) to demonstrate its effectiveness in generating biologically plausible brain signals. Moreover, we visualize the training and test M/EEG topographies for all subjects on both datasets to intuitively demonstrate the intra-subject variations and inter-subject variations in M/EEG signals.
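
To make the contrast with simple concatenation concrete, here is a minimal sketch of generic single-head scaled dot-product cross-attention, in which brain-signal tokens act as queries and visual tokens act as keys and values. It omits the learned projection matrices and everything specific to the paper's U-Net; the token values and the function name `cross_attention` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output is a weighted mix of the value vectors.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs

signal_tokens = [[1.0, 0.0], [0.0, 1.0]]   # queries: brain-signal features (toy values)
visual_tokens = [[5.0, 0.0], [0.0, 5.0]]   # keys and values: image features (toy values)
attended = cross_attention(signal_tokens, visual_tokens, visual_tokens)
```

Each signal token receives its own mixture of visual features, whereas concatenation would hand every position the same conditioning vector.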

[208] OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving

Pei Liu,Qingtian Ning,Xinyan Lu,Haipeng Liu,Weiliang Ma,Dangen She,Peng Jia,Xianpeng Lang,Jun Ma

Main category: cs.CV

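TL;DR: This paper proposes the OmniReason framework, which strengthens spatiotemporal reasoning and decision explanation for autonomous driving in dynamic environments by building VLA datasets with spatiotemporal annotations and designing the OmniReason-Agent model with a memory module.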

Details

Motivation: Existing vision-language models (VLMs) show good spatial reasoning capabilities for autonomous driving but focus mainly on static scene understanding, neglecting the temporal dimension of real-world driving scenarios. Method: (1) OmniReason-Data: large-scale datasets with spatiotemporal annotations and natural language explanations, generated through a hallucination-mitigated auto-labeling pipeline; (2) OmniReason-Agent: an architecture that integrates a sparse temporal memory module to model persistent scene context and uses spatiotemporal knowledge distillation to build an interpretable decision model. Result: Experiments show that OmniReason-Agent performs strongly on open-loop planning tasks and visual question answering (VQA) benchmarks, while enabling interpretable, temporally aware autonomous driving in complex, dynamic environments. Conclusion: By jointly modeling dynamic 3D environments and decision-making processes, the OmniReason framework addresses the spatiotemporal reasoning shortcomings of existing autonomous driving models and offers a new direction and benchmark for future research. Abstract: Recent advances in vision-language models (VLMs) have demonstrated impressive spatial reasoning capabilities for autonomous driving, yet existing methods predominantly focus on static scene understanding while neglecting the essential temporal dimension of real-world driving scenarios. To address this critical limitation, we propose the OmniReason framework, which establishes robust spatiotemporal reasoning by jointly modeling dynamic 3D environments and their underlying decision-making processes. Our work makes two fundamental advances: (1) We introduce OmniReason-Data, two large-scale vision-language-action (VLA) datasets with dense spatiotemporal annotations and natural language explanations, generated through a novel hallucination-mitigated auto-labeling pipeline that ensures both physical plausibility and temporal coherence; (2) We develop the OmniReason-Agent architecture, which integrates a sparse temporal memory module for persistent scene context modeling and an explanation generator that produces human-interpretable decision rationales, facilitated by our spatiotemporal knowledge distillation approach that effectively captures spatiotemporal causal reasoning patterns. Comprehensive experiments demonstrate state-of-the-art performance, where OmniReason-Agent achieves significant improvements in both open-loop planning tasks and visual question answering (VQA) benchmarks, while establishing new capabilities for interpretable, temporally-aware autonomous vehicles operating in complex, dynamic environments.

[209] Multimodal Iterative RAG for Knowledge Visual Question Answering

Changin Choi,Wonseok Lee,Jungmin Ko,Wonjong Rhee

Main category: cs.CV

TL;DR: MI-RAG improves knowledge-intensive visual question answering through multimodal iterative retrieval and reasoning.

Details

Motivation: Conventional RAG frameworks cannot gather sufficient external knowledge in a single retrieval pass, which limits model performance on knowledge-intensive visual questions. Method: The paper proposes MI-RAG, a multimodal iterative RAG framework that uses reasoning to enhance retrieval and updates reasoning across modalities. Result: On benchmarks including Encyclopedic VQA, InfoSeek, and OK-VQA, MI-RAG significantly improves retrieval recall and answer accuracy. Conclusion: MI-RAG is a scalable approach that improves retrieval recall and answer accuracy in knowledge-intensive VQA. Abstract: While Multimodal Large Language Models (MLLMs) have significantly advanced multimodal understanding, their performance remains limited on knowledge-intensive visual questions that require external knowledge beyond the image. Although Retrieval-Augmented Generation (RAG) has become a promising solution for providing models with external knowledge, its conventional single-pass framework often fails to gather sufficient knowledge. To overcome this limitation, we propose MI-RAG, a Multimodal Iterative RAG framework that leverages reasoning to enhance retrieval and update reasoning over newly retrieved knowledge across modalities. At each iteration, MI-RAG leverages an accumulated reasoning record to dynamically formulate a multi-query. These queries then drive a joint search across heterogeneous knowledge bases containing both visually-grounded and textual knowledge. The newly acquired knowledge is synthesized into the reasoning record, progressively refining understanding across iterations. Experiments on challenging benchmarks, including Encyclopedic VQA, InfoSeek, and OK-VQA, show that MI-RAG significantly improves both retrieval recall and answer accuracy, establishing a scalable approach for compositional reasoning in knowledge-intensive VQA.
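
The iterate-retrieve-synthesize loop described in the abstract can be sketched with toy components. Everything here is a hypothetical stand-in: `formulate_queries` replaces the model's query generation, `search` is a keyword lookup rather than a real retriever, and the two dictionaries play the role of the visually-grounded and textual knowledge bases.

```python
def formulate_queries(question, record):
    """Hypothetical stand-in for the model: query on the question plus
    every entity mentioned in the reasoning record so far."""
    return [question] + [fact["about"] for fact in record]

def search(query, knowledge_base):
    """Toy keyword retriever: return entries whose key occurs in the query."""
    return [entry for key, entry in knowledge_base.items() if key in query]

def iterative_rag(question, visual_kb, text_kb, max_iters=3):
    record = []  # accumulated reasoning record
    for _ in range(max_iters):
        new_facts = []
        for query in formulate_queries(question, record):
            # Joint search over both heterogeneous knowledge bases.
            for fact in search(query, visual_kb) + search(query, text_kb):
                if fact not in record and fact not in new_facts:
                    new_facts.append(fact)
        if not new_facts:
            break  # nothing new was retrieved; stop iterating
        record.extend(new_facts)
    return record

visual_kb = {
    "this tower": {"about": "Eiffel Tower",
                   "text": "The image shows the Eiffel Tower."},
}
text_kb = {
    "Eiffel Tower": {"about": "Gustave Eiffel",
                     "text": "The Eiffel Tower was built by Gustave Eiffel's company."},
}
record = iterative_rag("Who built this tower?", visual_kb, text_kb)
```

In this toy run a single pass finds only the visual fact; the textual fact becomes reachable in the second iteration, once the first result has been folded into the queries.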

[210] SWAGSplatting: Semantic-guided Water-scene Augmented Gaussian Splatting

Zhuodong Jiang,Haoran Wang,Guoxi Huang,Brett Seymour,Nantheera Anantrasirichai

Main category: cs.CV

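TL;DR: This paper proposes a new framework that leverages multimodal cross-knowledge to create semantic-guided 3D Gaussian Splatting for robust, high-fidelity deep-sea scene reconstruction.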

Details

Motivation: Accurate underwater 3D reconstruction remains a complex challenge because of light distortion, turbidity, and limited visibility. Existing AI-based methods have yet to fully exploit the potential of AI, particularly in integrating language models with visual processing. Method: By embedding an extra semantic feature into each Gaussian primitive, supervised by CLIP-extracted semantic features, the method enforces semantic and structural awareness throughout training. A dedicated semantic consistency loss ensures alignment with high-level scene understanding. A new stage-wise training strategy, combining coarse-to-fine learning with late-stage parameter refinement, further improves stability and reconstruction quality. Result: Extensive experiments show that the method consistently outperforms state-of-the-art methods on the SeaThru-NeRF and Submerged3D datasets across three metrics, with an average PSNR improvement of up to 3.09 dB. Conclusion: The method is a strong candidate for applications in underwater exploration and marine perception. Abstract: Accurate 3D reconstruction in underwater environments remains a complex challenge due to issues such as light distortion, turbidity, and limited visibility. AI-based techniques have been applied to address these issues; however, existing methods have yet to fully exploit the potential of AI, particularly in integrating language models with visual processing. In this paper, we propose a novel framework that leverages multimodal cross-knowledge to create semantic-guided 3D Gaussian Splatting for robust and high-fidelity deep-sea scene reconstruction. By embedding an extra semantic feature into each Gaussian primitive and supervising it with the CLIP-extracted semantic feature, our method enforces semantic and structural awareness throughout the training. The dedicated semantic consistency loss ensures alignment with high-level scene understanding. In addition, we propose a novel stage-wise training strategy, combining coarse-to-fine learning with late-stage parameter refinement, to further enhance both stability and reconstruction quality. Extensive results show that our approach consistently outperforms state-of-the-art methods on SeaThru-NeRF and Submerged3D datasets across three metrics, with an improvement of up to 3.09 dB on average in terms of PSNR, making it a strong candidate for applications in underwater exploration and marine perception.

[211] Adaptive Contrast Adjustment Module: A Clinically-Inspired Plug-and-Play Approach for Enhanced Fetal Plane Classification

Yang Chen,Sanglin Zhao,Baoyu Chen,Mans Gustaf

Main category: cs.CV

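TL;DR: This paper proposes a new adaptive contrast adjustment module (ACAM) that improves the accuracy and robustness of fetal ultrasound standard plane classification through content-aware contrast enhancement and a multi-view fusion strategy, achieving notable performance gains.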

Details

Motivation: Fetal ultrasound images are difficult to classify because of low tissue contrast, boundary ambiguity, and operator-dependent variations in image quality, so a more reliable method for image preprocessing and analysis is needed. Method: The paper proposes a plug-and-play adaptive contrast adjustment module (ACAM) that uses a shallow texture-sensitive network to predict clinically plausible contrast parameters and generates multiple enhanced views through differentiable mapping, which are fused for classification. Result: Validated on a multi-center dataset of 12,400 images, ACAM improves the accuracy of lightweight, traditional, and state-of-the-art models by 2.02%, 1.29%, and 1.15%, respectively. Conclusion: Through content-aware contrast adjustment and multi-view fusion, ACAM improves the accuracy and robustness of fetal ultrasound standard plane classification and offers a new paradigm for medical image analysis. Abstract: Fetal ultrasound standard plane classification is essential for reliable prenatal diagnosis but faces inherent challenges, including low tissue contrast, boundary ambiguity, and operator-dependent image quality variations. To overcome these limitations, we propose a plug-and-play adaptive contrast adjustment module (ACAM), whose core design is inspired by the clinical practice of doctors adjusting image contrast to obtain clearer and more discriminative structural information. The module employs a shallow texture-sensitive network to predict clinically plausible contrast parameters, transforms input images into multiple contrast-enhanced views through differentiable mapping, and fuses them within downstream classifiers. Validated on a multi-center dataset of 12,400 images across six anatomical categories, the module consistently improves performance across diverse models, with accuracy of lightweight models increasing by 2.02 percent, accuracy of traditional models increasing by 1.29 percent, and accuracy of state-of-the-art models increasing by 1.15 percent. The innovation of the module lies in its content-aware adaptation capability, replacing random preprocessing with physics-informed transformations that align with sonographer workflows while improving robustness to imaging heterogeneity through multi-view fusion. This approach effectively bridges low-level image features with high-level semantics, establishing a new paradigm for medical image analysis under real-world image quality variations.
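
The abstract does not specify which differentiable mapping turns predicted contrast parameters into views. Purely as an assumption for illustration, the sketch below uses a gamma curve on intensities in [0, 1], which is smooth in its parameter; the gamma values are hard-coded here, whereas the paper describes a network predicting the parameters. The names `adjust_contrast` and `multi_view` are hypothetical.

```python
def adjust_contrast(image, gamma):
    """Gamma mapping on pixel intensities in [0, 1] (assumed mapping)."""
    return [[pixel ** gamma for pixel in row] for row in image]

def multi_view(image, gammas):
    """Produce one contrast-adjusted view per parameter value."""
    return [adjust_contrast(image, g) for g in gammas]

image = [[0.25, 0.5], [0.75, 1.0]]                 # toy 2x2 grayscale image
views = multi_view(image, gammas=[0.5, 1.0, 2.0])  # brighter, unchanged, darker
```

A downstream classifier would then consume all views together; how they are fused is left open here because the abstract gives no details.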

[212] Sequential Difference Maximization: Generating Adversarial Examples via Multi-Stage Optimization

Xinlei Liu,Tao Hu,Peng Yi,Weitao Han,Jichao Xie,Baolin Li

Main category: cs.CV

TL;DR: The paper proposes a new gradient-based attack method, Sequential Difference Maximization (SDM), for generating adversarial examples in computer vision models, which demonstrates stronger attack performance and higher attack cost-effectiveness than current state-of-the-art methods.

Details

Motivation: Efficient adversarial attack methods are critical for assessing the robustness of computer vision models. Method: The paper proposes a gradient-based attack method termed Sequential Difference Maximization (SDM) which establishes a three-layer optimization framework of "cycle-stage-step". In the initial stage, the negative probability of the true label is used as the loss function to compress the solution space. In subsequent stages, the Directional Probability Difference Ratio (DPDR) loss function is introduced to gradually increase the non-true labels' probability upper bound by compressing the irrelevant labels' probabilities. Result: Experiments demonstrate that SDM not only exhibits stronger attack performance but also achieves higher attack cost-effectiveness compared with previous SOTA methods. Conclusion: SDM not only exhibits stronger attack performance but also achieves higher attack cost-effectiveness compared with previous SOTA methods. Additionally, SDM can be combined with adversarial training methods to enhance their defensive effects. Abstract: Efficient adversarial attack methods are critical for assessing the robustness of computer vision models. In this paper, we reconstruct the optimization objective for generating adversarial examples as "maximizing the difference between the non-true labels' probability upper bound and the true label's probability," and propose a gradient-based attack method termed Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step." The processes between cycles and between iterative steps are respectively identical, while optimization stages differ in terms of loss functions: in the initial stage, the negative probability of the true label is used as the loss function to compress the solution space; in subsequent stages, we introduce the Directional Probability Difference Ratio (DPDR) loss function to gradually increase the non-true labels' probability upper bound by compressing the irrelevant labels' probabilities. Experiments demonstrate that compared with previous SOTA methods, SDM not only exhibits stronger attack performance but also achieves higher attack cost-effectiveness. Additionally, SDM can be combined with adversarial training methods to enhance their defensive effects. The code is available at https://github.com/X-L-Liu/SDM.
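
The stated objective and the initial-stage loss can be written down directly. The sketch below reads "the non-true labels' probability upper bound" as the largest probability among non-true labels, which is an interpretation rather than something the abstract spells out, and it does not attempt the DPDR loss because its formula is not given here. Function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def probability_difference(probs, true_label):
    """Largest non-true-label probability minus the true-label probability."""
    upper = max(p for j, p in enumerate(probs) if j != true_label)
    return upper - probs[true_label]

def initial_stage_loss(probs, true_label):
    """Initial-stage loss as described: the negative true-label probability."""
    return -probs[true_label]

probs = softmax([2.0, 1.0, 0.0])                    # toy logits, true class is 0
gap = probability_difference(probs, true_label=0)   # negative: still classified correctly
```

Under this reading, the prediction flips exactly when the difference becomes positive, which is why an attack would try to maximize it.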

[213] Surface Defect Detection with Gabor Filter Using Reconstruction-Based Blurring U-Net-ViT

Jongwook Si,Sungyoung Kim

Main category: cs.CV

TL;DR: This paper introduces a novel U-Net-ViT model enhanced with Gabor filters and a Gaussian filter-based loss function to improve surface defect detection accuracy in textured images, achieving high performance across multiple datasets.

Details

Motivation: To improve the accuracy and reliability of texture-based surface defect detection, especially in noisy environments where background noise and texture complexity challenge traditional methods. Method: The paper uses a blurring U-Net-ViT model to combine local feature training with global processing, integrates a Gaussian filter-based loss function to remove noise, applies Salt-and-Pepper masking to reinforce defect boundaries, and utilizes Gabor filters in post-processing to emphasize defect characteristics. Result: The model achieves an average AUC of 0.939 across multiple datasets, including MVTec-AD, Surface Crack Detection, and Marble Surface Anomaly Dataset, with ablation studies confirming the effectiveness of key parameters like filter size and noise probability. Conclusion: The proposed method combining U-Net-ViT with Gabor filters and a Gaussian filter-based loss function achieves high accuracy in texture-based surface defect detection, validated by ablation studies and performance metrics. Abstract: This paper proposes a novel approach to enhance the accuracy and reliability of texture-based surface defect detection using Gabor filters and a blurring U-Net-ViT model. By combining the local feature training of U-Net with the global processing of the Vision Transformer (ViT), the model effectively detects defects across various textures. A Gaussian filter-based loss function removes background noise and highlights defect patterns, while Salt-and-Pepper (SP) masking in the training process reinforces texture-defect boundaries, ensuring robust performance in noisy environments. Gabor filters are applied in post-processing to emphasize defect orientation and frequency characteristics. Parameter optimization, including filter size, sigma, wavelength, gamma, and orientation, maximizes performance across datasets like MVTec-AD, Surface Crack Detection, and Marble Surface Anomaly Dataset, achieving an average Area Under the Curve (AUC) of 0.939. The ablation studies validate that the optimal filter size and noise probability significantly enhance defect detection performance.
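
The Gabor parameters named in the abstract (filter size, sigma, wavelength, gamma, orientation) map onto the standard real-valued Gabor kernel, a Gaussian envelope multiplied by a cosine carrier. The sketch below builds that textbook kernel; it is not the authors' post-processing code, the parameter values are arbitrary, and the phase offset `psi` is an extra parameter of the general formula that the abstract does not mention.

```python
import math

def gabor_kernel(size, sigma, wavelength, gamma, theta, psi=0.0):
    """Real part of a 2D Gabor filter as a size x size nested list (size odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope; gamma sets the aspect ratio of the ellipse.
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            # Cosine carrier along the rotated x axis.
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

kernel = gabor_kernel(size=5, sigma=2.0, wavelength=4.0, gamma=0.5, theta=0.0)
```

Convolving an image with kernels at several values of `theta` responds most strongly to structures oriented along each direction, which is the sense in which Gabor filtering emphasizes orientation and frequency.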

[214] UPGS: Unified Pose-aware Gaussian Splatting for Dynamic Scene Deblurring

Zhijing Wu,Longguang Wang

Main category: cs.CV

TL;DR: This paper proposes a unified optimization framework for reconstructing dynamic 3D scenes from monocular video, effectively addressing motion blur issues and improving both reconstruction quality and pose estimation accuracy.

Details

Motivation: Motion blur caused by camera and object motion often leads to failure in reconstructing dynamic 3D scenes from monocular video. Existing two-step methods suffer from pose estimation errors that accumulate and result in inferior reconstructions. Method: A unified optimization framework is introduced, incorporating camera poses as learnable parameters alongside 3D Gaussian attributes. A three-stage training schedule optimizes camera poses and Gaussians either separately or together. Result: Extensive experiments on the Stereo Blur dataset and real-world sequences show significant improvements over prior dynamic deblurring methods. Conclusion: The proposed method significantly improves the reconstruction quality and pose estimation accuracy in dynamic 3D scenes from monocular video, especially in the presence of motion blur. Abstract: Reconstructing dynamic 3D scenes from monocular video has broad applications in AR/VR, robotics, and autonomous navigation, but often fails due to severe motion blur caused by camera and object motion. Existing methods commonly follow a two-step pipeline, where camera poses are first estimated and then 3D Gaussians are optimized. Since blurring artifacts usually undermine pose estimation, pose errors can accumulate and produce inferior reconstruction results. To address this issue, we introduce a unified optimization framework by incorporating camera poses as learnable parameters complementary to 3DGS attributes for end-to-end optimization. Specifically, we recast camera and object motion as per-primitive SE(3) affine transformations on 3D Gaussians and formulate a unified optimization objective. For stable optimization, we introduce a three-stage training schedule that optimizes camera poses and Gaussians alternately. In particular, 3D Gaussians are first trained with poses being fixed, and then poses are optimized with 3D Gaussians being untouched. Finally, all learnable parameters are optimized together. Extensive experiments on the Stereo Blur dataset and challenging real-world sequences demonstrate that our method achieves significant gains in reconstruction quality and pose estimation accuracy over prior dynamic deblurring methods.
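
The three-stage schedule (Gaussians with fixed poses, then poses with fixed Gaussians, then everything jointly) amounts to choosing which parameter groups receive updates at each step. A minimal sketch, with hypothetical group names and stage lengths that are not taken from the paper:

```python
def trainable_groups(step, stage1_steps, stage2_steps):
    """Return which parameter groups are optimized at a given training step."""
    if step < stage1_steps:
        return {"gaussians"}              # stage 1: poses stay fixed
    if step < stage1_steps + stage2_steps:
        return {"poses"}                  # stage 2: Gaussians stay untouched
    return {"gaussians", "poses"}         # stage 3: joint optimization

# Toy run with two steps per early stage.
schedule = [sorted(trainable_groups(s, 2, 2)) for s in range(5)]
```

In a real training loop the returned set would decide which optimizer parameter groups are stepped or which tensors have gradients enabled.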

[215] SegDINO: An Efficient Design for Medical and Natural Image Segmentation with DINO-V3

Sicheng Yang,Hongqiu Wang,Zhaohu Xing,Sixiang Chen,Lei Zhu

Main category: cs.CV

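TL;DR: SegDINO is an efficient image segmentation framework that pairs a frozen DINOv3 backbone with a lightweight decoder, achieving state-of-the-art performance on multiple datasets while reducing model complexity.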

Details

Motivation: Although the DINO family of self-supervised vision models performs well, existing approaches to adapting them for segmentation typically rely on complex decoder structures that increase parameter count and computational cost, so a more efficient approach is needed. Method: SegDINO uses a lightweight decoder that directly predicts segmentation masks, aligning multi-level features to a common resolution and channel width to minimize trainable parameters. Result: Across six benchmarks, comprising three medical datasets and three natural image datasets, SegDINO consistently outperforms existing methods while keeping the number of trainable parameters low. Conclusion: SegDINO is an efficient segmentation framework that combines a frozen DINOv3 backbone with a lightweight decoder, achieving state-of-the-art performance on multiple datasets while reducing the number of trainable parameters. Abstract: The DINO family of self-supervised vision models has shown remarkable transferability, yet effectively adapting their representations for segmentation remains challenging. Existing approaches often rely on heavy decoders with multi-scale fusion or complex upsampling, which introduce substantial parameter overhead and computational cost. In this work, we propose SegDINO, an efficient segmentation framework that couples a frozen DINOv3 backbone with a lightweight decoder. SegDINO extracts multi-level features from the pretrained encoder, aligns them to a common resolution and channel width, and utilizes a lightweight MLP head to directly predict segmentation masks. This design minimizes trainable parameters while preserving the representational power of foundation features. Extensive experiments across six benchmarks, including three medical datasets (TN3K, Kvasir-SEG, ISIC) and three natural image datasets (MSD, VMD-D, ViSha), demonstrate that SegDINO consistently achieves state-of-the-art performance compared to existing methods. Code is available at https://github.com/script-Yang/SegDINO.

[216] Satellite Image Utilization for Dehazing with Swin Transformer-Hybrid U-Net and Watershed loss

Jongwook Si,Sungyoung Kim

Main category: cs.CV

TL;DR: This paper proposes SUFERNOBWA, a hybrid dehazing framework that integrates Swin Transformer and U-Net, for mitigating atmospheric interference in satellite imagery. It achieves superior performance on RICE and SateHaze1K datasets compared to existing methods.

Details

Motivation: Satellite imagery plays a crucial role in various fields; however, atmospheric interference and haze significantly degrade image clarity and reduce the accuracy of information extraction. Method: This paper proposes a hybrid dehazing framework that integrates Swin Transformer and U-Net to balance global context learning and local detail restoration, called SUFERNOBWA. The proposed network employs SwinRRDB, a Swin Transformer-based Residual-in-Residual Dense Block, in both the encoder and decoder to effectively extract features. This architecture enables robust dehazing under diverse atmospheric conditions while maintaining structural consistency across restored images. Result: Experimental results demonstrate that the proposed method outperforms state-of-the-art models on both the RICE and SateHaze1K datasets. Specifically, on the RICE dataset, the proposed approach achieved a PSNR of 33.24 dB and an SSIM of 0.967, which is a significant improvement over existing method. Conclusion: This paper concludes that SUFERNOBWA provides an effective solution for mitigating atmospheric interference in satellite imagery and highlights its potential applicability across diverse remote sensing applications. Abstract: Satellite imagery plays a crucial role in various fields; however, atmospheric interference and haze significantly degrade image clarity and reduce the accuracy of information extraction. To address these challenges, this paper proposes a hybrid dehazing framework that integrates Swin Transformer and U-Net to balance global context learning and local detail restoration, called SUFERNOBWA. The proposed network employs SwinRRDB, a Swin Transformer-based Residual-in-Residual Dense Block, in both the encoder and decoder to effectively extract features. This module enables the joint learning of global contextual information and fine spatial structures, which is crucial for structural preservation in satellite image. 
Furthermore, we introduce a composite loss function that combines L2 loss, guided loss, and a novel watershed loss, which enhances structural boundary preservation and ensures pixel-level accuracy. This architecture enables robust dehazing under diverse atmospheric conditions while maintaining structural consistency across restored images. Experimental results demonstrate that the proposed method outperforms state-of-the-art models on both the RICE and SateHaze1K datasets. Specifically, on the RICE dataset, the proposed approach achieved a PSNR of 33.24 dB and an SSIM of 0.967, which is a significant improvement over existing methods. This study provides an effective solution for mitigating atmospheric interference in satellite imagery and highlights its potential applicability across diverse remote sensing applications.
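
For readers unfamiliar with the PSNR figure quoted in this entry, the sketch below shows the standard definition, PSNR = 10 log10(MAX^2 / MSE). It is a minimal illustration that assumes 8-bit pixel values flattened into plain Python lists; the function name `psnr` is hypothetical and does not come from the paper, and the paper's exact evaluation protocol is not given in the abstract.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher value means the restored image is closer to the reference; identical inputs give infinity.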

[217] Look Beyond: Two-Stage Scene View Generation via Panorama and Video Diffusion

Xueyang Kang,Zhengkang Xiang,Zezheng Zhang,Kourosh Khoshelham

Main category: cs.CV

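TL;DR: This paper proposes a novel view synthesis method based on a panorama diffusion model and a video diffusion model, achieving better global consistency and view coherence over long-range or looped trajectories.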

Details

Motivation: Single-image novel view synthesis (NVS) is highly ill-posed because of large unobserved regions, especially for views that deviate significantly from the input. Existing methods focus on consistency between the source and generated views but struggle to maintain coherence and correct view alignment across long-range or looped trajectories. Method: Single-view NVS is decomposed into two stages, 360-degree scene extrapolation and novel view interpolation. In the first stage, a panorama diffusion model learns the scene prior from the input perspective image. In the second stage, keyframes are extracted and warped from the generated panorama and used as anchor frames, and a pre-trained video diffusion model generates novel views through a spatial noise diffusion process. Result: Experiments on diverse scene datasets show that the method generates coherent, globally consistent novel views along user-defined trajectories and performs particularly well in loop closure scenarios. Conclusion: Compared with existing methods, the approach generates more coherent views along user-defined trajectories, achieves global consistency even in loop closure scenarios, and supports flexible camera control. Abstract: Novel view synthesis (NVS) from a single image is highly ill-posed due to large unobserved regions, especially for views that deviate significantly from the input. While existing methods focus on consistency between the source and generated views, they often fail to maintain coherence and correct view alignment across long-range or looped trajectories. We propose a model that addresses this by decomposing single-view NVS into a 360-degree scene extrapolation followed by novel view interpolation. This design ensures long-term view and scene consistency by conditioning on keyframes extracted and warped from a generated panoramic representation. In the first stage, a panorama diffusion model learns the scene prior from the input perspective image. Perspective keyframes are then sampled and warped from the panorama and used as anchor frames in a pre-trained video diffusion model, which generates novel views through a proposed spatial noise diffusion process. Compared to prior work, our method produces globally consistent novel views -- even in loop closure scenarios -- while enabling flexible camera control. Experiments on diverse scene datasets demonstrate that our approach outperforms existing methods in generating coherent views along user-defined trajectories. Our implementation is available at https://github.com/YiGuYT/LookBeyond.

[218] Quantization Meets OOD: Generalizable Quantization-aware Training from a Flatness Perspective

Jiacheng Jiang,Yuan Meng,Chen Tang,Han Yu,Qun Li,Zhi Wang,Wenwu Zhu

Main category: cs.CV

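TL;DR: This paper proposes FQAT, which improves the generalization of quantization-aware training on out-of-distribution data by addressing gradient conflict and inter-layer interference.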

Details

Motivation: Existing QAT methods improve the performance of quantized models on in-distribution data while overlooking the potential performance degradation on out-of-distribution data. Method: FQAT is proposed, comprising a layer-wise freezing mechanism and a disorder-guided adaptive freezing algorithm, to address gradient conflict and inter-layer interference. Result: The authors verify that QAT sharpens the loss landscape, which harms OOD generalization, and propose FQAT to improve the flatness of the loss landscape. Conclusion: Experiments show that FQAT outperforms state-of-the-art baselines on both I.D and OOD image classification tasks. Abstract: Current quantization-aware training (QAT) methods primarily focus on enhancing the performance of quantized models on in-distribution (I.D) data, while overlooking the potential performance degradation on out-of-distribution (OOD) data. In this paper, we first substantiate this problem through rigorous experiments, showing that QAT can lead to a significant OOD generalization performance degradation. Further, we find that this problem can be attributed to a contradiction: flatness of the loss landscape gives rise to superior OOD generalization, whereas QAT leads to a sharp loss landscape. Therefore, we propose a flatness-oriented QAT method, FQAT, to achieve generalizable QAT. Specifically, i) FQAT introduces a layer-wise freezing mechanism to mitigate the gradient conflict issue between dual optimization objectives (i.e., vanilla QAT and flatness). ii) FQAT proposes a disorder-guided adaptive freezing algorithm to dynamically determine which layers to freeze at each training step, effectively addressing the challenges caused by interference between layers. A gradient disorder metric is designed to help the algorithm identify unstable layers during training. Extensive experiments on influential OOD benchmarks demonstrate the superiority of our method over state-of-the-art baselines under both I.D and OOD image classification tasks.
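
The idea of freezing layers whose two objectives pull in opposite directions can be illustrated with a small sketch. The abstract does not define the gradient disorder metric, so the code below substitutes a common stand-in as an explicit assumption: a layer is flagged when the cosine similarity between its task gradient and its flatness gradient falls below a threshold. The names `cosine` and `layers_to_freeze` and the dictionary layout are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def layers_to_freeze(task_grads, flat_grads, threshold=0.0):
    """Return layer names whose two gradients conflict (cosine below threshold).

    Assumption: conflict is measured by cosine similarity, not by the
    paper's gradient disorder metric, which the abstract does not specify.
    """
    return [
        name
        for name in task_grads
        if cosine(task_grads[name], flat_grads[name]) < threshold
    ]
```

In a real training loop this check would run at each step on per-layer gradients, and the flagged layers would be excluded from the update for that step.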

[219] Pose as Clinical Prior: Learning Dual Representations for Scoliosis Screening

Zirui Zhou,Zizhao Peng,Dongyang Jin,Chao Fan,Fengwei An,Shiqi Yu

Main category: cs.CV

TL;DR: This paper introduces Scoliosis1K-Pose, a large-scale 2D pose dataset for scoliosis screening, and proposes the Dual Representation Framework (DRF) that effectively integrates continuous skeleton maps and discrete Postural Asymmetry Vectors (PAV) to achieve state-of-the-art performance in scoliosis screening.

Details

Motivation: Traditional scoliosis screening methods rely on postural asymmetries, which are often neglected in AI-based approaches that focus on silhouette datasets. Pose data can provide better clinical interpretability, but pose-based scoliosis screening is underexplored due to limited annotated datasets and the difficulty in modeling subtle asymmetries from raw pose coordinates. Method: The authors introduced the Scoliosis1K-Pose dataset, a large-scale 2D pose annotation set for scoliosis screening. They also proposed the Dual Representation Framework (DRF), which combines a continuous skeleton map with a discrete Postural Asymmetry Vector (PAV). The PAV-Guided Attention (PGA) module uses the PAV as a clinical prior to direct feature extraction from the skeleton map. Result: The Dual Representation Framework (DRF) achieves state-of-the-art performance in scoliosis screening. The visualizations confirm that the model effectively uses clinical asymmetry cues to guide feature extraction and synergizes the dual representations. Conclusion: The study concludes that the proposed Dual Representation Framework (DRF) achieves state-of-the-art performance in pose-based scoliosis screening by effectively leveraging clinically relevant asymmetry cues, and that the Scoliosis1K-Pose dataset addresses the lack of annotated pose datasets in this domain. Abstract: Recent AI-based scoliosis screening methods primarily rely on large-scale silhouette datasets, often neglecting clinically relevant postural asymmetries, which are key indicators in traditional screening. In contrast, pose data provide an intuitive skeletal representation, enhancing clinical interpretability across various medical applications. However, pose-based scoliosis screening remains underexplored due to two main challenges: (1) the scarcity of large-scale, annotated pose datasets; and (2) the discrete and noise-sensitive nature of raw pose coordinates, which hinders the modeling of subtle asymmetries. 
To address these limitations, we introduce Scoliosis1K-Pose, a 2D human pose annotation set that extends the original Scoliosis1K dataset, comprising 447,900 frames of 2D keypoints from 1,050 adolescents. Building on this dataset, we introduce the Dual Representation Framework (DRF), which integrates a continuous skeleton map to preserve spatial structure with a discrete Postural Asymmetry Vector (PAV) that encodes clinically relevant asymmetry descriptors. A novel PAV-Guided Attention (PGA) module further uses the PAV as a clinical prior to direct feature extraction from the skeleton map, focusing on clinically meaningful asymmetries. Extensive experiments demonstrate that DRF achieves state-of-the-art performance. Visualizations further confirm that the model leverages clinical asymmetry cues to guide feature extraction and promote synergy between its dual representations. The dataset and code are publicly available at https://zhouzi180.github.io/Scoliosis1K/.
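
To make the notion of an asymmetry descriptor concrete, here is a purely illustrative sketch that derives three left-right asymmetry values from 2D keypoints. The abstract does not list the actual PAV components or how they are discretized, so the three descriptors (shoulder tilt, hip tilt, trunk shift), the joint names, and the function names are all assumptions rather than the authors' definition.

```python
import math


def midpoint(p, q):
    """Midpoint of two (x, y) points."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)


def asymmetry_vector(kp):
    """Hypothetical asymmetry descriptors from a dict of joint name -> (x, y).

    Values are normalized by torso length so they do not depend on image scale.
    """
    mid_shoulder = midpoint(kp["left_shoulder"], kp["right_shoulder"])
    mid_hip = midpoint(kp["left_hip"], kp["right_hip"])
    torso = math.dist(mid_shoulder, mid_hip)
    if torso == 0:
        raise ValueError("degenerate pose: shoulders and hips coincide")
    # Height difference between left and right joints.
    shoulder_tilt = (kp["left_shoulder"][1] - kp["right_shoulder"][1]) / torso
    hip_tilt = (kp["left_hip"][1] - kp["right_hip"][1]) / torso
    # Horizontal offset of the shoulder midline from the hip midline.
    trunk_shift = (mid_shoulder[0] - mid_hip[0]) / torso
    return [shoulder_tilt, hip_tilt, trunk_shift]
```

A perfectly symmetric pose yields a vector of zeros, and any nonzero entry points to the specific asymmetry that produced it, which is the kind of interpretability the entry attributes to pose-based screening.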

[220] Spotlighter: Revisiting Prompt Tuning from a Representative Mining View

Yutong Gao,Maoyuan Shao,Xinyang Huang,Chuang Zhu,Lijuan Sun,Yu Weng,Xuan Liu,Guoshun Nan

Main category: cs.CV

TL;DR: Spotlighter improves prompt tuning accuracy and efficiency by selecting the most relevant visual tokens and refining selection through a semantic memory bank and dynamic ranking.

Details

Motivation: CLIP's prompt tuning has shown success in cross-modal semantic alignment, but it suffers from noise and computational costs due to redundant or weakly relevant features. Method: Spotlighter uses a token-selection framework that evaluates visual tokens based on sample-wise and semantic-wise activation, retaining top-scoring tokens with the help of a semantic memory bank and a two-level ranking mechanism. Result: Spotlighter outperforms CLIP by up to 11.19% in harmonic mean accuracy and achieves up to 0.8K additional FPS, with only 21 extra parameters across 11 few-shot benchmarks. Conclusion: Spotlighter is an effective and scalable baseline for prompt tuning that improves accuracy and efficiency. Abstract: CLIP's success has demonstrated that prompt tuning can achieve robust cross-modal semantic alignment for tasks ranging from open-domain recognition to fine-grained classification. However, redundant or weakly relevant feature components introduce noise and incur unnecessary computational costs. In this work, we propose Spotlighter, a lightweight token-selection framework that simultaneously enhances accuracy and efficiency in prompt tuning. Spotlighter evaluates each visual token's activation from both sample-wise and semantic-wise perspectives and retains only the top-scoring tokens for downstream prediction. A class-specific semantic memory bank of learned prototypes refines this selection, ensuring semantic representativeness and compensating for discarded features. To further prioritize informative signals, we introduce a two-level ranking mechanism that dynamically weights token-prototype interactions. Across 11 few-shot benchmarks, Spotlighter outperforms CLIP by up to 11.19% in harmonic mean accuracy and achieves up to 0.8K additional FPS, with only 21 extra parameters. These results establish Spotlighter as an effective and scalable baseline for prompt tuning. 
Code for our method will be available at https://github.com/greatest-gourmet/Spotlighter.
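
The "harmonic mean accuracy" reported in this entry is not defined in the abstract. In prompt-tuning evaluations it conventionally means the harmonic mean of accuracy on base classes and accuracy on novel classes, H = 2BN / (B + N), and the sketch below assumes that convention. The function name and the example numbers are illustrative only and are not results from the paper.

```python
def harmonic_mean_accuracy(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (same units in and out)."""
    if base_acc < 0 or novel_acc < 0:
        raise ValueError("accuracies must be non-negative")
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


# Illustrative values only: the harmonic mean sits closer to the weaker score.
example = harmonic_mean_accuracy(80.0, 60.0)
```

Because the harmonic mean is dominated by the lower of the two numbers, a method cannot score well by excelling on base classes while failing on novel ones.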

[221] DarkVRAI: Capture-Condition Conditioning and Burst-Order Selective Scan for Low-light RAW Video Denoising

Youngjin Oh,Junhyeong Kwon,Junyoung Park,Nam Ik Cho

Main category: cs.CV

TL;DR: DarkVRAI tackles low-light RAW video denoising with a conditioning scheme and the BOSS mechanism, and took first place in the AIM 2025 challenge.

Details

Motivation: Low-light RAW video denoising is challenging because high sensor gain and short exposure times cause severe signal degradation, and these parameters are in turn constrained by video frame rate requirements. Method: The DarkVRAI framework is proposed, comprising a conditioning scheme and a Burst-Order Selective Scan (BOSS) mechanism. Result: DarkVRAI took first place in the AIM 2025 Low-light RAW Video Denoising Challenge and demonstrated state-of-the-art performance on a rigorous and realistic benchmark dataset. Conclusion: By combining the conditioning scheme with the BOSS mechanism, DarkVRAI performs strongly on low-light video denoising and sets a new standard for the field. Abstract: Low-light RAW video denoising is a fundamentally challenging task due to severe signal degradation caused by high sensor gain and short exposure times, which are inherently limited by video frame rate requirements. To address this, we propose DarkVRAI, a novel framework that achieved first place in the AIM 2025 Low-light RAW Video Denoising Challenge. Our method introduces two primary contributions: (1) a successful application of a conditioning scheme for image denoising, which explicitly leverages capture metadata, to video denoising to guide the alignment and denoising processes, and (2) a Burst-Order Selective Scan (BOSS) mechanism that effectively models long-range temporal dependencies within the noisy video sequence. By synergistically combining these components, DarkVRAI demonstrates state-of-the-art performance on a rigorous and realistic benchmark dataset, setting a new standard for low-light video denoising.

[222] Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors

Xiangchen Wang,Jinrui Zhang,Teng Wang,Haigang Zhang,Feng Zheng

Main category: cs.CV

TL;DR: LangDC uses a language-aware dynamic compression strategy to cut the computation of video processing while maintaining performance.

Details

Motivation: Existing video token compression methods use a fixed compression ratio and cannot adapt to differences in semantic density across video clips, leading to information loss or wasted computation. Method: LangDC is proposed; it uses a lightweight language model to generate soft caption tokens as visual representations and is trained with semantic density-aware supervision to adjust the compression ratio dynamically. Result: LangDC reduces FLOPs by 49% compared to VideoGPT+, achieving more efficient video understanding while remaining competitive. Conclusion: LangDC compresses video visual tokens efficiently; its dynamic adjustment based on semantic density significantly reduces computation while maintaining good performance. Abstract: Recent advancements in large video-language models have revolutionized video understanding tasks. However, their efficiency is significantly constrained by processing high volumes of visual tokens. Existing token compression strategies apply a fixed compression ratio, ignoring the variability in semantic density among different video clips. Consequently, this leads to inadequate representation of information-rich clips due to insufficient tokens and unnecessary computation on static or content-poor ones. To address this, we propose LangDC, a Language-aware Dynamic Token Compressor. LangDC leverages a lightweight language model to describe video clips, converting them into soft caption tokens as visual representations. Trained with our proposed semantic density-aware supervision, LangDC aims to 1) cover key visual cues necessary for downstream task reasoning and 2) dynamically adjust compression ratios based on scene richness, reflected by description length. Our design mimics how humans dynamically express what they see: complex scenes (seeing more) elicit more detailed language to convey nuances (saying more), whereas simpler scenes are described with fewer words. Experimental results show that our method reduces FLOPs by 49% compared to VideoGPT+ while maintaining competitive performance. Furthermore, qualitative results demonstrate our approach adaptively adjusts the token compression ratio based on video segment richness.

[223] Towards Integrating Multi-Spectral Imaging with Gaussian Splatting

Josef Grün,Lukas Meyer,Maximilian Weiherer,Bernhard Egger,Marc Stamminger,Linus Franke

Main category: cs.CV

Details

Abstract: We present a study of how to integrate color (RGB) and multi-spectral imagery (red, green, red-edge, and near-infrared) into the 3D Gaussian Splatting (3DGS) framework, a state-of-the-art explicit radiance-field-based method for fast and high-fidelity 3D reconstruction from multi-view images. While 3DGS excels on RGB data, naive per-band optimization of additional spectra yields poor reconstructions due to inconsistently appearing geometry in the spectral domain. This problem is prominent, even though the actual geometry is the same, regardless of spectral modality. To investigate this, we evaluate three strategies: 1) Separate per-band reconstruction with no shared structure. 2) Splitting optimization, in which we first optimize RGB geometry, copy it, and then fit each new band to the model by optimizing both geometry and band representation. 3) Joint, in which the modalities are jointly optimized, optionally with an initial RGB-only phase. We showcase through quantitative metrics and qualitative novel-view renderings on multi-spectral datasets the effectiveness of our dedicated optimized Joint strategy, increasing overall spectral reconstruction as well as enhancing RGB results through spectral cross-talk. We therefore suggest integrating multi-spectral data directly into the spherical harmonics color components to compactly model each Gaussian's multi-spectral reflectance. Moreover, our analysis reveals several key trade-offs in when and how to introduce spectral bands during optimization, offering practical insights for robust multi-modal 3DGS reconstruction.

[224] Weather-Dependent Variations in Driver Gaze Behavior: A Case Study in Rainy Conditions

Ghazal Farhani,Taufiq Rahman,Dominique Charlebois

Main category: cs.CV

TL;DR: This study analyzes how drivers' gaze behavior changes in rainy conditions, providing insights to improve driver monitoring and assistance systems.

Details

Motivation: Rainy weather increases road accident risks, making it important to understand how drivers adapt their visual perception to improve driver monitoring and assistance systems. Method: A two-step clustering approach was used to analyze gaze behavior, including clustering gaze points within 10-second intervals and aggregating cluster centroids into meta-clusters. Markov transition matrices and metrics like fixation duration, gaze elevation, and azimuth distributions were also examined. Result: The study found that under rainy conditions, drivers exhibit more frequent dashboard glances, longer fixation durations, and higher gaze elevation, while maintaining consistent road focus and mirror checks. Conclusion: The study concludes that analyzing drivers' gaze behavior in rainy conditions provides insights into increased cognitive focus, which can be leveraged to improve ADAS and DMS systems. Abstract: Rainy weather significantly increases the risk of road accidents due to reduced visibility and vehicle traction. Understanding how experienced drivers adapt their visual perception through gaze behavior under such conditions is critical for designing robust driver monitoring systems (DMS) and for informing advanced driver assistance systems (ADAS). This case study investigates the eye gaze behavior of a driver operating the same highway route under both clear and rainy conditions. To this end, gaze behavior was analyzed by a two-step clustering approach: first, clustering gaze points within 10-second intervals, and then aggregating cluster centroids into meta-clusters. This, along with Markov transition matrices and metrics such as fixation duration, gaze elevation, and azimuth distributions, reveals meaningful behavioral shifts. 
While the overall gaze behavior focused on the road with occasional mirror checks remains consistent, rainy conditions lead to more frequent dashboard glances, longer fixation durations, and higher gaze elevation, indicating increased cognitive focus. These findings offer valuable insight into visual attention patterns under adverse conditions and highlight the potential of leveraging gaze modeling to aid in the design of more robust ADAS and DMS.
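
The Markov transition matrices mentioned in this entry can be estimated from a time-ordered sequence of gaze-region labels by counting consecutive pairs and normalizing each row. The sketch below is a minimal illustration under that assumption; the region names ("road", "mirror", "dash") and the function name are hypothetical, and the study's actual meta-cluster labels are not given in the abstract.

```python
from collections import Counter


def transition_matrix(labels, states):
    """Row-normalized first-order transition matrix.

    Entry [i][j] estimates P(next == states[j] | current == states[i]).
    Rows for states with no outgoing transitions are left as all zeros.
    """
    counts = Counter(zip(labels, labels[1:]))  # consecutive label pairs
    matrix = []
    for src in states:
        row_total = sum(counts[(src, dst)] for dst in states)
        matrix.append(
            [counts[(src, dst)] / row_total if row_total else 0.0 for dst in states]
        )
    return matrix
```

Comparing the matrix for a clear-weather drive with the one for a rainy drive would show, for instance, whether the probability of moving from the road region to the dashboard region increases, which is the kind of shift the entry reports.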

[225] AI-driven Dispensing of Coral Reseeding Devices for Broad-scale Restoration of the Great Barrier Reef

Scarlett Raine,Benjamin Moshirian,Tobias Fischer

Main category: cs.CV

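TL;DR: The study proposes a coral reef restoration approach based on artificial intelligence, computer vision, and robotics that increases the efficiency and range of restoration work.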

Details

Motivation: Coral reefs are on the brink of collapse; climate change, ocean acidification, and pollution are projected to cause a 70-90% loss of coral species within the next decade. Method: Artificial intelligence, computer vision, and robotics are used to perform automated substrate classification and to deploy coral re-seeding devices. Result: Real-world testing on the Great Barrier Reef achieved 77.8% deployment accuracy, 89.1% sub-image patch classification accuracy, and real-time model inference at 5.5 frames per second. Conclusion: Automated deployment of coral re-seeding devices has great potential to increase the efficiency and range of reef restoration, and the work contributes valuable data for future research. Abstract: Coral reefs are on the brink of collapse, with climate change, ocean acidification, and pollution leading to a projected 70-90% loss of coral species within the next decade. Restoration efforts are crucial, but their success hinges on introducing automation to upscale efforts. We present automated deployment of coral re-seeding devices powered by artificial intelligence, computer vision, and robotics. Specifically, we perform automated substrate classification, enabling detection of areas of the seafloor suitable for coral growth, thus significantly reducing reliance on human experts and increasing the range and efficiency of restoration. Real-world testing of the algorithms on the Great Barrier Reef leads to deployment accuracy of 77.8%, sub-image patch classification of 89.1%, and real-time model inference at 5.5 frames per second. Further, we present and publicly contribute a large collection of annotated substrate image data to foster future research in this area.

[226] CompSlider: Compositional Slider for Disentangled Multiple-Attribute Image Generation

Zixin Zhu,Kevin Duarte,Mamshad Nayeem Rizve,Chengyuan Xu,Ratheesh Kalarot,Junsong Yuan

Main category: cs.CV

TL;DR: CompSlider offers a solution for precise and independent control of multiple attributes in text-to-image generation by disentangling attributes in the latent space without retraining the model.

Details

Motivation: The challenge of achieving fine-grained, independent control over multiple image attributes in text-to-image generation, despite detailed prompts, motivates this study. Existing slider-based methods suffer from attribute interference due to separate training of adapters. Method: The authors proposed CompSlider, which generates a conditional prior for the T2I model to control multiple attributes simultaneously, using novel disentanglement and structure losses. It operates in the latent space of the conditional prior. Result: CompSlider enables more reliable and independent manipulation of multiple attributes in image generation, maintaining structural consistency and extending its applicability to video generation. Conclusion: CompSlider allows for the simultaneous and independent control of multiple image attributes in text-to-image generation, reducing interference and computational burden without retraining the foundation model. Abstract: In text-to-image (T2I) generation, achieving fine-grained control over attributes - such as age or smile - remains challenging, even with detailed text prompts. Slider-based methods offer a solution for precise control of image attributes. Existing approaches typically train an individual adapter for each attribute separately, overlooking the entanglement among multiple attributes. As a result, interference occurs among different attributes, preventing precise control of multiple attributes together. To address this challenge, we aim to disentangle multiple attributes in slider-based generation to enable more reliable and independent attribute manipulation. Our approach, CompSlider, can generate a conditional prior for the T2I foundation model to control multiple attributes simultaneously. Furthermore, we introduce novel disentanglement and structure losses to compose multiple attribute changes while maintaining structural consistency within the image. 
Since CompSlider operates in the latent space of the conditional prior and does not require retraining the foundation model, it reduces the computational burden for both training and inference. We evaluate our approach on a variety of image attributes and highlight its generality by extending to video generation.

[227] Seeing through Unclear Glass: Occlusion Removal with One Shot

Qiang Li,Yuanming Cao

Main category: cs.CV

TL;DR: This paper introduces an all-in-one model with one-shot test-time adaptation to restore images degraded by various real-world glass contaminants, outperforming existing methods.

Details

Motivation: The motivation is to address the challenge of restoring images degraded by a wide range of contaminants on glass surfaces, which existing methods mainly focused on synthetic data or specific contaminants like raindrops. Method: The method involves a self-supervised auxiliary learning task to update the trained model for the unique occlusion type of each test image, facilitating real-time adaptation for different contaminants. Result: Experimental results show that the proposed method outperforms state-of-the-art methods quantitatively and qualitatively in cleaning realistic contaminated images, especially unseen ones. Conclusion: The paper concludes that their proposed all-in-one model with a one-shot test-time adaptation mechanism effectively restores images taken through glasses contaminated by various occluders, outperforming state-of-the-art methods quantitatively and qualitatively. Abstract: Images taken through window glass are often degraded by contaminants adhered to the glass surfaces. Such contaminants cause occlusions that attenuate the incoming light and scatter stray light towards the camera. Most existing deep learning methods for neutralizing the effects of contaminated glasses relied on synthetic training data. Few researchers used real degraded and clean image pairs, but they only considered removing or alleviating the effects of rain drops on glasses. This paper is concerned with the more challenging task of learning the restoration of images taken through glasses contaminated by a wide range of occluders, including muddy water, dirt and other small foreign particles found in reality. To facilitate the learning task we have gone to great lengths to acquire real paired images with and without glass contaminants. More importantly, we propose an all-in-one model to neutralize contaminants of different types by utilizing the one-shot test-time adaptation mechanism. 
It involves a self-supervised auxiliary learning task to update the trained model for the unique occlusion type of each test image. Experimental results show that the proposed method outperforms the state-of-the-art methods quantitatively and qualitatively in cleaning realistic contaminated images, especially the unseen ones.

[228] A Unified Low-level Foundation Model for Enhancing Pathology Image Quality

Ziyi Liu,Zhe Xu,Jiabo Ma,Wenqaing Li,Junlin Hou,Fuxiang Huang,Xi Wang,Ronald Cheong Kin Chan,Terence Tsz Wai Wong,Hao Chen

Main category: cs.CV

TL;DR: The paper introduces LPFM, a unified low-level pathology foundation model that enhances image quality through restoration and translation tasks, outperforming state-of-the-art methods in most cases.

Details

Motivation: Despite the success of foundation models in high-level diagnostic tasks, low-level image enhancement in pathology remains challenging due to issues like noise, blur, low resolution, and variability in staining. Current methods lack versatility to address these diverse challenges. Method: LPFM utilizes a contrastive pre-trained encoder to learn stain-invariant feature representations and employs a unified conditional diffusion process that adapts to specific tasks via textual prompts. Result: LPFM achieves statistically significant improvements (p<0.01) over state-of-the-art methods in most evaluated tasks (56/66), with PSNR gains of 10-15% for image restoration and SSIM improvements of 12-18% for virtual staining. Conclusion: LPFM demonstrates significant improvements over state-of-the-art methods in image restoration and virtual staining tasks, showing its versatility and adaptability in handling diverse low-level vision challenges in pathology. Abstract: Foundation models have revolutionized computational pathology by achieving remarkable success in high-level diagnostic tasks, yet the critical challenge of low-level image enhancement remains largely unaddressed. Real-world pathology images frequently suffer from degradations such as noise, blur, and low resolution due to slide preparation artifacts, staining variability, and imaging constraints, while the reliance on physical staining introduces significant costs, delays, and inconsistency. Although existing methods target individual problems like denoising or super-resolution, their task-specific designs lack the versatility to handle the diverse low-level vision challenges encountered in practice. 
To bridge this gap, we propose the first unified Low-level Pathology Foundation Model (LPFM), capable of enhancing image quality in restoration tasks, including super-resolution, deblurring, and denoising, as well as facilitating image translation tasks like virtual staining (H&E and special stains), all through a single adaptable architecture. Our approach introduces a contrastive pre-trained encoder that learns transferable, stain-invariant feature representations from 190 million unlabeled pathology images, enabling robust identification of degradation patterns. A unified conditional diffusion process dynamically adapts to specific tasks via textual prompts, ensuring precise control over output quality. Trained on a curated dataset of 87,810 whole slide images (WSIs) across 34 tissue types and 5 staining protocols, LPFM demonstrates statistically significant improvements (p<0.01) over state-of-the-art methods in most tasks (56/66), achieving Peak Signal-to-Noise Ratio (PSNR) gains of 10-15% for image restoration and Structural Similarity Index Measure (SSIM) improvements of 12-18% for virtual staining.

[229] SpectMamba: Integrating Frequency and State Space Models for Enhanced Medical Image Detection

Yao Wang,Dong Yang,Zhi Qiao,Wenjian Huang,Liuzhi Yang,Zhen Qian

Main category: cs.CV

TL;DR: SpectMamba is a novel Mamba-based architecture for medical image detection that improves global context capture and long-range dependencies through innovative modules and techniques.

Details

Motivation: The motivation is to overcome the limitations of CNNs and Transformers in medical imaging tasks, particularly regarding receptive field constraints and computational costs. Method: SpectMamba employs a Hybrid Spatial-Frequency Attention (HSFA) block and a Visual State-Space Module (VSSM) with Hilbert Curve Scanning to enhance spatial and frequency-domain feature learning. Result: SpectMamba achieves state-of-the-art performance in medical image detection tasks, proving to be both effective and efficient. Conclusion: SpectMamba is a promising solution for abnormality detection in medical imaging that effectively captures global context and improves long-range dependencies. Abstract: Abnormality detection in medical imaging is a critical task requiring both high efficiency and accuracy to support effective diagnosis. While convolutional neural networks (CNNs) and Transformer-based models are widely used, both face intrinsic challenges: CNNs have limited receptive fields, restricting their ability to capture broad contextual information, and Transformers encounter prohibitive computational costs when processing high-resolution medical images. Mamba, a recent innovation in natural language processing, has gained attention for its ability to process long sequences with linear complexity, offering a promising alternative. Building on this foundation, we present SpectMamba, the first Mamba-based architecture designed for medical image detection. A key component of SpectMamba is the Hybrid Spatial-Frequency Attention (HSFA) block, which separately learns high- and low-frequency features. This approach effectively mitigates the loss of high-frequency information caused by frequency bias and correlates frequency-domain features with spatial features, thereby enhancing the model's ability to capture global context. 
To further improve long-range dependencies, we propose the Visual State-Space Module (VSSM) and introduce a novel Hilbert Curve Scanning technique to strengthen spatial correlations and local dependencies, further optimizing the Mamba framework. Comprehensive experiments show that SpectMamba achieves state-of-the-art performance while being both effective and efficient across various medical image detection tasks.
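
A Hilbert curve visits every cell of a 2^k x 2^k grid so that consecutive positions along the curve are always spatial neighbors, which is why it is attractive for flattening an image into a sequence. The sketch below uses the standard index-to-coordinate conversion for the classic Hilbert curve; how SpectMamba applies the scan to feature maps is not detailed in the abstract, so the function names and the scan-order helper are assumptions for illustration.

```python
def hilbert_d2xy(order, d):
    """Map position d along a Hilbert curve to (x, y) on a 2**order square grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the current quadrant so sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_scan_order(order):
    """List of (x, y) grid cells in the order a Hilbert scan would visit them."""
    n = 1 << order
    return [hilbert_d2xy(order, d) for d in range(n * n)]
```

Unlike a row-by-row raster scan, which jumps from the end of one row to the start of the next, every step of this ordering moves to an adjacent cell, so nearby patches stay close together in the resulting sequence.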

[230] Bidirectional Sparse Attention for Faster Video Diffusion Training

Chenlu Zhan,Wen Li,Chuyu Shen,Jun Zhang,Suhui Wu,Hao Zhang

Main category: cs.CV

TL;DR: The paper proposes a Bidirectional Sparse Attention (BSA) framework that improves the training and inference efficiency of video diffusion Transformer (DiT) models while maintaining or enhancing generative quality.

Details

Motivation: Video DiT models face computational bottlenecks due to the quadratic complexity of full attention, which leads to high training and inference costs. Method: BSA dynamically sparsifies both Queries and Key-Value pairs within 3D full attention using semantic similarity and a dynamic spatial-time training strategy for Query sparsity, and a statistical dynamic threshold for KV sparsity. Result: BSA reduces FLOPs by up to 20x and achieves 17.79x faster attention training across long sequences. Conclusion: BSA maintains or improves generative quality while significantly enhancing training and inference efficiency of video DiT models. Abstract: Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high training and inference costs. Full attention inefficiency stems from two key challenges: excessive computation due to the inherent sparsity of Queries and Key-Value pairs, and redundant computation as fixed sparse patterns fail to leverage DiT's dynamic attention. To overcome this limitation, we propose a Bidirectional Sparse Attention (BSA) framework for faster video DiT training, the first to dynamically sparsify both Queries and Key-Value pairs within 3D full attention, thereby substantially improving training and inference efficiency. BSA addresses these issues through two key components. Query sparsity is optimized by selecting the most informative query tokens via semantic similarity and with a dynamic spatial-time training strategy, while KV sparsity is achieved by computing a statistical dynamic threshold to retain only the most salient KV blocks for computation. 
Extensive experiments demonstrate that BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention training, while preserving or even surpassing the generative quality of full attention.
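
The abstract does not say which statistic defines the KV threshold, so the following is only a minimal sketch of the general idea: keep the KV blocks whose saliency score reaches a data-dependent cutoff. The function name `select_kv_blocks`, the per-block scores, and the `mean + k * std` rule are all assumptions made for illustration, not the authors' formulation.

```python
from statistics import mean, pstdev


def select_kv_blocks(block_scores: list[float], k: float = 0.0) -> list[int]:
    """Return indices of KV blocks whose score is at least mean + k * std.

    The mean/std rule is an illustrative assumption; BSA's actual
    statistic is not specified in the abstract.
    """
    threshold = mean(block_scores) + k * pstdev(block_scores)
    kept = [i for i, score in enumerate(block_scores) if score >= threshold]
    if not kept:
        # Fall back to the single best block so attention is never empty.
        kept = [max(range(len(block_scores)), key=block_scores.__getitem__)]
    return kept


# Hypothetical saliency scores for five KV blocks.
scores = [0.9, 0.1, 0.05, 0.8, 0.15]
print(select_kv_blocks(scores))
```

Because the cutoff is recomputed from each score distribution, the number of retained blocks varies per query rather than following a fixed sparse pattern.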

[231] An End-to-End Framework for Video Multi-Person Pose Estimation

Zhihong Wei

Main category: cs.CV

TL;DR: VEPE is an end-to-end video pose estimation framework that improves performance and inference speed by leveraging spatio-temporal Transformers and an instance consistency mechanism.

Details

Motivation: Existing video-based human pose estimation approaches separate spatial and temporal dimensions, relying on detectors and post-processing steps that reduce efficiency and effectiveness. A more integrated and end-to-end approach is needed to better utilize spatio-temporal context. Method: VEPE utilizes three spatio-temporal Transformer components: the Spatio-Temporal Pose Encoder (STPE), the Spatio-Temporal Deformable Memory Encoder (STDME), and the Spatio-Temporal Pose Decoder (STPD). It also incorporates an instance consistency mechanism to enhance cross-frame pose query matching. Result: Experiments on the Posetrack dataset show that VEPE outperforms most two-stage models and significantly improves inference efficiency. Conclusion: The proposed VEPE framework demonstrates superior performance over most two-stage models and improves inference efficiency by 300%. Abstract: Video-based human pose estimation models aim to address scenarios that static image models cannot handle effectively, such as motion blur, defocus, and occlusion. Most existing approaches consist of two stages: detecting human instances in each image frame and then using a temporal model for single-person pose estimation. This approach separates the spatial and temporal dimensions and cannot capture the global spatio-temporal context between spatial instances for end-to-end optimization. In addition, it relies on separate detectors and complex post-processing such as RoI cropping and NMS, which reduces inference efficiency on video. To address these problems, we propose VEPE (Video End-to-End Pose Estimation), a simple and flexible framework for end-to-end pose estimation in video. The framework utilizes three crucial spatio-temporal Transformer components: the Spatio-Temporal Pose Encoder (STPE), the Spatio-Temporal Deformable Memory Encoder (STDME), and the Spatio-Temporal Pose Decoder (STPD). 
These components are designed to effectively utilize temporal context for optimizing human body pose estimation. Furthermore, to reduce the mismatch problem during the cross-frame pose query matching process, we propose an instance consistency mechanism, which aims to enhance the consistency and discrepancy of the cross-frame instance query and realize the instance tracking function, which in turn accurately guides the pose query to perform cross-frame matching. Extensive experiments on the Posetrack dataset show that our approach outperforms most two-stage models and improves inference efficiency by 300%.

[232] PVINet: Point-Voxel Interlaced Network for Point Cloud Compression

Xuan Deng,Xingtao Wang,Xiandong Meng,Xiaopeng Fan,Debin Zhao

Main category: cs.CV

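TL;DR: This paper proposes PVINet, a new point cloud compression method that extracts global and local features in parallel with a point-voxel interlaced network and lets them interact through conditional sparse convolution, improving the efficiency and quality of point cloud reconstruction.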

Details

Motivation: Existing point cloud compression methods usually process global and local information sequentially and lack effective communication between the two, which limits the quality of the reconstructed point cloud. Method: The paper proposes a point-voxel interlaced network (PVINet) with a voxel-based encoder (Ev) and a point-based encoder (Ep). A novel conditional sparse convolution dynamically customizes the kernels used for voxel feature extraction, and during decoding conditional sparse convolutions incorporate point embeddings to reconstruct the point cloud. Result: Experiments on benchmark datasets show that PVINet performs competitively with current state-of-the-art methods. Conclusion: By interlacing point and voxel networks, PVINet improves feature perception efficiency in point cloud compression and delivers performance competitive with existing state-of-the-art methods on benchmark datasets. Abstract: In point cloud compression, the quality of a reconstructed point cloud relies on both the global structure and the local context, with existing methods usually processing global and local information sequentially and lacking communication between these two types of information. In this paper, we propose a point-voxel interlaced network (PVINet), which captures global structural features and local contextual features in parallel and performs interactions at each scale to enhance feature perception efficiency. Specifically, PVINet contains a voxel-based encoder (Ev) for extracting global structural features and a point-based encoder (Ep) that models local contexts centered at each voxel. Particularly, a novel conditional sparse convolution is introduced, which applies point embeddings to dynamically customize kernels for voxel feature extraction, facilitating feature interactions from Ep to Ev. During decoding, a voxel-based decoder employs conditional sparse convolutions to incorporate point embeddings as guidance to reconstruct the point cloud. Experiments on benchmark datasets show that PVINet delivers competitive performance compared to state-of-the-art methods.

[233] FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation

Wenzhuang Wang,Yifan Zhao,Mingcan Ma,Ming Liu,Zhonglin Jiang,Yong Chen,Jia Li

Main category: cs.CV

TL;DR: FICGen improves layout-to-image generation in degraded scenes by leveraging frequency knowledge, achieving better fidelity, alignment, and trainability compared to existing methods.

Details

Motivation: Existing layout-to-image generation methods struggle with fidelity and alignment in degraded scenes due to a 'contextual illusion dilemma'. This work aims to address this issue by incorporating frequency knowledge into the generation process. Method: FICGen uses a dual-query mechanism and frequency resampling to extract contextual frequency prototypes, enhanced with visual-frequency attention and an instance coherence map to manage disentanglement and reconstruct degraded scenes. Result: FICGen outperforms current methods across five benchmarks involving various degraded scenarios, from severe low-light to mild blur. Conclusion: FICGen demonstrates superior performance in layout-to-image generation for degraded scenes, outperforming existing methods in fidelity, alignment, and auxiliary trainability. Abstract: Layout-to-image (L2I) generation has exhibited promising results in natural domains, but suffers from limited generative fidelity and weak alignment with user-provided layouts when applied to degraded scenes (e.g., low-light, underwater). We primarily attribute these limitations to the "contextual illusion dilemma" in degraded conditions, where foreground instances are overwhelmed by context-dominant frequency distributions. Motivated by this, our paper proposes a new Frequency-Inspired Contextual Disentanglement Generative (FICGen) paradigm, which seeks to transfer frequency knowledge of degraded images into the latent diffusion space, thereby facilitating the rendering of degraded instances and their surroundings via contextual frequency-aware guidance. Specifically, FICGen consists of two major steps. First, we introduce a learnable dual-query mechanism, with each query paired with a dedicated frequency resampler, to extract contextual frequency prototypes from pre-collected degraded exemplars in the training set. Second, a visual-frequency enhanced attention is employed to inject frequency prototypes into the degraded generation process. 
To alleviate the contextual illusion and attribute leakage, an instance coherence map is developed to regulate latent-space disentanglement between individual instances and their surroundings, coupled with an adaptive spatial-frequency aggregation module to reconstruct spatial-frequency mixed degraded representations. Extensive experiments on 5 benchmarks involving a variety of degraded scenarios, from severe low-light to mild blur, demonstrate that FICGen consistently surpasses existing L2I methods in terms of generative fidelity, alignment and downstream auxiliary trainability.

[234] GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

Zhengqiang Zhang,Rongyuan Wu,Lingchen Sun,Lei Zhang

Main category: cs.CV

TL;DR: GPSToken introduces a novel spatially-adaptive tokenization method using Gaussian parameterization, significantly improving image representation and generation performance.

Details

Motivation: Conventional uniform grid tokenization methods are inflexible for representing varying shapes, textures, and locations in images, limiting their feature representation efficacy. This motivates the need for a more adaptive tokenization approach. Method: The paper proposes GPSToken, a Gaussian Parameterized Spatially-adaptive Tokenization framework that uses parametric 2D Gaussians to model image regions. It partitions images using an entropy-driven algorithm, parameterizes regions as Gaussians with texture features, and employs a transformer to optimize parameters. A differentiable renderer reconstructs tokens into feature maps. Result: GPSToken achieves state-of-the-art performance with rFID and FID scores of 0.65 and 1.50, respectively, on image reconstruction and generation tasks using 128 tokens. Conclusion: The paper concludes that GPSToken offers an effective and efficient approach to image tokenization, achieving state-of-the-art results in image reconstruction and generation. Abstract: Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose GPSToken, a novel Gaussian Parameterized Spatially-adaptive Tokenization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. 
A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively. Codes and models of GPSToken can be found at https://github.com/xtudbxk/GPSToken.

[235] MetaSSL: A General Heterogeneous Loss for Semi-Supervised Medical Image Segmentation

Weiren Zhao,Lanfeng Zhong,Xin Liao,Wenjun Liao,Sichuan Zhang,Shaoting Zhang,Guotai Wang

Main category: cs.CV

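TL;DR: This paper proposes MetaSSL, a general semi-supervised learning framework that improves medical image segmentation with a spatially heterogeneous loss, effectively exploiting the information in the reference and supervised predictions and achieving significant experimental gains.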

Details

Motivation: Existing semi-supervised learning methods (such as Mean Teacher, FixMatch, and CPS) focus mainly on strategies for generating the reference prediction, overlooking noise in the labeled data and the heterogeneous value of different unlabeled pixels. The paper therefore proposes to mine the information in the reference and supervised predictions through the loss function to improve semi-supervised learning. Method: MetaSSL is built on a spatially heterogeneous loss that splits predictions on unlabeled data into four regions (UC, US, DC, DS) and assigns pixels different weights based on consistency and uncertainty information, while also accounting for potential annotation noise in the labeled data. Result: Experiments show that integrating MetaSSL into existing SSL frameworks significantly improves segmentation performance on different datasets. Conclusion: MetaSSL is a general, plug-and-play semi-supervised learning framework that uses a spatially heterogeneous loss to mine the information in the reference and supervised predictions, significantly improving medical image segmentation. Abstract: Semi-Supervised Learning (SSL) is important for reducing the annotation cost for medical image segmentation models. State-of-the-art SSL methods such as Mean Teacher, FixMatch and Cross Pseudo Supervision (CPS) are mainly based on consistency regularization or pseudo-label supervision between a reference prediction and a supervised prediction. Despite their effectiveness, they have overlooked the potential noise in the labeled data, and mainly focus on strategies to generate the reference prediction, while ignoring the heterogeneous values of different unlabeled pixels. We argue that effectively mining the rich information contained in the two predictions through the loss function, rather than the specific strategy used to obtain a reference prediction, is more essential for SSL, and propose a universal framework MetaSSL based on a spatially heterogeneous loss that assigns different weights to pixels by simultaneously leveraging the uncertainty and consistency information between the reference and supervised predictions. Specifically, we split the predictions on unlabeled data into four regions with decreasing weights in the loss: Unanimous and Confident (UC), Unanimous and Suspicious (US), Discrepant and Confident (DC), and Discrepant and Suspicious (DS), where an adaptive threshold is proposed to distinguish confident predictions from suspicious ones. The heterogeneous loss is also applied to labeled images for robust learning considering the potential annotation noise. Our method is plug-and-play and general to most existing SSL methods. 
The experimental results showed that it improved the segmentation performance significantly when integrated with existing SSL frameworks on different datasets. Code is available at https://github.com/HiLab-git/MetaSSL.
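
The four-region split described in the abstract can be illustrated with a minimal per-pixel sketch for binary segmentation. Only the region names and the decreasing order of their weights come from the abstract. The function `region_weight`, the fixed threshold `tau` (the paper uses an adaptive one), the way confidence is combined across the two predictions, and the weight values are hypothetical choices.

```python
def region_weight(
    p_ref: float,
    p_sup: float,
    tau: float = 0.8,
    weights: tuple[float, float, float, float] = (1.0, 0.5, 0.25, 0.0),
) -> tuple[str, float]:
    """Classify one pixel into UC / US / DC / DS and return its loss weight.

    p_ref and p_sup are foreground probabilities from the reference and
    supervised predictions. tau and weights are illustrative assumptions.
    """
    # Unanimous: both predictions give the same hard label.
    unanimous = (p_ref >= 0.5) == (p_sup >= 0.5)
    # Confident: the less certain of the two predictions still reaches tau.
    confidence = min(max(p_ref, 1.0 - p_ref), max(p_sup, 1.0 - p_sup))
    confident = confidence >= tau

    if unanimous and confident:
        return "UC", weights[0]
    if unanimous:
        return "US", weights[1]
    if confident:
        return "DC", weights[2]
    return "DS", weights[3]


for pair in [(0.95, 0.90), (0.60, 0.70), (0.95, 0.10), (0.60, 0.40)]:
    print(pair, region_weight(*pair))
```

In a full loss, each pixel's term would be multiplied by the returned weight, so unanimous and confident pixels dominate the gradient while discrepant and suspicious ones contribute little or nothing.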

[236] MVTrajecter: Multi-View Pedestrian Tracking with Trajectory Motion Cost and Trajectory Appearance Cost

Taiga Yamane,Ryo Masumura,Satoshi Suzuki,Shota Orihashi

Main category: cs.CV

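TL;DR: MVTrajecter is a novel end-to-end multi-view pedestrian tracking method that uses information from multiple timestamps; by introducing a trajectory motion cost, a trajectory appearance cost, and an attention mechanism, it improves tracking accuracy.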

Details

Motivation: Pedestrian motion and appearance information is important for association, but previous end-to-end MVPT methods rely only on the current timestamp and its single adjacent past timestamp, discarding earlier trajectory information. Method: The paper proposes MVTrajecter, a new end-to-end MVPT method that uses information from multiple past timestamps for pedestrian tracking. Result: Experiments demonstrate the effectiveness of each component of MVTrajecter and show that it outperforms previous state-of-the-art methods. Conclusion: By introducing a trajectory motion cost and a trajectory appearance cost, and by using an attention mechanism, MVTrajecter makes effective use of information from multiple timestamps and improves the accuracy of multi-view pedestrian tracking. Abstract: Multi-View Pedestrian Tracking (MVPT) aims to track pedestrians in the form of a bird's eye view occupancy map from multi-view videos. End-to-end methods that detect and associate pedestrians within one model have shown great progress in MVPT. The motion and appearance information of pedestrians is important for the association, but previous end-to-end MVPT methods rely only on the current timestamp and its single adjacent past timestamp, discarding the past trajectories before that. This paper proposes a novel end-to-end MVPT method called Multi-View Trajectory Tracker (MVTrajecter) that utilizes information from multiple timestamps in past trajectories for robust association. MVTrajecter introduces trajectory motion cost and trajectory appearance cost to effectively incorporate motion and appearance information, respectively. These costs calculate which pedestrians at the current and each past timestamp are likely identical based on the information between those timestamps. Even if a current pedestrian could be associated with a false pedestrian at some past timestamp, these costs enable the model to associate that current pedestrian with the correct past trajectory based on other past timestamps. In addition, MVTrajecter effectively captures the relationships between multiple timestamps by leveraging the attention mechanism. Extensive experiments demonstrate the effectiveness of each component in MVTrajecter and show that it outperforms the previous state-of-the-art methods.
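
The robustness argument in the abstract (one misleading past timestamp can be outvoted by the others) can be shown with a toy sketch. The abstract does not say how the costs are combined, so averaging per-timestamp costs and the function name `best_trajectory` are assumptions for illustration only; the model itself relates timestamps with attention.

```python
from statistics import mean


def best_trajectory(costs_per_timestamp: list[dict[str, float]]) -> str:
    """Pick the past trajectory with the lowest cost averaged over timestamps.

    Each dict maps a trajectory id to the association cost between one
    current pedestrian and that trajectory at one past timestamp.
    Plain averaging is an illustrative assumption.
    """
    trajectory_ids = costs_per_timestamp[0].keys()
    averaged = {
        tid: mean(step[tid] for step in costs_per_timestamp)
        for tid in trajectory_ids
    }
    return min(averaged, key=averaged.get)


# Hypothetical costs for one current pedestrian against trajectories A and B.
costs = [
    {"A": 0.9, "B": 0.2},  # t-1: misleading, favors the wrong trajectory B
    {"A": 0.1, "B": 0.8},  # t-2
    {"A": 0.1, "B": 0.9},  # t-3
]
print(best_trajectory(costs[:1]))  # single adjacent timestamp only
print(best_trajectory(costs))      # multiple past timestamps
```

With only the adjacent timestamp the pedestrian is matched to B, while using all three timestamps recovers A, which mirrors the failure mode the paper attributes to single-timestamp methods.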

[237] Do Video Language Models Really Know Where to Look? Diagnosing Attention Failures in Video Language Models

Hyunjong Ok,Jaeho Lee

Main category: cs.CV

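TL;DR: The study shows that current vision encoders fall short at identifying keyframes in video and may need to be improved to make MLLMs more efficient.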

Details

Motivation: Existing MLLMs use keyframe sampling in video understanding tasks to reduce computational cost, but it remains unclear whether these methods accurately identify the most informative frames. Method: The paper empirically analyzes how well existing vision encoders perform keyframe sampling in video understanding tasks. Result: Existing vision encoders are found to be limited in identifying keyframes, which affects how MLLMs handle video and text queries. Conclusion: Popular vision encoders have limited ability to identify the frames that a multimodal large language model (MLLM) should attend to in a video, so better keyframe identification techniques may be needed. Abstract: Recent advances in multimodal large language models (MLLMs) have led to much progress in video understanding tasks. To avoid the heavy computational cost of processing all frames, these models typically rely on keyframe sampling methods guided by vision-language encoders (e.g., SigLIP). However, it remains unclear whether such encoders can truly identify the most informative frames. In this work, we provide several empirical pieces of evidence revealing that popular vision encoders critically suffer from their limited capability to identify where the MLLM should look inside the video to handle the given textual query appropriately. Our findings suggest that the development of better keyframe identification techniques may be necessary for efficient video MLLMs.

[238] DynaMind: Reconstructing Dynamic Visual Scenes from EEG by Aligning Temporal Dynamics and Multimodal Semantics to Guided Diffusion

Junxiang Liu,Junming Lin,Jiangtong Li,Jie Li

Main category: cs.CV

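TL;DR: DynaMind is a new framework that reconstructs video from EEG signals by jointly modeling neural dynamics and semantic features, addressing the failure of existing methods to capture the dynamic coherence and complex semantic context of perceived visual stimuli.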

Details

Motivation: Reconstructing dynamic visual scenes remains a major challenge in brain decoding because of the low spatial resolution of EEG signals, the temporal mismatch between neural recordings and video dynamics, and the insufficient use of semantic information in brain activity. Method: The DynaMind framework comprises three core modules: a Regional-aware Semantic Mapper (RSM), a Temporal-aware Dynamic Aligner (TDA), and a Dual-Guidance Video Reconstructor (DGVR). Result: On the SEED-DV dataset, DynaMind sets a new state of the art (SOTA), improving reconstructed video accuracy (video-based and frame-based) by 12.5 and 10.3 percentage points, respectively. Conclusion: By combining neural dynamics and semantic features, DynaMind improves the quality of video reconstructed from EEG signals, narrowing the gap between neural dynamics and high-fidelity visual semantics. Abstract: Reconstructing dynamic visual scenes from electroencephalography (EEG) signals remains a primary challenge in brain decoding, limited by the low spatial resolution of EEG, a temporal mismatch between neural recordings and video dynamics, and the insufficient use of semantic information within brain activity. As a result, existing methods often inadequately resolve both the dynamic coherence and the complex semantic context of the perceived visual stimuli. To overcome these limitations, we introduce DynaMind, a novel framework that reconstructs video by jointly modeling neural dynamics and semantic features via three core modules: a Regional-aware Semantic Mapper (RSM), a Temporal-aware Dynamic Aligner (TDA), and a Dual-Guidance Video Reconstructor (DGVR). The RSM first utilizes a regional-aware encoder to extract multimodal semantic features from EEG signals across distinct brain regions, aggregating them into a unified diffusion prior. Meanwhile, the TDA generates a dynamic latent sequence, or blueprint, to enforce temporal consistency between the feature representations and the original neural recordings. Finally, guided by the semantic diffusion prior, the DGVR translates the temporal-aware blueprint into a high-fidelity video reconstruction. On the SEED-DV dataset, DynaMind sets a new state-of-the-art (SOTA), boosting reconstructed video accuracies (video- and frame-based) by 12.5 and 10.3 percentage points, respectively. It also achieves a leap in pixel-level quality, showing exceptional visual fidelity and temporal coherence with a 9.4% SSIM improvement and a 19.7% FVMD reduction. 
This marks a critical advancement, bridging the gap between neural dynamics and high-fidelity visual semantics.

[239] FocusDPO: Dynamic Preference Optimization for Multi-Subject Personalized Image Generation via Adaptive Focus

Qiaoqiao Jin,Siming Fu,Dong She,Weinan Jia,Hualiang Wang,Mu Liu,Jidong Jiang

Main category: cs.CV

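TL;DR: FocusDPO dynamically adjusts the focus regions used during training, substantially improving multi-subject personalized image generation; it enables fine-grained independent control over multiple subjects and addresses subject fidelity and attribute leakage.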

Details

Motivation: Multi-subject personalized image generation must synthesize customized images containing multiple specified subjects without test-time optimization, yet preserving subject fidelity and preventing cross-subject attribute leakage remain challenging. Method: FocusDPO adaptively identifies focus regions based on dynamic semantic correspondence and the complexity of the reference images, progressively adjusts these regions during training, and applies a weighting strategy that rewards information-rich regions while penalizing regions with low prediction confidence. Result: Experiments show that the method substantially improves existing pre-trained personalized generation models, achieving state-of-the-art results on both single-subject and multi-subject personalized image synthesis benchmarks while reducing attribute leakage and preserving subject fidelity. Conclusion: FocusDPO offers a new solution for controllable multi-subject image synthesis and advances the state of the field. Abstract: Multi-subject personalized image generation aims to synthesize customized images containing multiple specified subjects without requiring test-time optimization. However, achieving fine-grained independent control over multiple subjects remains challenging due to difficulties in preserving subject fidelity and preventing cross-subject attribute leakage. We present FocusDPO, a framework that adaptively identifies focus regions based on dynamic semantic correspondence and supervision image complexity. During training, our method progressively adjusts these focal areas across noise timesteps, implementing a weighted strategy that rewards information-rich patches while penalizing regions with low prediction confidence. The framework dynamically adjusts focus allocation during the DPO process according to the semantic complexity of reference images and establishes robust correspondence mappings between generated and reference subjects. Extensive experiments demonstrate that our method substantially enhances the performance of existing pre-trained personalized generation models, achieving state-of-the-art results on both single-subject and multi-subject personalized image synthesis benchmarks. Our method effectively mitigates attribute leakage while preserving superior subject fidelity across diverse generation scenarios, advancing the frontier of controllable multi-subject image synthesis.

[240] SegAssess: Panoramic quality mapping for robust and transferable unsupervised segmentation assessment

Bingnan Yang,Mi Zhang,Zhili Zhang,Zhan Zhang,Yuanxin Zhao,Xiangyun Hu,Jianya Gong

Main category: cs.CV

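TL;DR: This paper proposes SegAssess, a new unsupervised segmentation quality assessment (SQA) framework that recasts SQA as a fine-grained four-class panoramic segmentation task to produce a complete quality map, and shows state-of-the-art performance and strong transferability across multiple datasets.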

Details

Motivation: High-quality image segmentation is essential to pixel-level geospatial analysis in remote sensing, but existing deep-learning-based unsupervised segmentation quality assessment methods suffer from coarse evaluation granularity, incomplete assessment, and poor transferability. Method: The paper introduces Panoramic Quality Mapping (PQM) as a new paradigm for comprehensive pixel-wise SQA and presents SegAssess, a new framework that realizes it. Result: Comprehensive experiments on 32 datasets derived from 6 sources show that SegAssess achieves state-of-the-art performance and significantly improves cross-domain robustness and zero-shot transferability. Conclusion: SegAssess achieves state-of-the-art performance and shows remarkable zero-shot transferability to unseen masks, providing a robust and transferable solution for unsupervised SQA. Abstract: High-quality image segmentation is fundamental to pixel-level geospatial analysis in remote sensing, necessitating robust segmentation quality assessment (SQA), particularly in unsupervised settings lacking ground truth. Although recent deep learning (DL) based unsupervised SQA methods show potential, they often suffer from coarse evaluation granularity, incomplete assessments, and poor transferability. To overcome these limitations, this paper introduces Panoramic Quality Mapping (PQM) as a new paradigm for comprehensive, pixel-wise SQA, and presents SegAssess, a novel deep learning framework realizing this approach. SegAssess distinctively formulates SQA as a fine-grained, four-class panoramic segmentation task, classifying pixels within a segmentation mask under evaluation into true positive (TP), false positive (FP), true negative (TN), and false negative (FN) categories, thereby generating a complete quality map. Leveraging an enhanced Segment Anything Model (SAM) architecture, SegAssess uniquely employs the input mask as a prompt for effective feature integration via cross-attention. Key innovations include an Edge Guided Compaction (EGC) branch with an Aggregated Semantic Filter (ASF) module to refine predictions near challenging object edges, and an Augmented Mixup Sampling (AMS) training strategy integrating multi-source masks to significantly boost cross-domain robustness and zero-shot transferability. 
Comprehensive experiments across 32 datasets derived from 6 sources demonstrate that SegAssess achieves state-of-the-art (SOTA) performance and exhibits remarkable zero-shot transferability to unseen masks, establishing PQM via SegAssess as a robust and transferable solution for unsupervised SQA. The code is available at https://github.com/Yangbn97/SegAssess.
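
The four-class quality map that SegAssess predicts has a simple definition when a reference mask is available, for example when building training targets. The sketch below computes that map from a mask under evaluation and a reference mask. It only illustrates the label definition, not the SegAssess network, which predicts the map without ground truth at test time; the function name `quality_map` and the nested-list mask format are assumptions.

```python
def quality_map(pred: list[list[int]], ref: list[list[int]]) -> list[list[str]]:
    """Label every pixel of a binary mask as TP, FP, TN, or FN.

    pred is the segmentation mask under evaluation and ref is a reference
    (ground-truth) mask of the same shape; both hold 0/1 values.
    """
    names = {(1, 1): "TP", (1, 0): "FP", (0, 0): "TN", (0, 1): "FN"}
    return [
        [names[(p, r)] for p, r in zip(pred_row, ref_row)]
        for pred_row, ref_row in zip(pred, ref)
    ]


# Tiny hypothetical 2x2 example.
pred_mask = [[1, 1], [0, 0]]
ref_mask = [[1, 0], [0, 1]]
for row in quality_map(pred_mask, ref_mask):
    print(row)
```

Because every pixel receives exactly one of the four labels, the map covers both the errors inside the mask (FP) and the missed regions outside it (FN), which is what makes the assessment "panoramic".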

[241] PrediTree: A Multi-Temporal Sub-meter Dataset of Multi-Spectral Imagery Aligned With Canopy Height Maps

Hiyam Debary,Mustansar Fiaz,Levente Klein

Main category: cs.CV

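TL;DR: PrediTree is an open-source dataset that combines high-resolution LiDAR canopy height maps with multi-temporal, multi-spectral imagery for training and evaluating tree height prediction models; experiments show strong results with a U-Net architecture.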

Details

Motivation: PrediTree fills a gap in the data needed to train and evaluate tree height prediction models for forest monitoring, aiming to advance deep learning methods for forest growth prediction. Method: An encoder-decoder framework predicts canopy height from multi-temporal, multi-spectral imagery and the relative time differences between each image and the canopy height map timestamp. Result: A U-Net trained on the PrediTree dataset gives the best masked mean squared error, 11.78%, outperforming ResNet-50 by about 12% and reducing the error by about 30% relative to the same experiments using only the red, green, and blue bands. Conclusion: PrediTree is a comprehensive open-source dataset for training and evaluating tree height prediction models; it combines high-resolution LiDAR-derived canopy height maps with multi-temporal, multi-spectral imagery and covers a variety of forest ecosystems. Abstract: We present PrediTree, the first comprehensive open-source dataset designed for training and evaluating tree height prediction models at sub-meter resolution. This dataset combines very high-resolution (0.5m) LiDAR-derived canopy height maps, spatially aligned with multi-temporal and multi-spectral imagery, across diverse forest ecosystems in France, totaling 3,141,568 images. PrediTree addresses a critical gap in forest monitoring capabilities by enabling the training of deep learning methods that can predict tree growth based on multiple past observations. To make use of the PrediTree dataset, we propose an encoder-decoder framework that takes the multi-temporal multi-spectral imagery and the relative time differences in years between the canopy height map timestamp (target) and each image acquisition date, and predicts the canopy height at the target timestamp. The conducted experiments demonstrate that a U-Net architecture trained on the PrediTree dataset provides the best masked mean squared error of 11.78%, outperforming the next-best architecture, ResNet-50, by around 12%, and cutting the error of the same experiments on fewer bands (red, green, blue only) by around 30%. This dataset is publicly available on HuggingFace, and both processing and training codebases are available on GitHub.
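
The reported metric is a masked mean squared error. As a rough illustration of what "masked" means here, the sketch below averages squared errors only over pixels flagged as valid. The function name `masked_mse`, the flat-list inputs, and the 0/1 validity mask are assumptions; the abstract does not say how the mask is built or how the value is normalized into a percentage.

```python
def masked_mse(pred: list[float], target: list[float], mask: list[int]) -> float:
    """Mean squared error over pixels where mask is non-zero.

    pred and target are flattened canopy height maps; mask marks valid
    pixels (for example, pixels with a LiDAR measurement).
    """
    squared_errors = [
        (p - t) ** 2 for p, t, m in zip(pred, target, mask) if m
    ]
    if not squared_errors:
        raise ValueError("mask selects no pixels")
    return sum(squared_errors) / len(squared_errors)


# Hypothetical heights in meters; the third pixel is masked out.
print(masked_mse([1.0, 2.0, 10.0], [1.0, 4.0, 0.0], [1, 1, 0]))
```

Masking keeps pixels without a valid height measurement from dominating the loss, as the large error on the third pixel would here.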

[242] DcMatch: Unsupervised Multi-Shape Matching with Dual-Level Consistency

Tianwei Ye,Yong Ma,Xiaoguang Mei

Main category: cs.CV

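TL;DR: This paper proposes DcMatch, an unsupervised multi-shape matching framework that uses a shape graph attention network and a cycle consistency loss to outperform existing methods on multiple benchmarks.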

Details

Motivation: Existing methods typically learn a canonical embedding from a single shape and do not consider the underlying structure of the whole shape collection, which leads to less consistent matching. Method: DcMatch, an unsupervised learning framework for non-rigid multi-shape matching based on a shape graph attention network, enforces dual-level consistency across the spatial and spectral domains through a new cycle consistency loss. Result: DcMatch builds a more expressive and robust shared latent space, obtains more consistent shape-to-universe correspondences via a universe predictor, and shows superior results on multiple benchmarks. Conclusion: Experiments show that DcMatch consistently outperforms existing state-of-the-art methods on several challenging benchmarks, and the code is open source. Abstract: Establishing point-to-point correspondences across multiple 3D shapes is a fundamental problem in computer vision and graphics. In this paper, we introduce DcMatch, a novel unsupervised learning framework for non-rigid multi-shape matching. Unlike existing methods that learn a canonical embedding from a single shape, our approach leverages a shape graph attention network to capture the underlying manifold structure of the entire shape collection. This enables the construction of a more expressive and robust shared latent space, leading to more consistent shape-to-universe correspondences via a universe predictor. Simultaneously, we represent these correspondences in both the spatial and spectral domains and enforce their alignment in the shared universe space through a novel cycle consistency loss. This dual-level consistency fosters more accurate and coherent mappings. Extensive experiments on several challenging benchmarks demonstrate that our method consistently outperforms previous state-of-the-art approaches across diverse multi-shape matching scenarios. Code is available at https://github.com/YeTianwei/DcMatch.

[243] Generalizable Self-supervised Monocular Depth Estimation with Mixture of Low-Rank Experts for Diverse Endoscopic Scenes

Liangjing Shao,Benshuang Chen,Chenkang Du,Xueli Liu,Xinrong Chen

Main category: cs.CV

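TL;DR: This paper addresses monocular depth estimation in endoscopy with an efficient self-supervised framework whose new module design and training method deliver strong performance and generalization.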

Details

Motivation: The diversity of illumination conditions and scene features in endoscopic scenes remains the main challenge for generalizable depth estimation, so an efficient self-supervised framework is needed to improve depth estimation performance. Method: The paper proposes a block-wise mixture of low-rank experts module that adaptively selects different experts for weighted inference based on the input features, and designs a new self-supervised training framework to handle inconsistent brightness and reflectance. Result: The method outperforms state-of-the-art methods on both realistic and simulated endoscopic datasets and achieves the best generalization in zero-shot depth estimation. Conclusion: The paper presents a self-supervised framework for endoscopic monocular depth estimation that, through a block-wise mixture of low-rank experts module and a new self-supervised training framework, achieves efficient depth estimation and the best generalization across different endoscopic scenes. Abstract: Self-supervised monocular depth estimation is a significant task for low-cost and efficient three-dimensional scene perception in endoscopy. The variety of illumination conditions and scene features is still the primary challenge for generalizable depth estimation in endoscopic scenes. In this work, a self-supervised framework is proposed for monocular depth estimation in various endoscopic scenes. Firstly, due to the varied features of endoscopic scenes with different tissues, a novel block-wise mixture of dynamic low-rank experts is proposed to efficiently fine-tune the foundation model for endoscopic depth estimation. In the proposed module, based on the input feature, different experts with a small number of trainable parameters are adaptively selected for weighted inference, from various mixtures of low-rank experts which are allocated based on the training quality of each block. Moreover, a novel self-supervised training framework is proposed to jointly cope with the inconsistency of brightness and reflectance. The proposed method outperforms state-of-the-art works on both realistic and simulated endoscopic datasets. Furthermore, the proposed network also achieves the best generalization based on zero-shot depth estimation on diverse endoscopic scenes. The proposed method could contribute to accurate endoscopic perception for minimally invasive measurement and surgery. The code will be released upon acceptance, while the demo video can be found here: https://endo-gede.netlify.app/.

[244] Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation

Maëlic Neau,Zoe Falomir,Cédric Buche,Akihiro Sugimoto

Main category: cs.CV

TL;DR: This paper introduces a reference-free metric and synthetic data generation method to enhance evaluation and performance of Open-Vocabulary Scene Graph Generation models.

Details

Motivation: Current SGG benchmarks have limited vocabulary, making evaluation inefficient. Additionally, Open-Vocabulary SGG suffers from reliance on low-quality, weakly supervised pre-training data. Method: The authors propose a new reference-free evaluation metric and a method for generating high-quality synthetic data via region-specific prompt tuning of VLMs. Result: Experiments show that pre-training with their synthetically generated data improves the generalization capabilities of Open-Vocabulary SGG models. Conclusion: The paper concludes that their proposed reference-free metric enables fair evaluation of open-vocabulary SGG models and that synthetic data improves generalization. Abstract: Scene Graph Generation (SGG) encodes visual relationships between objects in images as graph structures. Thanks to the advances of Vision-Language Models (VLMs), the task of Open-Vocabulary SGG has been recently proposed where models are evaluated on their functionality to learn a wide and diverse range of relations. Current benchmarks in SGG, however, possess a very limited vocabulary, making the evaluation of open-source models inefficient. In this paper, we propose a new reference-free metric to fairly evaluate the open-vocabulary capabilities of VLMs for relation prediction. Another limitation of Open-Vocabulary SGG is the reliance on weakly supervised data of poor quality for pre-training. We also propose a new solution for quickly generating high-quality synthetic data through region-specific prompt tuning of VLMs. Experimental results show that pre-training with this new data split can benefit the generalization capabilities of Open-Voc SGG models.

[245] PRINTER: Deformation-Aware Adversarial Learning for Virtual IHC Staining with In Situ Fidelity

Yizhe Yuan,Bingsen Xue,Bangzheng Pu,Chengxiang Wang,Cheng Jin

Main category: cs.CV

TL;DR: PRINTER is a weakly-supervised framework that improves virtual staining by integrating prototype-driven style transfer, cyclic registration-synthesis, and adversarial learning strategies.

Details

Motivation: Current methods for tumor spatial heterogeneity analysis suffer from spatial misalignment between H&E and IHC sections, compromising in situ pathological interpretation. Method: PRINTER integrates prototype-driven content and staining pattern decoupling, the GapBridge cyclic registration-synthesis framework, and deformation-aware adversarial learning. Result: PRINTER achieves superior performance in preserving H&E staining details and virtual staining fidelity compared to state-of-the-art methods. Conclusion: PRINTER provides a robust and scalable solution for virtual staining, advancing computational pathology. Abstract: Tumor spatial heterogeneity analysis requires precise correlation between Hematoxylin and Eosin (H&E) morphology and immunohistochemical (IHC) biomarker expression, yet current methods suffer from spatial misalignment in consecutive sections, severely compromising in situ pathological interpretation. To obtain a more accurate virtual staining pattern, we propose PRINTER, a weakly-supervised framework that integrates PRototype-drIven content and staiNing patTERn decoupling and deformation-aware adversarial learning strategies designed to accurately learn IHC staining patterns while preserving H&E staining details. Our approach introduces three key innovations: (1) a prototype-driven staining pattern transfer with explicit content-style decoupling; (2) a cyclic registration-synthesis framework, GapBridge, that bridges H&E and IHC domains through deformable structural alignment, where registered features guide cross-modal style transfer while synthesized outputs iteratively refine the registration; and (3) deformation-aware adversarial learning, a training framework where a generator and deformation-aware registration network jointly adversarially optimize a style-focused discriminator.
Extensive experiments demonstrate that PRINTER effectively achieves superior performance in preserving H&E staining details and virtual staining fidelity, outperforming state-of-the-art methods. Our work provides a robust and scalable solution for virtual staining, advancing the field of computational pathology.

[246] POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion

Yuan Liu,Zhongyin Zhao,Le Tian,Haicheng Wang,Xubing Ye,Yangxiu You,Zilin Yu,Chuhan Wu,Xiao Zhou,Yang Yu,Jie Zhou

Main category: cs.CV

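TL;DR: This paper proposes a distillation-free, automated framework for building high-quality document extraction models; using synthetic data generation and iterative self-improvement, it trains a document conversion model that outperforms existing models.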

Details

Motivation: Manual annotation is costly and time-consuming, while automatic labeling with existing models is not accurate enough, which limits the performance of student models. A more efficient and accurate automated method is therefore needed to build document extraction datasets and models. Method: The method has two stages. The first generates large-scale, diverse synthetic data so that key elements are extracted in a unified format. The second is a self-improvement procedure that uses the fine-tuned model to annotate real documents, applies filtering strategies to verify annotation quality, and finally retrains the model on the verified dataset. Result: Training the public POINTS-1.5 model yields POINTS-Reader, which outperforms many existing open-source and proprietary models, including models of comparable or larger size. Conclusion: The paper presents a distillation-free, automated framework for building high-quality document extraction datasets and models, and iterative refinement significantly improves both the model's conversion capability and the quality of the generated data. Abstract: High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size.
Our model is available at https://github.com/Tencent/POINTS-Reader.
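
The annotate, filter, and retrain cycle described in the POINTS-Reader abstract can be sketched as a short loop. This is only an illustrative outline, not the authors' implementation: `self_improve`, `train`, `keep`, and `rounds` are hypothetical names, and the actual filtering strategies and stopping criteria are not given in the abstract.

```python
from typing import Callable, List, Tuple

# A labeled example: (document, target markup). Both are plain strings here.
Example = Tuple[str, str]
Model = Callable[[str], str]


def self_improve(
    train: Callable[[List[Example]], Model],  # hypothetical training routine
    synthetic: List[Example],                 # stage 1: synthetic data
    real_docs: List[str],                     # unlabeled real documents
    keep: Callable[[str, str], bool],         # hypothetical quality filter
    rounds: int = 3,                          # assumed number of iterations
) -> Model:
    # Stage 1: start from a model trained only on synthetic data.
    model = train(synthetic)
    for _ in range(rounds):
        # Stage 2a: annotate real documents with the current model.
        labeled = [(doc, model(doc)) for doc in real_docs]
        # Stage 2b: keep only annotations that pass the filter.
        verified = [(doc, out) for doc, out in labeled if keep(doc, out)]
        if not verified:
            break  # nothing passed the filter; keep the current model
        # Stage 2c: retrain on the verified dataset.
        model = train(verified)
    return model
```

Passing the trainer and the filter in as functions keeps the loop independent of any particular model or filtering rule, which matters here because the abstract leaves both unspecified.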

[247] FantasyHSI: Video-Generation-Centric 4D Human Synthesis In Any Scene through A Graph-based Multi-Agent Framework

Lingzhou Mu,Qiang Wang,Fan Jiang,Mengchao Wang,Yaqi Fan,Mu Xu,Kai Zhang

Main category: cs.CV

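TL;DR: FantasyHSI proposes a new framework built on video generation and a multi-agent system; through dynamic graph modeling and a feedback mechanism, it significantly improves the long-term consistency and physical realism of human-scene interaction.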

Details

Motivation: Handling long-horizon tasks and generalizing to unseen scenes remain challenging in human-scene interaction, which calls for a new framework that generates more realistic human behavior. Method: The complex interaction process is modeled as a dynamic directed graph on which a multi-agent system collaborates, and Direct Preference Optimization (DPO) is used to train the action generator to improve the physical realism of generated motions. Result: On the custom SceneBench benchmark, FantasyHSI significantly outperforms existing methods in generalization, long-horizon task completion, and physical realism. Conclusion: By introducing a scene navigator agent, a planning agent, and a critic agent, FantasyHSI builds a collaborative multi-agent system that effectively addresses long-horizon task completion and generalization to unseen scenes. Abstract: Human-Scene Interaction (HSI) seeks to generate realistic human behaviors within complex environments, yet it faces significant challenges in handling long-horizon, high-level tasks and generalizing to unseen scenes. To address these limitations, we introduce FantasyHSI, a novel HSI framework centered on video generation and multi-agent systems that operates without paired data. We model the complex interaction process as a dynamic directed graph, upon which we build a collaborative multi-agent system. This system comprises a scene navigator agent for environmental perception and high-level path planning, and a planning agent that decomposes long-horizon goals into atomic actions. Critically, we introduce a critic agent that establishes a closed-loop feedback mechanism by evaluating the deviation between generated actions and the planned path. This allows for the dynamic correction of trajectory drifts caused by the stochasticity of the generative model, thereby ensuring long-term logical consistency. To enhance the physical realism of the generated motions, we leverage Direct Preference Optimization (DPO) to train the action generator, significantly reducing artifacts such as limb distortion and foot-sliding. Extensive experiments on our custom SceneBench benchmark demonstrate that FantasyHSI significantly outperforms existing methods in terms of generalization, long-horizon task completion, and physical realism. Our project page: https://fantasy-amap.github.io/fantasy-hsi/

[248] RT-DETRv2 Explained in 8 Illustrations

Ethan Qi Yang Chua,Jen Hong Tan

Main category: cs.CV

TL;DR: The article simplifies the understanding of RT-DETRv2's complex object detection architecture through detailed illustrations and component breakdowns.

Details

Motivation: RT-DETRv2's architecture is difficult to understand, and existing diagrams do not adequately clarify its components' functionality and integration. Method: The article uses a series of eight carefully designed illustrations to explain the architecture of RT-DETRv2, moving from the overall pipeline down to critical components. Result: The visualization of tensor flow and explanation of module logic achieved the goal of making RT-DETRv2's architecture more comprehensible. Conclusion: The article concludes that through a series of illustrations, the complex architecture of RT-DETRv2 can be made genuinely understandable, providing a clearer mental model of its inner workings for researchers and practitioners. Abstract: Object detection architectures are notoriously difficult to understand, often more so than large language models. While RT-DETRv2 represents an important advance in real-time detection, most existing diagrams do little to clarify how its components actually work and fit together. In this article, we explain the architecture of RT-DETRv2 through a series of eight carefully designed illustrations, moving from the overall pipeline down to critical components such as the encoder, decoder, and multi-scale deformable attention. Our goal is to make the existing architecture genuinely understandable. By visualizing the flow of tensors and unpacking the logic behind each module, we hope to provide researchers and practitioners with a clearer mental model of how RT-DETRv2 works under the hood.

[249] Learning Correlation-aware Aleatoric Uncertainty for 3D Hand Pose Estimation

Lee Chae-Yeon,Nam Hyeon-Woo,Tae-Hyun Oh

Main category: cs.CV

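TL;DR: This paper proposes a new uncertainty modeling approach for 3D hand pose estimation that addresses shortcomings of existing methods and performs well in experiments.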

Details

Motivation: Existing 3D hand pose estimation methods cannot estimate aleatoric (data) uncertainty and do not model correlations between joints. Method: The paper introduces aleatoric uncertainty modeling and captures the intrinsic correlations among hand joints with a single linear layer. Result: Experiments show that the method outperforms existing approaches in uncertainty modeling and achieves favorable accuracy in 3D hand pose estimation. Conclusion: The paper proposes a new uncertainty modeling approach for 3D hand pose estimation that achieves a better trade-off between modeling joint correlations and computational efficiency, and that can be used as an add-on module for existing models. Abstract: 3D hand pose estimation is a fundamental task in understanding human hands. However, accurately estimating 3D hand poses remains challenging due to the complex movement of hands, self-similarity, and frequent occlusions. In this work, we address two limitations: the inability of existing 3D hand pose estimation methods to estimate aleatoric (data) uncertainty, and the lack of uncertainty modeling that incorporates joint correlation knowledge, which has not been thoroughly investigated. To this end, we introduce aleatoric uncertainty modeling into the 3D hand pose estimation framework, aiming to achieve a better trade-off between modeling joint correlations and computational efficiency. We propose a novel parameterization that leverages a single linear layer to capture intrinsic correlations among hand joints. This is enabled by formulating the hand joint output space as a probabilistic distribution, allowing the linear layer to capture joint correlations. Our proposed parameterization is used as a task head layer, and can be applied as an add-on module on top of the existing models. Our experiments demonstrate that our parameterization for uncertainty modeling outperforms existing approaches. Furthermore, the 3D hand pose estimation model equipped with our uncertainty head achieves favorable accuracy in 3D hand pose estimation while introducing new uncertainty modeling capability to the model. The project page is available at https://hand-uncertainty.github.io/.

[250] Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views

Xiangdong Zhang,Shaofeng Zhang,Junchi Yan

Main category: cs.CV

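TL;DR: This paper proposes Point-PQAE, a self-supervised learning method for point clouds based on two-view cross-reconstruction; with decoupled views and a dedicated positional encoding, it significantly improves model performance.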

Details

Motivation: Existing self-supervised learning methods for point clouds rely mainly on reconstruction within a single view and lack the diversity and information that multiple views provide, so a more challenging form of pre-training is worth exploring. Method: Point-PQAE adopts a two-view pre-training paradigm: it generates two decoupled views and performs cross-reconstruction with a novel positional encoding. Result: On three variants of ScanObjectNN under the Mlp-Linear evaluation protocol, performance improves by 6.5%, 7.0%, and 6.7%, respectively. Conclusion: Through two-view cross-reconstruction, Point-PQAE significantly improves point cloud self-supervised learning and surpasses single-view self-reconstruction methods. Abstract: Point cloud learning, especially in a self-supervised way without manual labels, has gained growing attention in both vision and learning communities due to its potential utility in a wide range of applications. Most existing generative approaches for point cloud self-supervised learning focus on recovering masked points from visible ones within a single view. A two-view pre-training paradigm inherently introduces greater diversity and variance, and may thus enable more challenging and informative pre-training. Inspired by this, we explore the potential of two-view learning in this domain. In this paper, we propose Point-PQAE, a cross-reconstruction generative paradigm that first generates two decoupled point clouds/views and then reconstructs one from the other. To achieve this goal, we develop a crop mechanism for point cloud view generation for the first time and further propose a novel positional encoding to represent the 3D relative position between the two decoupled views. The cross-reconstruction significantly increases the difficulty of pre-training compared to self-reconstruction, which enables our method to surpass previous single-modal self-reconstruction methods in 3D self-supervised learning. Specifically, it outperforms the self-reconstruction baseline (Point-MAE) by 6.5%, 7.0%, and 6.7% in three variants of ScanObjectNN with the Mlp-Linear evaluation protocol. The code is available at https://github.com/aHapBean/Point-PQAE.

[251] ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization

Thinh-Phuc Nguyen,Thanh-Hai Nguyen,Gia-Huy Dinh,Lam-Huy Nguyen,Minh-Triet Tran,Trung-Nghia Le

Main category: cs.CV

TL;DR: ReCap is a novel pipeline for event-enriched image retrieval and captioning that uses broader contextual information to generate narrative-rich, factually grounded captions. It includes an article retrieval system, a context extraction framework, and a caption generation system with Semantic Gaussian Normalization. Evaluated on the OpenEvents V1 dataset, ReCap achieved a strong overall score of 0.54666, ranking 2nd on the private test set.

Details

Motivation: Image captioning systems often produce generic descriptions that fail to capture event-level semantics which are crucial for applications like news reporting and digital archiving. Method: ReCap comprises three integrated components: a robust two-stage article retrieval system using DINOv2 embeddings, a context extraction framework that synthesizes information from article summaries, generic captions, and original source metadata, and a large language model-based caption generation system with Semantic Gaussian Normalization. Result: Evaluated on the OpenEvents V1 dataset as part of Track 1 in the EVENTA 2025 Grand Challenge, ReCap achieved a strong overall score of 0.54666, ranking 2nd on the private test set. Conclusion: ReCap is effective in bridging visual perception with real-world knowledge, offering a practical solution for context-aware image understanding in high-stakes domains. Abstract: Image captioning systems often produce generic descriptions that fail to capture event-level semantics which are crucial for applications like news reporting and digital archiving. We present ReCap, a novel pipeline for event-enriched image retrieval and captioning that incorporates broader contextual information from relevant articles to generate narrative-rich, factually grounded captions. Our approach addresses the limitations of standard vision-language models that typically focus on visible content while missing temporal, social, and historical contexts. 
ReCap comprises three integrated components: (1) a robust two-stage article retrieval system using DINOv2 embeddings with global feature similarity for initial candidate selection followed by patch-level mutual nearest neighbor similarity re-ranking; (2) a context extraction framework that synthesizes information from article summaries, generic captions, and original source metadata; and (3) a large language model-based caption generation system with Semantic Gaussian Normalization to enhance fluency and relevance. Evaluated on the OpenEvents V1 dataset as part of Track 1 in the EVENTA 2025 Grand Challenge, ReCap achieved a strong overall score of 0.54666, ranking 2nd on the private test set. These results highlight ReCap's effectiveness in bridging visual perception with real-world knowledge, offering a practical solution for context-aware image understanding in high-stakes domains. The code is available at https://github.com/Noridom1/EVENTA2025-Event-Enriched-Image-Captioning.
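
The two-stage retrieval in component (1) of the ReCap abstract can be illustrated with a small, dependency-free sketch: shortlist candidates by global feature similarity, then re-rank the shortlist by counting patch pairs that are mutual nearest neighbors. Everything here is an assumption for illustration: the embeddings are taken as already computed (for example by DINOv2), cosine similarity and the pair-count score are stand-ins for whatever ReCap actually uses, and `retrieve`, `mutual_nn_score`, and `top_k` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mutual_nn_score(query_patches: List[Vector], cand_patches: List[Vector]) -> int:
    """Count (query patch, candidate patch) pairs that are each other's nearest neighbor."""
    if not query_patches or not cand_patches:
        return 0
    sims = [[cosine(q, c) for c in cand_patches] for q in query_patches]
    # Nearest candidate patch for every query patch.
    best_for_q = [max(range(len(cand_patches)), key=lambda j: row[j]) for row in sims]
    # Nearest query patch for every candidate patch.
    best_for_c = [
        max(range(len(query_patches)), key=lambda i: sims[i][j])
        for j in range(len(cand_patches))
    ]
    return sum(1 for i, j in enumerate(best_for_q) if best_for_c[j] == i)


def retrieve(query: Dict, database: Dict[str, Dict], top_k: int = 2) -> List[str]:
    """Stage 1: shortlist by global similarity. Stage 2: re-rank by mutual-NN score."""
    shortlist = sorted(
        database,
        key=lambda name: cosine(query["global"], database[name]["global"]),
        reverse=True,
    )[:top_k]
    return sorted(
        shortlist,
        key=lambda name: mutual_nn_score(query["patches"], database[name]["patches"]),
        reverse=True,
    )
```

The cheap global comparison runs over the whole database, while the more expensive patch-level comparison only touches the `top_k` survivors, which is the usual motivation for a coarse-to-fine retrieval design.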

[252] Novel Category Discovery with X-Agent Attention for Open-Vocabulary Semantic Segmentation

Jiahao Li,Yang Lu,Yachao Zhang,Fangyong Wang,Yuan Xie,Yanyun Qu

Main category: cs.CV

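TL;DR: This work proposes X-Agent, which uses latent semantic-aware agents to orchestrate cross-modal attention, achieving state-of-the-art performance in open-vocabulary semantic segmentation (OVSS) and enhancing the saliency of latent semantics.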

Details

Motivation: Existing vision-language model (VLM)-based methods perform well, but the fundamental mechanisms of latent semantic comprehension remain underexplored, which has become a bottleneck for OVSS. Method: The paper builds X-Agent, an OVSS framework that uses latent semantic-aware agents to orchestrate cross-modal attention mechanisms, simultaneously optimizing latent semantic dynamics and amplifying their perceptibility. Result: Extensive benchmark evaluations show that X-Agent achieves state-of-the-art performance while effectively enhancing the saliency of latent semantics. Conclusion: By using latent semantic-aware "agents" to orchestrate cross-modal attention, X-Agent achieves state-of-the-art performance and effectively enhances the saliency of latent semantics. Abstract: Open-vocabulary semantic segmentation (OVSS) conducts pixel-level classification via text-driven alignment, where the domain discrepancy between base category training and open-vocabulary inference poses challenges in discriminative modeling of latent unseen categories. To address this challenge, existing vision-language model (VLM)-based approaches demonstrate commendable performance through pre-trained multi-modal representations. However, the fundamental mechanisms of latent semantic comprehension remain underexplored, making them a bottleneck for OVSS. In this work, we initiate a probing experiment to explore distribution patterns and dynamics of latent semantics in VLMs under inductive learning paradigms. Building on these insights, we propose X-Agent, an innovative OVSS framework employing latent semantic-aware "agent" to orchestrate cross-modal attention mechanisms, simultaneously optimizing latent semantic dynamic and amplifying its perceptibility. Extensive benchmark evaluations demonstrate that X-Agent achieves state-of-the-art performance while effectively enhancing the latent semantic saliency.

[253] SAR-NAS: Lightweight SAR Object Detection with Neural Architecture Search

Xinyi Yu,Zhiwei Lin,Yongtao Wang

Main category: cs.CV

TL;DR: This paper applies the lightweight YOLOv10 object detector to Synthetic Aperture Radar (SAR) object detection and enhances it using Neural Architecture Search (NAS) to optimize the network structure, particularly the backbone, for improved performance with lower computational cost.

Details

Motivation: SAR object detection faces challenges like speckle noise, small target ambiguities, and computational constraints. This paper aims to explore how NAS can enhance a lightweight detector like YOLOv10 to overcome these issues, instead of focusing solely on SAR-specific architectural designs. Method: The paper employs NAS to systematically optimize the backbone architecture of YOLOv10 by constructing an extensive search space and leveraging evolutionary search to identify a network structure that balances accuracy, parameter efficiency, and computational cost. Result: The proposed method outperforms existing SAR detection methods on the SARDet-100K dataset, achieving better detection accuracy with lower computational overhead. Conclusion: This work successfully introduces NAS to SAR object detection for the first time, offering a new perspective on improving lightweight detectors for real-world applications. Abstract: Synthetic Aperture Radar (SAR) object detection faces significant challenges from speckle noise, small target ambiguities, and on-board computational constraints. While existing approaches predominantly focus on SAR-specific architectural modifications, this paper explores the application of the existing lightweight object detector, i.e., YOLOv10, for SAR object detection and enhances its performance through Neural Architecture Search (NAS). Specifically, we employ NAS to systematically optimize the network structure, especially focusing on the backbone architecture search. By constructing an extensive search space and leveraging evolutionary search, our method identifies a favorable architecture that balances accuracy, parameter efficiency, and computational cost. Notably, this work introduces NAS to SAR object detection for the first time. 
The experimental results on the large-scale SARDet-100K dataset demonstrate that our optimized model outperforms existing SAR detection methods, achieving superior detection accuracy while maintaining lower computational overhead. We hope this work offers a novel perspective on leveraging NAS for real-world applications.

[254] Multi-Representation Adapter with Neural Architecture Search for Efficient Range-Doppler Radar Object Detection

Zhiwei Lin,Weicheng Zheng,Yongtao Wang

Main category: cs.CV

TL;DR: This paper introduces an efficient object detection model for radar data that achieves high accuracy and sets new performance records on two datasets.

Details

Motivation: The motivation is to develop an efficient object detection model using radar sensors that are more robust in adverse lighting and weather conditions compared to cameras. Method: The method involves representing RD radar maps with multi-representation (heatmaps and grayscale images), designing an Adapter branch, an Exchanger Module, and a Primary-Auxiliary Fusion Module to extract, exchange, and fuse features. A supernet with various width and fusion operations is constructed, and a One-Shot Neural Architecture Search method is employed to improve efficiency. Result: The model achieved state-of-the-art performance on the RADDet and CARRADA datasets with mAP@50 of 71.9 and 57.1, respectively, demonstrating a good trade-off between accuracy and efficiency. Conclusion: The paper concludes that their proposed object detection model for RD radar maps achieves a favorable balance between accuracy and efficiency, setting a new state-of-the-art performance on two datasets. Abstract: Detecting objects efficiently from radar sensors has recently become a popular trend due to their robustness against adverse lighting and weather conditions compared with cameras. This paper presents an efficient object detection model for Range-Doppler (RD) radar maps. Specifically, we first represent RD radar maps with multi-representation, i.e., heatmaps and grayscale images, to gather high-level object and fine-grained texture features. Then, we design an additional Adapter branch, an Exchanger Module with two modes, and a Primary-Auxiliary Fusion Module to effectively extract, exchange, and fuse features from the multi-representation inputs, respectively. Furthermore, we construct a supernet with various width and fusion operations in the Adapter branch for the proposed model and employ a One-Shot Neural Architecture Search method to further improve the model's efficiency while maintaining high performance. 
Experimental results demonstrate that our model obtains a favorable trade-off between accuracy and efficiency. Moreover, we achieve new state-of-the-art performance on the RADDet and CARRADA datasets with mAP@50 of 71.9 and 57.1, respectively.

[255] Cross-Domain Few-Shot Segmentation via Ordinary Differential Equations over Time Intervals

Huan Ni,Qingshan Liu,Xiaonan Niu,Danfeng Hong,Lingli Zhao,Haiyan Guan

Main category: cs.CV

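TL;DR: This paper proposes FSS-TIs, a cross-domain few-shot segmentation method based on ordinary differential equations and the Fourier transform; by optimizing the ODE parameters, it explores domain-agnostic feature spaces and simulates diverse target-domain distributions, showing better performance and cross-domain adaptability than existing methods.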

Details

Motivation: Existing CD-FSS methods typically design multiple independent modules to improve the cross-domain generalization of feature representations, but the independence among these modules hinders the effective flow of knowledge and makes it hard to realize their full potential. Method: The paper proposes FSS-TIs, a new and structurally concise method based on ordinary differential equations and the Fourier transform, which explores the domain-agnostic feature representation space and simulates diverse target-domain distributions by optimizing the intrinsic parameters of the ODE. Result: Experiments show that FSS-TIs performs better on cross-domain few-shot segmentation tasks, and in-depth ablation studies validate its cross-domain adaptability. Conclusion: FSS-TIs performs well on cross-domain few-shot segmentation, outperforms existing CD-FSS methods, and shows validated cross-domain adaptability. Abstract: Cross-domain few-shot segmentation (CD-FSS) not only enables the segmentation of unseen categories with very limited samples, but also improves cross-domain generalization ability within the few-shot segmentation framework. Currently, existing CD-FSS studies typically design multiple independent modules to enhance the cross-domain generalization ability of feature representations. However, the independence among these modules hinders the effective flow of knowledge, making it difficult to fully leverage their collective potential. In contrast, this paper proposes an all-in-one module based on ordinary differential equations and the Fourier transform, resulting in a structurally concise method, Few-Shot Segmentation over Time Intervals (FSS-TIs). FSS-TIs assumes the existence of an ODE relationship between the spectra (including amplitude and phase spectra) of domain-specific features and domain-agnostic features. This ODE formulation yields an iterative transformation process along a sequence of time intervals, while simultaneously applying affine transformations with randomized perturbations to the spectra. In doing so, the exploration of domain-agnostic feature representation spaces and the simulation of diverse potential target-domain distributions are reformulated as an optimization process over the intrinsic parameters of the ODE. Moreover, we strictly constrain the support-sample selection during target-domain fine-tuning so that it is consistent with the requirements of real-world few-shot segmentation tasks.
For evaluation, we introduce five datasets from substantially different domains and define two sets of cross-domain few-shot segmentation tasks to comprehensively analyze the performance of FSS-TIs. Experimental results demonstrate the superiority of FSS-TIs over existing CD-FSS methods, and in-depth ablation studies further validate the cross-domain adaptability of FSS-TIs.

[256] Guided Model-based LiDAR Super-Resolution for Resource-Efficient Automotive scene Segmentation

Alexandros Gkillas,Nikos Piperigkos,Aris S. Lalos

Main category: cs.CV

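TL;DR: This work develops a lightweight end-to-end framework that combines LiDAR super-resolution with semantic segmentation, addressing the limited accuracy of low-cost sensors.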

Details

Motivation: High-resolution LiDAR data is critical for 3D semantic segmentation in autonomous driving, but high-end sensors are expensive, while low-cost sensors such as 16-channel LiDAR produce sparse point clouds that reduce segmentation accuracy. Method: The paper proposes a new lightweight model trained with a joint optimization strategy that brings semantic information into the super-resolution module, together with a new super-resolution loss function that focuses on regions of interest. Result: Experiments show that the proposed method achieves segmentation performance comparable to models that operate on high-resolution 64-channel LiDAR data. Conclusion: The proposed lightweight, end-to-end framework effectively combines LiDAR super-resolution and semantic segmentation, mitigates the loss of segmentation accuracy caused by sparse point clouds from low-cost sensors, and shows good application prospects. Abstract: High-resolution LiDAR data plays a critical role in 3D semantic segmentation for autonomous driving, but the high cost of advanced sensors limits large-scale deployment. In contrast, low-cost sensors such as 16-channel LiDAR produce sparse point clouds that degrade segmentation accuracy. To overcome this, we introduce the first end-to-end framework that jointly addresses LiDAR super-resolution (SR) and semantic segmentation. The framework employs joint optimization during training, allowing the SR module to incorporate semantic cues and preserve fine details, particularly for smaller object classes. A new SR loss function further directs the network to focus on regions of interest. The proposed lightweight, model-based SR architecture uses significantly fewer parameters than existing LiDAR SR approaches, while remaining easily compatible with segmentation networks. Experiments show that our method achieves segmentation performance comparable to models operating on high-resolution and costly 64-channel LiDAR data.

[257] Prior-Guided Residual Diffusion: Calibrated and Efficient Medical Image Segmentation

Fuyou Mao,Beining Wu,Yanfeng Jiang,Han Xue,Yan Tang,Hao Zhang

Main category: cs.CV

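TL;DR: The paper proposes Prior-Guided Residual Diffusion (PGRD), a diffusion model for medical image segmentation that captures the full conditional distribution and achieves higher accuracy and efficiency.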

Details

Motivation: Ambiguity in medical image segmentation requires models that capture the full conditional distribution rather than a single point estimate. Method: PGRD aligns segmentation with diffusion modeling by embedding discrete labels in a continuous space. A coarse prior predictor provides step-wise guidance, and the diffusion network learns the residual to the prior, which accelerates convergence and improves calibration. In addition, a deep diffusion supervision scheme stabilizes training by supervising intermediate time steps. Result: Evaluated on MRI and CT datasets, PGRD achieves higher Dice scores and lower NLL/ECE values while needing fewer sampling steps to reach strong performance. Conclusion: PGRD is an effective medical image segmentation method that captures the full conditional distribution and offers higher performance and efficiency. Abstract: Ambiguity in medical image segmentation calls for models that capture full conditional distributions rather than a single point estimate. We present Prior-Guided Residual Diffusion (PGRD), a diffusion-based framework that learns voxel-wise distributions while maintaining strong calibration and practical sampling efficiency. PGRD embeds discrete labels as one-hot targets in a continuous space to align segmentation with diffusion modeling. A coarse prior predictor provides step-wise guidance; the diffusion network then learns the residual to the prior, accelerating convergence and improving calibration. A deep diffusion supervision scheme further stabilizes training by supervising intermediate time steps. Evaluated on representative MRI and CT datasets, PGRD achieves higher Dice scores and lower NLL/ECE values than Bayesian, ensemble, Probabilistic U-Net, and vanilla diffusion baselines, while requiring fewer sampling steps to reach strong performance.
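
Two ideas from the PGRD abstract, one-hot embedding of discrete labels and learning a residual to a coarse prior, reduce to simple arithmetic for a single voxel. The sketch below shows only that arithmetic; it leaves out the diffusion process itself, and the subtraction-based residual, along with the names `one_hot`, `residual_target`, and `refine`, are assumptions rather than the paper's definitions.

```python
from typing import List


def one_hot(label: int, num_classes: int) -> List[float]:
    """Embed a discrete class label as a continuous one-hot vector."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def residual_target(label: int, prior_probs: List[float]) -> List[float]:
    """Assumed training target: what must be added to the prior to reach the one-hot label."""
    target = one_hot(label, len(prior_probs))
    return [t - p for t, p in zip(target, prior_probs)]


def refine(prior_probs: List[float], predicted_residual: List[float]) -> int:
    """Add a predicted residual to the prior and return the most likely class index."""
    refined = [p + r for p, r in zip(prior_probs, predicted_residual)]
    return max(range(len(refined)), key=lambda k: refined[k])
```

When the prior is already close to the label, the residual is small, which gives some intuition for why learning the residual rather than the full target could speed up convergence, as the abstract reports.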

[258] Image Quality Enhancement and Detection of Small and Dense Objects in Industrial Recycling Processes

Oussama Messai,Abbass Zein-Eddine,Abdelouahid Bentamou,Mickaël Picq,Nicolas Duquesne,Stéphane Puydarrieux,Yann Gavet

Main category: cs.CV

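TL;DR: This paper studies how supervised deep learning methods can improve object detection and image denoising in industrial settings, proposes a lightweight model, and evaluates its effectiveness.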

Details

Motivation: The paper targets two key problems in computer vision: detecting small, dense, and overlapping objects, and improving the quality of noisy images in industrial environments. Method: It uses supervised deep learning methods for object detection and image quality enhancement, and proposes a lightweight model based on a fully connected convolutional network. Result: By analyzing a new dataset of more than 10,000 images and 120,000 instances, the paper evaluates the performance, accuracy, and computational efficiency of different methods and proposes a new lightweight model. Conclusion: Supervised deep learning methods can effectively handle the detection of small, dense, and overlapping objects and image denoising in industrial environments. The paper also proposes a lightweight model and outlines directions for future research. Abstract: This paper tackles two key challenges: detecting small, dense, and overlapping objects (a major hurdle in computer vision) and improving the quality of noisy images, especially those encountered in industrial environments [1, 2]. Our focus is on evaluating methods built on supervised deep learning. We perform an analysis of these methods, using a newly developed dataset comprising over 10k images and 120k instances. By evaluating their performance, accuracy, and computational efficiency, we identify the most reliable detection systems and highlight the specific challenges they address in industrial applications. This paper also examines the use of deep learning models to improve image quality in noisy industrial environments. We introduce a lightweight model based on a fully connected convolutional network. Additionally, we suggest potential future directions for further enhancing the effectiveness of the model. The repository of the dataset and proposed model can be found at: https://github.com/o-messai/SDOOD, https://github.com/o-messai/DDSRNet

[259] Street-Level Geolocalization Using Multimodal Large Language Models and Retrieval-Augmented Generation

Yunus Serhat Bicakci,Joseph Shingleton,Anahid Basiri

Main category: cs.CV

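TL;DR: This paper proposes a new approach to street-level geolocalization from images that combines multimodal large language models with retrieval-augmented generation; it requires no expensive fine-tuning or retraining and achieves state-of-the-art performance on several benchmark datasets.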

Details

Motivation: Street-level geolocalization of images is crucial for applications such as navigation, location-based recommendations, and urban planning. With the spread of social media data and smartphone cameras, localizing images with traditional computer vision techniques has become increasingly challenging, yet highly valuable. Method: The method builds a vector database with the SigLIP encoder on two large-scale datasets (EMP-16 and OSV-5M) and integrates open-weight, publicly accessible multimodal large language models with retrieval-augmented generation, augmenting each query image with prompts that contain both similar and dissimilar geolocation information. Result: The method achieves higher accuracy than existing methods on three widely used benchmark datasets (IM2GPS, IM2GPS3k, and YFCC4k), requires no expensive fine-tuning or retraining, and scales seamlessly to incorporate new data sources. Conclusion: Multimodal large language models with retrieval-augmented generation offer an alternative to training geolocation models from scratch, providing a more accessible and scalable solution for GeoAI. Abstract: Street-level geolocalization from images is crucial for a wide range of essential applications and services, such as navigation, location-based recommendations, and urban planning. With the growing popularity of social media data and cameras embedded in smartphones, applying traditional computer vision techniques to localize images has become increasingly challenging, yet highly valuable. This paper introduces a novel approach that integrates open-weight and publicly accessible multimodal large language models with retrieval-augmented generation. The method constructs a vector database using the SigLIP encoder on two large-scale datasets (EMP-16 and OSV-5M). Query images are augmented with prompts containing both similar and dissimilar geolocation information retrieved from this database before being processed by the multimodal large language models. Our approach has demonstrated state-of-the-art performance, achieving higher accuracy than existing methods on three widely used benchmark datasets (IM2GPS, IM2GPS3k, and YFCC4k). Importantly, our solution eliminates the need for expensive fine-tuning or retraining and scales seamlessly to incorporate new data sources. The effectiveness of retrieval-augmented generation-based multimodal large language models in geolocation estimation demonstrated by this paper suggests an alternative path to the traditional methods which rely on training models from scratch, opening new possibilities for more accessible and scalable solutions in GeoAI.

[260] AgroSense: An Integrated Deep Learning System for Crop Recommendation via Soil Image Analysis and Nutrient Profiling

Vishal Pandey,Ranjita Das,Debasmita Biswas

Main category: cs.CV

TL;DR: This paper proposes AgroSense, a deep-learning framework that integrates soil image classification and nutrient profiling to produce accurate and contextually relevant crop recommendations, with a fused model achieving high accuracy and efficiency.

Details

Motivation: Meeting the increasing global demand for food security and sustainable farming requires intelligent crop recommendation systems that operate in real time. Traditional soil analysis techniques are often slow, labor-intensive, and not suitable for on-field decision-making. Method: AgroSense comprises two main components: a Soil Classification Module, which leverages ResNet-18, EfficientNet-B0, and Vision Transformer architectures to categorize soil types from images; and a Crop Recommendation Module, which employs a Multi-Layer Perceptron, XGBoost, LightGBM, and TabNet to analyze structured soil data, including nutrient levels, pH, and rainfall. Result: The fused model achieves 98.0% accuracy, with a precision of 97.8%, a recall of 97.7%, and an F1-score of 96.75%, while RMSE and MAE drop to 0.32 and 0.27, respectively. Ablation studies underscore the critical role of multimodal coupling, and statistical validation via t-tests and ANOVA confirms the significance of our improvements. Conclusion: AgroSense offers a practical, scalable solution for real-time decision support in precision agriculture and paves the way for future lightweight multimodal AI systems in resource-constrained environments. Abstract: Meeting the increasing global demand for food security and sustainable farming requires intelligent crop recommendation systems that operate in real time. Traditional soil analysis techniques are often slow, labor-intensive, and not suitable for on-field decision-making. To address these limitations, we introduce AgroSense, a deep-learning framework that integrates soil image classification and nutrient profiling to produce accurate and contextually relevant crop recommendations. 
AgroSense comprises two main components: a Soil Classification Module, which leverages ResNet-18, EfficientNet-B0, and Vision Transformer architectures to categorize soil types from images; and a Crop Recommendation Module, which employs a Multi-Layer Perceptron, XGBoost, LightGBM, and TabNet to analyze structured soil data, including nutrient levels, pH, and rainfall. We curated a multimodal dataset of 10,000 paired samples drawn from publicly available Kaggle repositories, approximately 50,000 soil images across seven classes, and 25,000 nutrient profiles for experimental evaluation. The fused model achieves 98.0% accuracy, with a precision of 97.8%, a recall of 97.7%, and an F1-score of 96.75%, while RMSE and MAE drop to 0.32 and 0.27, respectively. Ablation studies underscore the critical role of multimodal coupling, and statistical validation via t-tests and ANOVA confirms the significance of our improvements. AgroSense offers a practical, scalable solution for real-time decision support in precision agriculture and paves the way for future lightweight multimodal AI systems in resource-constrained environments.

[261] M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision

Che Liu,Zheng Jiang,Chengyu Fang,Heng Guo,Yan-Jie Zhou,Jiaqi Qu,Le Lu,Minfeng Xu

Main category: cs.CV

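TL;DR: M3Ret is a unified medical image retrieval model that requires no modality-specific customization, achieves cross-modal retrieval through self-supervised learning, and outperforms existing methods.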

Details

Motivation: Current medical image retrieval methods use different architectures and training strategies for different modalities (e.g., 2D, 3D, video) and lack a unified representation, which limits scalability and performance. Method: The authors build a hybrid-modality dataset of 867,653 medical imaging samples and train a unified visual encoder, M3Ret, with two self-supervised learning methods, MAE and SimDINO. Result: M3Ret surpasses strong baselines such as DINOv3 and BMC-CLIP on zero-shot image retrieval, and shows strong cross-modal alignment and good generalization to unseen MRI tasks. Conclusion: M3Ret provides a scalable, unified framework for medical image understanding and advances foundation models based on visual self-supervised learning. Abstract: Medical image retrieval is essential for clinical decision-making and translational research, relying on discriminative visual representations. Yet, current methods remain fragmented, relying on separate architectures and training strategies for 2D, 3D, and video-based medical data. This modality-specific design hampers scalability and inhibits the development of unified representations. To enable unified learning, we curate a large-scale hybrid-modality dataset comprising 867,653 medical imaging samples, including 2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging this dataset, we train M3Ret, a unified visual encoder without any modality-specific customization. It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms. Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired data, and the model generalizes to unseen MRI tasks, despite never observing MRI during pretraining, demonstrating the generalizability of purely visual self-supervision to unseen modalities. Comprehensive analyses further validate the scalability of our framework across model and data sizes. These findings deliver a promising signal to the medical imaging community, positioning M3Ret as a step toward foundation models for visual SSL in multimodal medical image understanding.

[262] Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement

Jiayi Gao,Changcheng Hua,Qingchao Chen,Yuxin Peng,Yang Liu

Main category: cs.CV

TL;DR: This paper proposes a training-free framework for identity-preserving text-to-video generation that enhances prompts, images, and guidance. It achieves state-of-the-art results and wins first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge.

Details

Motivation: The motivation is to overcome the limitations of data scarcity and high tuning costs in existing identity-preserving text-to-video generation methods by introducing a training-free framework that enhances prompts, images, and guidance. Method: The paper introduces the TPIGE framework, which includes Face Aware Prompt Enhancement using GPT-4o, Prompt Aware Reference Image Enhancement using an identity-preserving image generator, and ID-Aware Spatiotemporal Guidance Enhancement using unified gradients to optimize identity preservation and video quality during generation. Result: The proposed method outperforms prior work and demonstrates strong generality and state-of-the-art performance on a 1000 video test set. Conclusion: The proposed TPIGE framework achieves state-of-the-art performance in identity-preserving text-to-video generation, validated through automatic and human evaluations, and wins first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge. Abstract: Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. While fine-tuning large pretrained video diffusion models on ID-matched data achieves state-of-the-art results on IPT2V, data scarcity and high tuning costs hinder broader improvement. We thus introduce a Training-Free Prompt, Image, and Guidance Enhancement (TPIGE) framework that bridges the semantic gap between the video description and the reference image and designs sampling guidance that enhances identity preservation and video quality, achieving performance gains at minimal cost. Specifically, we first propose Face Aware Prompt Enhancement, using GPT-4o to enhance the text prompt with facial details derived from the reference image. We then propose Prompt Aware Reference Image Enhancement, leveraging an identity-preserving image generator to refine the reference image, rectifying conflicts with the text prompt. 
The above mutual refinement significantly improves input quality before video generation. Finally, we propose ID-Aware Spatiotemporal Guidance Enhancement, utilizing unified gradients to optimize identity preservation and video quality jointly during generation. Our method outperforms prior work and is validated by automatic and human evaluations on a 1000 video test set, winning first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge, demonstrating state-of-the-art performance and strong generality. The code is available at https://github.com/Andyplus1/IPT2V.git.

[263] Uirapuru: Timely Video Analytics for High-Resolution Steerable Cameras on Edge Devices

Guilherme H. Apostolo,Pablo Bauszat,Vinod Nigade,Henri E. Bal,Lin Wang

Main category: cs.CV

TL;DR: Uirapuru is a novel framework for real-time video analytics on high-resolution steerable cameras that improves accuracy and speed compared to existing approaches designed for static cameras.

Details

Motivation: The motivation is to improve real-time video analytics for steerable cameras, which bring significant scene dynamism compared to static viewpoint cameras, limiting the effectiveness of existing approaches like frame tiling. Method: Uirapuru uses a comprehensive understanding of camera actuation paired with fast adaptive tiling at a per-frame level to handle the dynamism introduced by steerable cameras. Result: Uirapuru achieves up to 1.45x improvement in accuracy while respecting latency budgets or provides up to 4.53x inference speedup with comparable accuracy, evaluated on high-resolution datasets with PTZ movements and real-world videos. Conclusion: Uirapuru successfully addresses the challenges of real-time video analytics on high-resolution steerable cameras by incorporating adaptive tiling and an understanding of camera actuation, outperforming static camera approaches in accuracy and speed. Abstract: Real-time video analytics on high-resolution cameras has become a popular technology for various intelligent services like traffic control and crowd monitoring. While extensive work has been done on improving analytics accuracy with timing guarantees, virtually all of them target static viewpoint cameras. In this paper, we present Uirapuru, a novel framework for real-time, edge-based video analytics on high-resolution steerable cameras. The actuation performed by those cameras brings significant dynamism to the scene, presenting a critical challenge to existing popular approaches such as frame tiling. To address this problem, Uirapuru incorporates a comprehensive understanding of camera actuation into the system design paired with fast adaptive tiling at a per-frame level. We evaluate Uirapuru on a high-resolution video dataset, augmented by pan-tilt-zoom (PTZ) movements typical for steerable cameras and on real-world videos collected from an actual PTZ camera. 
Our experimental results show that Uirapuru provides up to 1.45x improvement in accuracy while respecting specified latency budgets or reaches up to 4.53x inference speedup with on-par accuracy compared to state-of-the-art static camera approaches.
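
For readers unfamiliar with frame tiling, the following sketch shows the plain, static form of the technique that the abstract names as an existing approach: a high-resolution frame is cut into fixed-size tiles that a detector can process independently. It is not Uirapuru's adaptive per-frame tiling; the function name `tile_frame` and the tile sizes are assumptions chosen for illustration.

```python
def tile_frame(width, height, tile_w, tile_h):
    """Split a frame into a fixed grid of (left, top, right, bottom) boxes.

    Tiles on the right and bottom edges are clipped to the frame size.
    """
    tiles = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            right = min(left + tile_w, width)
            bottom = min(top + tile_h, height)
            tiles.append((left, top, right, bottom))
    return tiles


# A 4K frame split into 1280x720 tiles gives a 3x3 grid.
tiles = tile_frame(3840, 2160, 1280, 720)
print(len(tiles), tiles[0], tiles[-1])
```

A fixed grid like this assumes a stable viewpoint. When the camera pans, tilts, or zooms, the regions of interest move between tiles from frame to frame, which is the dynamism the abstract says motivates adaptive tiling.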

[264] Unsupervised Ultra-High-Resolution UAV Low-Light Image Enhancement: A Benchmark, Metric and Framework

Wei Lu,Lingyu Zhu,Si-Bao Chen

Main category: cs.CV

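TL;DR: This paper proposes U3LIE, an efficient low-light image enhancement framework that addresses the performance degradation of UAVs in low-light conditions.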

Details

Motivation: Low-light conditions significantly degrade the performance of Unmanned Aerial Vehicles (UAVs) in critical applications. Existing low-light image enhancement methods struggle with the unique challenges of aerial imagery, including ultra-high resolution, lack of paired data, severe non-uniform illumination, and deployment constraints. Method: The authors develop U3LIE, an efficient framework with two training-only designs: Adaptive Pre-enhancement Augmentation (APA) for input normalization and a Luminance Interval Loss (L_int) for exposure control. They introduce the Edge Efficiency Index (EEI), a new metric that balances perceptual quality with key deployment factors: speed, resolution, model complexity, and memory footprint. They also present U3D, the first unsupervised ultra-high-resolution UAV dataset for low-light image enhancement, with a unified evaluation toolkit. Result: U3LIE achieves state-of-the-art results, processing 4K images at 23.8 FPS on a single GPU, which makes it suitable for real-time on-board deployment. Conclusion: Together, the dataset, metric, and method provide a holistic solution for advancing robust 24/7 UAV vision. The code and datasets are available at https://github.com/lwCVer/U3D_Toolkit. Abstract: Low-light conditions significantly degrade the performance of Unmanned Aerial Vehicles (UAVs) in critical applications. Existing Low-light Image Enhancement (LIE) methods struggle with the unique challenges of aerial imagery, including Ultra-High Resolution (UHR), lack of paired data, severe non-uniform illumination, and deployment constraints. To address these issues, we propose three key contributions. First, we present U3D, the first unsupervised UHR UAV dataset for LIE, with a unified evaluation toolkit. Second, we introduce the Edge Efficiency Index (EEI), a novel metric balancing perceptual quality with key deployment factors: speed, resolution, model complexity, and memory footprint. Third, we develop U3LIE, an efficient framework with two training-only designs: Adaptive Pre-enhancement Augmentation (APA) for input normalization and a Luminance Interval Loss (L_int) for exposure control. U3LIE achieves SOTA results, processing 4K images at 23.8 FPS on a single GPU, making it ideal for real-time on-board deployment. In summary, these contributions provide a holistic solution (dataset, metric, and method) for advancing robust 24/7 UAV vision. The code and datasets are available at https://github.com/lwCVer/U3D_Toolkit.

[265] Enhancing Partially Relevant Video Retrieval with Robust Alignment Learning

Long Zhang,Peipei Song,Jianfeng Dong,Kun Li,Xun Yang

Main category: cs.CV

TL;DR: This paper introduces the Robust Alignment Learning (RAL) framework for Partially Relevant Video Retrieval (PRVR), which addresses data uncertainty through probabilistic modeling and dynamic similarity weighting, enhancing retrieval accuracy.

Details

Motivation: PRVR faces challenges due to data uncertainty, including query ambiguity and partial video relevance, which existing methods struggle to handle effectively. Method: The RAL framework encodes videos and queries as multivariate Gaussian distributions for probabilistic modeling and uses learnable confidence gates to dynamically weight similarity based on the informativeness of query words. Result: The RAL framework demonstrates robust performance across diverse retrieval backbones, showing its effectiveness in handling PRVR tasks. Conclusion: The proposed RAL framework effectively addresses the challenges of PRVR by explicitly modeling data uncertainty and dynamically weighting similarity, proving to be a robust and versatile solution. Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos partially relevant to a given query. The core challenge lies in learning robust query-video alignment against spurious semantic correlations arising from inherent data uncertainty: 1) query ambiguity, where the query incompletely characterizes the target video and often contains uninformative tokens, and 2) partial video relevance, where abundant query-irrelevant segments introduce contextual noise in cross-modal alignment. Existing methods often focus on enhancing multi-scale clip representations and retrieving the most relevant clip. However, the inherent data uncertainty in PRVR renders them vulnerable to distractor videos with spurious similarities, leading to suboptimal performance. To fill this research gap, we propose the Robust Alignment Learning (RAL) framework, which explicitly models the uncertainty in data. Key innovations include: 1) we pioneer probabilistic modeling for PRVR by encoding videos and queries as multivariate Gaussian distributions. 
This not only quantifies data uncertainty but also enables proxy-level matching to capture the variability in cross-modal correspondences; 2) we consider the heterogeneous informativeness of query words and introduce learnable confidence gates to dynamically weight similarity. As a plug-and-play solution, RAL can be seamlessly integrated into the existing architectures. Extensive experiments across diverse retrieval backbones demonstrate its effectiveness.

[266] RibPull: Implicit Occupancy Fields and Medial Axis Extraction for CT Ribcage Scans

Emmanouil Nikolakakis,Amine Ouasfi,Julie Digne,Razvan Marinescu

Main category: cs.CV

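TL;DR: RibPull is a new method based on implicit occupancy fields that handles sparse data and geometric operations in medical imaging better than traditional voxel-based methods.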

Details

Motivation: Traditional voxel grids in medical imaging suffer from resolution limitations, loss of topological information, and inefficient handling of sparse data, so a better representation for sparse data is needed. Method: Neural occupancy fields predict whether a 3D point lies inside the object, and a Laplacian-based contraction extracts the medial axis of the ribcage. Result: The method is evaluated on 20 medical scans from the RibSeg dataset, demonstrating the advantages of continuous coordinate-based representations for geometric operations. Conclusion: RibPull proposes an approach based on implicit occupancy fields that bridges computational geometry and medical imaging, handles sparse and noisy data more effectively, and outperforms voxel-based methods on geometric operations. Abstract: We present RibPull, a methodology that utilizes implicit occupancy fields to bridge computational geometry and medical imaging. Implicit 3D representations use continuous functions that handle sparse and noisy data more effectively than discrete methods. While voxel grids are standard for medical imaging, they suffer from resolution limitations, topological information loss, and inefficient handling of sparsity. Coordinate functions preserve complex geometrical information and represent a better solution for sparse data representation, while allowing for further morphological operations. Implicit scene representations enable neural networks to encode entire 3D scenes within their weights. The result is a continuous function that can implicitly compensate for sparse signals and infer further information about the 3D scene by passing any combination of 3D coordinates as input to the model. In this work, we use neural occupancy fields that predict whether a 3D point lies inside or outside an object to represent CT-scanned ribcages. We also apply a Laplacian-based contraction to extract the medial axis of the ribcage, thus demonstrating a geometrical operation that benefits greatly from continuous coordinate-based 3D scene representations versus voxel-based representations. We evaluate our methodology on 20 medical scans from the RibSeg dataset, which is itself an extension of the RibFrac dataset. We will release our code upon publication.
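
To make the idea of an occupancy field concrete, here is a minimal sketch in which an analytic unit sphere stands in for the trained network. The function `occupancy` and the helper `occupied_fraction` are hypothetical names; a neural occupancy field would replace the closed-form test with a learned function of the same signature. The point is that the field can be queried at any continuous coordinate and sampled at any resolution after the fact.

```python
import math


def occupancy(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Return 1.0 if the 3D point lies inside the shape, else 0.0.

    Stand-in for a neural occupancy field: here the shape is a sphere.
    """
    dx, dy, dz = (p - c for p, c in zip(point, center))
    return 1.0 if dx * dx + dy * dy + dz * dz <= radius * radius else 0.0


def occupied_fraction(field, resolution):
    """Sample the field at cell centers of a grid over [-1, 1]^3.

    The grid resolution is chosen at query time, not fixed by the data.
    """
    step = 2.0 / resolution
    coords = [-1.0 + (i + 0.5) * step for i in range(resolution)]
    inside = sum(field((x, y, z)) for x in coords for y in coords for z in coords)
    return inside / resolution ** 3


# Query an arbitrary continuous coordinate, then sample a 20^3 grid.
print(occupancy((0.5, 0.5, 0.5)))
print(occupied_fraction(occupancy, 20), math.pi / 6)
```

The sampled fraction approaches the sphere-to-cube volume ratio of pi/6 as the resolution grows, which is the sense in which a continuous field avoids committing to one voxel size.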

[267] Neural Scene Designer: Self-Styled Semantic Image Manipulation

Jianman Lin,Tianshui Chen,Chunmei Qing,Zhijing Yang,Shuangping Huang,Yuheng Ren,Liang Lin

Main category: cs.CV

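TL;DR: This paper proposes NSD, a new framework for image editing and inpainting that uses a dual cross-attention mechanism to process text and style information separately, achieving semantic alignment and stylistic consistency, and introduces the PSRL module, which uses a contrastive loss to enforce style consistency within an image.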

Details

Motivation: Existing methods focus mainly on semantic control of generated content and neglect the critical task of preserving stylistic consistency. Method: The authors propose the NSD framework and the PSRL module, using dual cross-attention and a style contrastive loss to process text and style information separately and to ensure both semantic control and style consistency. Result: Experiments on the newly established comprehensive benchmark show that the framework is effective for image editing and inpainting. Conclusion: The NSD framework achieves semantic alignment and stylistic consistency in image editing, and the proposed PSRL module improves the preservation of style consistency. Abstract: Maintaining stylistic consistency is crucial for the cohesion and aesthetic appeal of images, a fundamental requirement in effective image editing and inpainting. However, existing methods primarily focus on the semantic control of generated content, often neglecting the critical task of preserving this consistency. In this work, we introduce the Neural Scene Designer (NSD), a novel framework that enables photo-realistic manipulation of user-specified scene regions while ensuring both semantic alignment with user intent and stylistic consistency with the surrounding environment. NSD leverages an advanced diffusion model, incorporating two parallel cross-attention mechanisms that separately process text and style information to achieve the dual objectives of semantic control and style consistency. To capture fine-grained style representations, we propose the Progressive Self-style Representational Learning (PSRL) module. This module is predicated on the intuitive premise that different regions within a single image share a consistent style, whereas regions from different images exhibit distinct styles. The PSRL module employs a style contrastive loss that encourages high similarity between representations from the same image while enforcing dissimilarity between those from different images. Furthermore, to address the lack of standardized evaluation protocols for this task, we establish a comprehensive benchmark. This benchmark includes competing algorithms, dedicated style-related metrics, and diverse datasets and settings to facilitate fair comparisons. Extensive experiments conducted on our benchmark demonstrate the effectiveness of the proposed framework.
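
The premise behind the style contrastive loss (regions from the same image should have similar representations, regions from different images should not) can be sketched with a generic InfoNCE-style loss. This is an assumption for illustration, not the PSRL formulation: the abstract does not give the exact loss, and the function name, temperature value, and toy embeddings below are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def style_contrastive_loss(embeddings, image_ids, temperature=0.1):
    """Generic InfoNCE-style loss over region embeddings.

    For each anchor region, regions with the same image id are positives
    and all other regions are negatives.
    """
    total, count = 0.0, 0
    for i, anchor in enumerate(embeddings):
        positives, everything = 0.0, 0.0
        for j, other in enumerate(embeddings):
            if i == j:
                continue
            weight = math.exp(cosine(anchor, other) / temperature)
            everything += weight
            if image_ids[i] == image_ids[j]:
                positives += weight
        if positives > 0.0:
            total += -math.log(positives / everything)
            count += 1
    return total / count if count else 0.0


# Two regions per image; regions 0,1 look alike and regions 2,3 look alike.
regions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
consistent = style_contrastive_loss(regions, [0, 0, 1, 1])
inconsistent = style_contrastive_loss(regions, [0, 1, 0, 1])
print(consistent, inconsistent)
```

The loss is small when the embedding clusters match the image grouping and large when they do not, which is the signal that would push an encoder toward image-consistent style features.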

[268] MILO: A Lightweight Perceptual Quality Metric for Image and Latent-Space Optimization

Uğur Çoğalan,Mojtaba Bemana,Karol Myszkowski,Hans-Peter Seidel,Colin Groth

Main category: cs.CV

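TL;DR: MILO is a lightweight, multiscale perceptual image quality metric that also serves as a practical tool for perceptual optimization in generative models, improving task performance while reducing computational overhead.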

Details

Motivation: Existing image quality metrics typically rely on large-scale human-labeled data and are slow at inference in real-time applications; generative models also lack effective tools for perceptual optimization. MILO aims to address these problems with a lightweight, multiscale perceptual metric suitable for optimization in both image and latent space. Method: MILO is trained with pseudo-MOS supervision, applying reproducible distortions and scoring them with an ensemble of recent quality metrics, and uses a compact architecture for efficient inference. In addition, MILO's spatial masking is applied to latent representations from a VAE encoder and combined with a curriculum learning strategy for optimization. Result: MILO outperforms existing metrics on standard FR-IQA benchmarks, yields significant improvements on tasks such as denoising, super-resolution, and face restoration, and supports real-time applications. Conclusion: MILO is both a state-of-the-art image quality metric and a practical tool for perceptual optimization in generative models, improving performance on denoising, super-resolution, and face restoration while reducing computational overhead. Abstract: We present MILO (Metric for Image- and Latent-space Optimization), a lightweight, multiscale, perceptual metric for full-reference image quality assessment (FR-IQA). MILO is trained using pseudo-MOS (Mean Opinion Score) supervision, in which reproducible distortions are applied to diverse images and scored via an ensemble of recent quality metrics that account for visual masking effects. This approach enables accurate learning without requiring large-scale human-labeled datasets. Despite its compact architecture, MILO outperforms existing metrics across standard FR-IQA benchmarks and offers fast inference suitable for real-time applications. Beyond quality prediction, we demonstrate the utility of MILO as a perceptual loss in both image and latent domains. In particular, we show that spatial masking modeled by MILO, when applied to latent representations from a VAE encoder within Stable Diffusion, enables efficient and perceptually aligned optimization. By combining spatial masking with a curriculum learning strategy, we first process perceptually less relevant regions before progressively shifting the optimization to more visually distorted areas. This strategy leads to significantly improved performance in tasks like denoising, super-resolution, and face restoration, while also reducing computational overhead. MILO thus functions as both a state-of-the-art image quality metric and as a practical tool for perceptual optimization in generative pipelines.

[269] Bangladeshi Street Food Calorie Estimation Using Improved YOLOv8 and Regression Model

Aparup Dhar,MD Tamim Hossain,Pritom Barua

Main category: cs.CV

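TL;DR: This paper proposes an automated calorie tracking solution for Bangladeshi street food, achieving efficient calorie estimation by modifying the state-of-the-art vision model YOLOv8.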

Details

Motivation: As obesity rates rise, automated calorie tracking has become an important tool for maintaining a healthy lifestyle. Existing methods, however, have several limitations: they provide only constant caloric output, struggle to recognize multiple foods, have difficulty with image scaling and normalization, and focus mainly on Western cuisines. Method: The authors first build a diverse dataset of popular street foods from across Bangladesh, then develop a refined calorie estimation system by modifying YOLOv8 and coupling it with a machine learning regression model for calorie prediction. Result: The modified model performs well on classification and segmentation with only a slight increase in computational complexity. Calorie estimation reaches a mean absolute error (MAE) of 6.94, a root mean squared error (RMSE) of 11.03, and an R² score of 96.0%. Conclusion: The system is efficient and accurate in practical use and offers an effective solution for estimating the calories of Bangladeshi street food. Abstract: As obesity rates continue to increase, automated calorie tracking has become a vital tool for people seeking to maintain a healthy lifestyle or adhere to a diet plan. Although numerous research efforts have addressed this issue, existing approaches often face key limitations, such as providing only constant caloric output, struggling to recognize multiple foods, difficulties in image scaling and normalization, and a predominant focus on Western cuisines. In this paper, we propose a tailored solution that specifically targets Bangladeshi street food. We first construct a diverse dataset of popular street foods found across Bangladesh. Then, we develop a refined calorie estimation system by modifying the state-of-the-art vision model YOLOv8. Our modified model achieves superior classification and segmentation results, with only a slight increase in computational complexity compared to the base variant. Coupled with a machine learning regression model, our system achieves an impressive 6.94 mean absolute error (MAE), 11.03 root mean squared error (RMSE), and a 96.0% R^2 score in calorie estimation, making it both highly effective and accurate for real-world food calorie calculations.
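
The three regression metrics reported for calorie estimation (MAE, RMSE, and R²) have standard definitions, sketched below in plain Python. The calorie values in the example are made up for illustration and are unrelated to the paper's data; `regression_metrics` is a hypothetical helper name.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Made-up calorie values: every prediction is off by exactly 10 kcal.
true_kcal = [100.0, 200.0, 300.0, 400.0]
pred_kcal = [110.0, 190.0, 310.0, 390.0]
mae, rmse, r2 = regression_metrics(true_kcal, pred_kcal)
print(mae, rmse, r2)
```

RMSE is never smaller than MAE and exceeds it when a few errors are much larger than the rest, so the reported gap between 6.94 and 11.03 suggests that some items are estimated considerably worse than the average.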

[270] InfoScale: Unleashing Training-free Variable-scaled Image Generation via Effective Utilization of Information

Guohui Zhang,Jiangtong Tan,Linjiang Huang,Zhonghang Yuan,Naishan Zheng,Jie Huang,Feng Zhao

Main category: cs.CV

TL;DR: This paper proposes InfoScale, a framework that enhances diffusion models' ability to generate images at various resolutions by addressing information loss, aggregation inflexibility, and noise misalignment.

Details

Motivation: Diffusion models suffer performance drops when generating images at resolutions different from the training scale due to issues in information conversion. This work aims to unify variable-scale generation by addressing these key limitations. Method: The authors propose InfoScale, a framework that introduces three modules: Progressive Frequency Compensation to recover high-frequency details, Adaptive Information Aggregation to balance local and global information, and Noise Adaptation to align initial noise with target resolutions. Result: InfoScale demonstrates effectiveness in generating images at variable scales, showing improved performance across extensive experiments while being compatible with existing diffusion models. Conclusion: InfoScale effectively addresses the challenge of variable-scaled image generation in diffusion models by compensating for high-frequency information loss, enabling adaptive information aggregation, and aligning the initial noise distribution. Abstract: Diffusion models (DMs) have become dominant in visual generation but suffer a performance drop when tested on resolutions that differ from the training scale, whether lower or higher. In fact, the key challenge in generating variable-scale images lies in the differing amounts of information across resolutions, which requires information conversion procedures to be varied for generating variable-scaled images. In this paper, we investigate the issues of three critical aspects in DMs for a unified analysis in variable-scaled generation: dilated convolution, attention mechanisms, and initial noise. Specifically, 1) dilated convolution in DMs for the higher-resolution generation loses high-frequency information. 2) Attention for variable-scaled image generation struggles to adjust the information aggregation adaptively. 3) The spatial distribution of information in the initial noise is misaligned with variable-scaled images. 
To solve the above problems, we propose InfoScale, an information-centric framework for variable-scaled image generation by effectively utilizing information from three aspects correspondingly. For information loss in 1), we introduce the Progressive Frequency Compensation module to compensate for high-frequency information lost by dilated convolution in higher-resolution generation. For information aggregation inflexibility in 2), we introduce the Adaptive Information Aggregation module to adaptively aggregate information in lower-resolution generation and achieve an effective balance between local and global information in higher-resolution generation. For information distribution misalignment in 3), we design the Noise Adaptation module to re-distribute information in initial noise for variable-scaled generation. Our method is plug-and-play for DMs and extensive experiments demonstrate the effectiveness in variable-scaled image generation.

[271] Mamba-CNN: A Hybrid Architecture for Efficient and Accurate Facial Beauty Prediction

Djamel Eddine Boukhari

Main category: cs.CV

TL;DR: This paper proposes Mamba-CNN, a new efficient hybrid architecture for the computational assessment of facial attractiveness.

Details

Motivation: To address the limitations of convolutional neural networks and Vision Transformers in the computational assessment of facial attractiveness. Method: A Mamba-inspired State Space Model gating mechanism is integrated into a hierarchical convolutional backbone. Result: On the SCUT-FBP5500 benchmark, the model achieves a Pearson correlation of 0.9187, a mean absolute error of 0.2022, and a root mean square error of 0.2610. Conclusion: Combining convolutional neural networks with selective state space models has synergistic potential, and the work presents a new architectural paradigm for nuanced visual understanding tasks. Abstract: The computational assessment of facial attractiveness, a challenging subjective regression task, is dominated by architectures with a critical trade-off: Convolutional Neural Networks (CNNs) offer efficiency but have limited receptive fields, while Vision Transformers (ViTs) model global context at a quadratic computational cost. To address this, we propose Mamba-CNN, a novel and efficient hybrid architecture. Mamba-CNN integrates a lightweight, Mamba-inspired State Space Model (SSM) gating mechanism into a hierarchical convolutional backbone. This core innovation allows the network to dynamically modulate feature maps and selectively emphasize salient facial features and their long-range spatial relationships, mirroring human holistic perception while maintaining computational efficiency. We conducted extensive experiments on the widely-used SCUT-FBP5500 benchmark, where our model sets a new state-of-the-art. Mamba-CNN achieves a Pearson Correlation (PC) of 0.9187, a Mean Absolute Error (MAE) of 0.2022, and a Root Mean Square Error (RMSE) of 0.2610. Our findings validate the synergistic potential of combining CNNs with selective SSMs and present a powerful new architectural paradigm for nuanced visual understanding tasks.
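
The headline metric here, the Pearson correlation between predicted and human-rated scores, measures linear agreement rather than absolute error. The sketch below implements the standard formula; the scores are made-up values on a 1 to 5 scale, and `pearson` is a hypothetical helper name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up ratings: predictions are shifted by +0.5 but perfectly ordered.
human = [1.0, 2.0, 3.0, 4.0]
shifted = [1.5, 2.5, 3.5, 4.5]
reversed_scores = [4.0, 3.0, 2.0, 1.0]
print(pearson(human, shifted), pearson(human, reversed_scores))
```

A constant offset leaves the correlation at 1 even though every prediction is wrong by 0.5, which is why such results are usually reported alongside MAE and RMSE, as in this abstract.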

[272] SoccerHigh: A Benchmark Dataset for Automatic Soccer Video Summarization

Artur Díaz-Juan,Coloma Ballester,Gloria Haro

Main category: cs.CV

TL;DR: This paper introduces a curated soccer video summarization dataset with shot boundaries for 237 matches and proposes a baseline model achieving an F1 score of 0.3956, along with a new summary length-constrained evaluation metric.

Details

Motivation: The lack of publicly available datasets for sports highlight generation motivated the creation of a benchmark dataset and model to support robust automatic video summarization in soccer. Method: The paper introduces a curated dataset for soccer video summarization, derived from the SoccerNet dataset, and proposes a baseline model tailored for this task. Additionally, a new evaluation metric constrained by summary length is introduced. Result: The dataset includes shot boundaries for 237 matches from the Spanish, French, and Italian leagues. The baseline model achieved an F1 score of 0.3956 on the test set, and a new evaluation metric was successfully introduced. Conclusion: The paper concludes that the proposed dataset and baseline model provide a solid foundation for future research in soccer video summarization, as demonstrated by the F1 score of 0.3956 and the new evaluation metric. Abstract: Video summarization aims to extract key shots from longer videos to produce concise and informative summaries. One of its most common applications is in sports, where highlight reels capture the most important moments of a game, along with notable reactions and specific contextual events. Automatic summary generation can support video editors in the sports media industry by reducing the time and effort required to identify key segments. However, the lack of publicly available datasets poses a challenge in developing robust models for sports highlight generation. In this paper, we address this gap by introducing a curated dataset for soccer video summarization, designed to serve as a benchmark for the task. The dataset includes shot boundaries for 237 matches from the Spanish, French, and Italian leagues, using broadcast footage sourced from the SoccerNet dataset. Alongside the dataset, we propose a baseline model specifically designed for this task, which achieves an F1 score of 0.3956 in the test set. 
Furthermore, we propose a new metric constrained by the length of each target summary, enabling a more objective evaluation of the generated content. The dataset and code are available at https://ipcv.github.io/SoccerHigh/.

[273] Reinforced Visual Perception with Tools

Zetong Zhou,Dongping Chen,Zixian Ma,Zhihan Hu,Mingyang Fu,Sinan Wang,Yao Wan,Zhou Zhao,Ranjay Krishna

Main category: cs.CV

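TL;DR: ReVPT uses reinforcement learning to enhance the visual reasoning abilities of multi-modal LLMs and shows superior performance on several benchmarks.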

Details

Motivation: Although advances in computer vision have produced powerful models for various perceptual tasks, leveraging them for general visual reasoning remains challenging. Method: The authors propose a novel RL algorithm based on GRPO, designed to train models to reason with four visual tools. Result: In extensive experiments, the method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK, and MMStar. Conclusion: ReVPT is a new approach that uses reinforcement learning to enhance multi-modal LLMs' abilities to reason about and use visual tools, addressing key limitations of prior methods. Abstract: Visual reasoning, a cornerstone of human intelligence, encompasses complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various perceptual tasks, leveraging these for general visual reasoning remains challenging. Prior work demonstrates that augmenting LLMs with vision models via supervised finetuning improves performance, but faces key limitations such as expensive data generation, reliance on careful data filtering, and poor generalization. To address these issues, we propose ReVPT to enhance multi-modal LLMs' abilities to reason about and use visual tools through reinforcement learning. We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. Through extensive experiments, we show that our method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK and MMStar, significantly outperforming the supervised and text-based RL finetuning baselines. Notably, our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench. Finally, we bring to the community new insights on RL-based visual tool-usage through extensive ablations. Our code is available at https://github.com/ls-kelvin/REVPT.

[274] Traces of Image Memorability in Vision Encoders: Activations, Attention Distributions and Autoencoder Losses

Ece Takmaz,Albert Gatt,Jakub Dotlacil

Main category: cs.CV

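TL;DR: This paper studies how the memorability of images to humans is reflected in pretrained vision encoders and finds that features such as sparse autoencoder loss can serve as useful predictors of image memorability.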

Details

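Motivation: Images differ in how memorable they are to humans; this paper explores how that variation is reflected in pretrained vision encoders and which factors correlate with it. Method: Inspired by findings from cognitive science and computer vision, the paper examines correlates of image memorability in pretrained vision encoders, focusing on latent activations, attention distributions, and the uniformity of image patches, and uses sparse autoencoder loss as a proxy for memorability. Result: Sparse autoencoder loss over vision transformer representations, used as a memorability proxy, outperforms past methods based on convolutional neural network representations. Some model-internal features correlate with memorability to some extent. Conclusion: There is a relationship between model-internal features and the memorability of images to humans, and some of these features are informative predictors of memorability. Abstract: Images vary in how memorable they are to humans. Inspired by findings from cognitive science and computer vision, this paper explores the correlates of image memorability in pretrained vision encoders, focusing on latent activations, attention distributions, and the uniformity of image patches. We find that these features correlate with memorability to some extent. Additionally, we explore sparse autoencoder loss over the representations of vision transformers as a proxy for memorability, which yields results outperforming past methods using convolutional neural network representations. Our results shed light on the relationship between model-internal features and memorability. They show that some features are informative predictors of what makes images memorable to humans.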

[275] RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Zhenyuan Chen,Chenxi Wang,Ningyu Zhang,Feng Zhang

Main category: cs.CV

TL;DR: The study introduces the RSCC dataset, which provides 62,315 pre-/post-disaster image pairs with detailed textual captions to improve vision-language models for disaster monitoring.

Details

Motivation: Existing remote sensing datasets lack temporal image pairs and detailed textual annotations, limiting their ability to capture dynamic disaster impacts over time. Method: The authors introduced the Remote Sensing Change Caption (RSCC) dataset, comprising 62,315 pre-/post-disaster image pairs with rich textual annotations, to address the lack of temporal image pairs and detailed annotations in existing datasets. Result: RSCC enables detailed disaster-related analysis and facilitates the development of more accurate, interpretable, and scalable vision-language applications in remote sensing. Conclusion: RSCC dataset bridges the temporal and semantic divide in remote sensing data, enabling robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Abstract: Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,315 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC's ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.

[276] Im2Haircut: Single-view Strand-based Hair Reconstruction for Human Avatars

Vanessa Sklyarova,Egor Zakharov,Malte Prinzler,Giorgio Becherini,Michael J. Black,Justus Thies

Main category: cs.CV

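TL;DR: This paper proposes a hairstyle prior trained on both real and synthetic data and combines it with Gaussian-splatting-based reconstruction to generate high-quality 3D hair from a single image, addressing the inability of classical methods to capture the inner structure of hairstyles.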

Details

Motivation: Reconstructing strand-based hair geometry from a single photograph is challenging because of the variety and geometric complexity of hairstyles and the lack of ground truth training data. Existing methods rely on hairstyle priors trained on synthetic data, which is limited in both quantity and quality, so a more effective approach is needed. Method: The method trains a transformer-based prior model and combines it with Gaussian-splatting-based reconstruction, using synthetic data to learn the internal hairstyle structure and introducing real data to model the outer structure. Result: Qualitative and quantitative comparisons with existing reconstruction methods show superior performance in capturing hair orientation, overall silhouette, and backside consistency. Conclusion: The paper proposes a novel hairstyle prior learned from real and synthetic data that, by combining a global prior with local optimization, reconstructs high-quality 3D hair geometry from a single photograph. Abstract: We present a novel approach for 3D hair reconstruction from single photographs based on a global hair prior combined with local optimization. Capturing strand-based hair geometry from single photographs is challenging due to the variety and geometric complexity of hairstyles and the lack of ground truth training data. Classical reconstruction methods like multi-view stereo only reconstruct the visible hair strands, missing the inner structure of hairstyles and hampering realistic hair simulation. To address this, existing methods leverage hairstyle priors trained on synthetic data. Such data, however, is limited in both quantity and quality since it requires manual work from skilled artists to model the 3D hairstyles and create near-photorealistic renderings. To address this, we propose a novel approach that uses both real and synthetic data to learn an effective hairstyle prior. Specifically, we train a transformer-based prior model on synthetic data to obtain knowledge of the internal hairstyle geometry and introduce real data in the learning process to model the outer structure. This training scheme is able to model the visible hair strands depicted in an input image, while preserving the general 3D structure of hairstyles. We exploit this prior to create a Gaussian-splatting-based reconstruction method that creates hairstyles from one or more images. Qualitative and quantitative comparisons with existing reconstruction pipelines demonstrate the effectiveness and superior performance of our method for capturing detailed hair orientation, overall silhouette, and backside consistency. 
For additional results and code, please refer to https://im2haircut.is.tue.mpg.de.

[277] Understanding Space Is Rocket Science - Only Top Reasoning Models Can Solve Spatial Understanding Tasks

Nils Hoehing,Mayug Maniparambil,Ellen Rushe,Noel E. O'Connor,Anthony Ventresque

Main category: cs.CV

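TL;DR: RocketScience is a new vision-language model benchmark that reveals models' shortcomings in spatial relation understanding and finds that reasoning models outperform conventional VLMs.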

Details

Motivation: To test how well current VLMs understand spatial relations by designing a benchmark that is easy for humans but hard for models. Method: The authors build RocketScience, a new contrastive VLM benchmark, and perform a disentanglement analysis that separates object localization from spatial reasoning. Result: Experiments show that current VLMs perform poorly on spatial relation understanding while reasoning models perform well, and that performance is bottlenecked by spatial reasoning. Conclusion: RocketScience confirms that current vision-language models (VLMs) fall short in spatial relation understanding, finds that reasoning models stand out, and shows that the bottleneck lies mainly in spatial reasoning rather than object localization. Abstract: We propose RocketScience, an open-source contrastive VLM benchmark that tests for spatial relation understanding. It is comprised of entirely new real-world image-text pairs covering mostly relative spatial understanding and the order of objects. The benchmark is designed to be very easy for humans and hard for the current generation of VLMs, and this is empirically verified. Our results show a striking lack of spatial relation understanding in open source and frontier commercial VLMs and a surprisingly high performance of reasoning models. Additionally, we perform a disentanglement analysis to separate the contributions of object localization and spatial reasoning in chain-of-thought-based models and find that the performance on the benchmark is bottlenecked by spatial reasoning and not object localization capabilities. We release the dataset with a CC-BY-4.0 license and make the evaluation code available at: https://github.com/nilshoehing/rocketscience

[278] PointSlice: Accurate and Efficient Slice-Based Representation for 3D Object Detection from Point Clouds

Liu Qifeng,Zhao Dawei,Dong Yabo,Xiao Liang,Wang Juan,Min Chen,Li Fuyang,Jiang Weizhong,Lu Dongming,Nie Yiming

Main category: cs.CV

TL;DR: PointSlice is a fast and efficient method for 3D object detection from point clouds that slices data into 2D and uses a dedicated network to maintain 3D perception, achieving strong performance across multiple datasets.

Details

Motivation: The motivation behind PointSlice is to address the limitations of existing point cloud processing methods, specifically the trade-off between high accuracy with slower inference speeds in voxel-based methods and lower accuracy in pillar-based methods. Method: The PointSlice method slices point clouds along the horizontal plane into 2D data slices, processes them using a 2D backbone network, and introduces a Slice Interaction Network (SIN) to maintain vertical relationships across slices for improved 3D object perception. Result: PointSlice achieves high detection accuracy and inference speed, as demonstrated on multiple datasets. On the Waymo dataset, it is 1.13x faster with 0.79x fewer parameters compared to SAFDNet, and on the nuScenes dataset, it achieves a state-of-the-art result of 66.74 mAP. Conclusion: PointSlice is a novel method for 3D object detection from point clouds that balances accuracy and inference speed by slicing point clouds into 2D data and incorporating a Slice Interaction Network. Abstract: 3D object detection from point clouds plays a critical role in autonomous driving. Currently, the primary methods for point cloud processing are voxel-based and pillar-based approaches. Voxel-based methods offer high accuracy through fine-grained spatial segmentation but suffer from slower inference speeds. Pillar-based methods enhance inference speed but still fall short of voxel-based methods in accuracy. To address these issues, we propose a novel point cloud processing method, PointSlice, which slices point clouds along the horizontal plane and includes a dedicated detection network. The main contributions of PointSlice are: (1) A new point cloud processing technique that converts 3D point clouds into multiple sets of 2D (x-y) data slices. 
The model only learns 2D data distributions, treating the 3D point cloud as separate batches of 2D data, which reduces the number of model parameters and enhances inference speed; (2) The introduction of a Slice Interaction Network (SIN). To maintain vertical relationships across slices, we incorporate SIN into the 2D backbone network, which improves the model's 3D object perception capability. Extensive experiments demonstrate that PointSlice achieves high detection accuracy and inference speed. On the Waymo dataset, PointSlice is 1.13x faster and has 0.79x fewer parameters than the state-of-the-art voxel-based method (SAFDNet), with only a 1.2 mAPH accuracy reduction. On the nuScenes dataset, we achieve a state-of-the-art detection result of 66.74 mAP. On the Argoverse 2 dataset, PointSlice is 1.10x faster, with 0.66x fewer parameters and a 1.0 mAP accuracy reduction. The code will be available at https://github.com/qifeng22/PointSlice2.
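
The abstract describes converting a 3D point cloud into multiple sets of 2D (x-y) slices but does not say how the slices are formed. The sketch below shows one simple way this could look, assuming uniform height bins along z; the function name, the bin boundaries, and the choice to drop out-of-range points are illustrative assumptions, not the PointSlice implementation.

```python
def slice_point_cloud(points, z_min, z_max, num_slices):
    """Group (x, y, z) points into horizontal slices that keep only (x, y)."""
    height = (z_max - z_min) / num_slices  # assumed uniform slice thickness
    slices = [[] for _ in range(num_slices)]
    for x, y, z in points:
        if not (z_min <= z < z_max):
            continue  # drop points outside the assumed vertical range
        index = min(int((z - z_min) / height), num_slices - 1)
        slices[index].append((x, y))
    return slices


# Hypothetical points; the last one lies above the vertical range.
points = [(1.0, 2.0, -0.5), (3.0, 4.0, 0.2), (5.0, 6.0, 1.7), (7.0, 8.0, 9.0)]
slices = slice_point_cloud(points, z_min=-1.0, z_max=3.0, num_slices=4)
```

Each slice can then be treated as an independent 2D sample, which is the property the abstract credits for the smaller parameter count. Flattening discards the vertical ordering between slices, which is the gap the Slice Interaction Network is described as closing.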

[279] A Continuous-Time Consistency Model for 3D Point Cloud Generation

Sebastian Eilermann,René Heesch,Oliver Niggemann

Main category: cs.CV

TL;DR: ConTiCoM-3D is a fast and effective 3D shape generation method that operates directly in point space using continuous-time modeling, outperforming existing diffusion and latent consistency models.

Details

Motivation: Fast and accurate 3D shape generation from point clouds is crucial for robotics, AR/VR, and digital content creation. Method: ConTiCoM-3D uses a continuous-time consistency model with a TrigFlow-inspired noise schedule and Chamfer Distance-based loss, operating entirely in point space without discretized steps or pre-trained models. Result: Experiments on ShapeNet show ConTiCoM-3D achieves high geometric fidelity and efficiency, outperforming previous methods in generation speed and quality. Conclusion: ConTiCoM-3D is a practical framework for scalable 3D shape generation that matches or outperforms state-of-the-art models in quality and efficiency. Abstract: Fast and accurate 3D shape generation from point clouds is essential for applications in robotics, AR/VR, and digital content creation. We introduce ConTiCoM-3D, a continuous-time consistency model that synthesizes 3D shapes directly in point space, without discretized diffusion steps, pre-trained teacher models, or latent-space encodings. The method integrates a TrigFlow-inspired continuous noise schedule with a Chamfer Distance-based geometric loss, enabling stable training on high-dimensional point sets while avoiding expensive Jacobian-vector products. This design supports efficient one- to two-step inference with high geometric fidelity. In contrast to previous approaches that rely on iterative denoising or latent decoders, ConTiCoM-3D employs a time-conditioned neural network operating entirely in continuous time, thereby achieving fast generation. Experiments on the ShapeNet benchmark show that ConTiCoM-3D matches or outperforms state-of-the-art diffusion and latent consistency models in both quality and efficiency, establishing it as a practical framework for scalable 3D shape generation.
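
For readers unfamiliar with the Chamfer Distance mentioned above, the following is a minimal sketch of one common variant: the mean squared nearest-neighbor distance, summed over both directions. Whether ConTiCoM-3D uses squared distances, means, or sums is not stated in the abstract, so those choices are assumptions here.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (squared, mean-reduced)."""

    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_sided(src, dst):
        # Average distance from each source point to its nearest target point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(a, b) + one_sided(b, a)


# Hypothetical two-point clouds that differ in one point.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(cloud_a, cloud_b)
```

This brute-force version costs O(|a| * |b|) and is only meant to show the definition; practical implementations batch the nearest-neighbor search on a GPU.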

[280] MSA2-Net: Utilizing Self-Adaptive Convolution Module to Extract Multi-Scale Information in Medical Image Segmentation

Chao Deng,Xiaosen Li,Xiao Qin

Main category: cs.CV

TL;DR: This study introduces a Self-Adaptive Convolution Module to enhance the generalization of the MSA2-Net for medical image segmentation, achieving high Dice coefficient scores across various datasets.

Details

Motivation: The motivation is to overcome the limitation of the nnUNet segmentation framework, which does not tune internal hyperparameters within the segmentation network, thereby constraining the model's ability to generalize. Method: The Self-Adaptive Convolution Module dynamically adjusts the size of the convolution kernels based on different datasets, and it is integrated into the Multi-Scale Convolution Bridge and the Multi-Scale Amalgamation Decoder of the MSA2-Net. Result: MSA2-Net achieved Dice coefficient scores of 86.49%, 92.56%, 93.37%, and 92.98% on the Synapse, ACDC, Kvasir, and Skin Lesion Segmentation (ISIC2017) datasets, respectively. Conclusion: MSA2-Net, bolstered by the Self-Adaptive Convolution Module, demonstrates robustness and precision in medical image segmentation tasks across various datasets. Abstract: The nnUNet segmentation framework adeptly adjusts most hyperparameters in training scripts automatically, but it overlooks the tuning of internal hyperparameters within the segmentation network itself, which constrains the model's ability to generalize. Addressing this limitation, this study presents a novel Self-Adaptive Convolution Module that dynamically adjusts the size of the convolution kernels depending on the unique fingerprints of different datasets. This adjustment enables the MSA2-Net, when equipped with this module, to proficiently capture both global and local features within the feature maps. Self-Adaptive Convolution Module is strategically integrated into two key components of the MSA2-Net: the Multi-Scale Convolution Bridge and the Multi-Scale Amalgamation Decoder. In the MSConvBridge, the module enhances the ability to refine outputs from various stages of the CSWin Transformer during the skip connections, effectively eliminating redundant data that could potentially impair the decoder's performance. 
Simultaneously, the MSADecoder, utilizing the module, excels in capturing detailed information of organs varying in size during the decoding phase. This capability ensures that the decoder's output closely reproduces the intricate details within the feature maps, thus yielding highly accurate segmentation images. MSA2-Net, bolstered by this advanced architecture, has demonstrated exceptional performance, achieving Dice coefficient scores of 86.49%, 92.56%, 93.37%, and 92.98% on the Synapse, ACDC, Kvasir, and Skin Lesion Segmentation (ISIC2017) datasets, respectively. This underscores MSA2-Net's robustness and precision in medical image segmentation tasks across various datasets.
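
The scores above are Dice coefficients. As a reminder of what that metric measures, here is a minimal sketch for binary masks given as flat 0/1 lists; the convention of returning 1.0 when both masks are empty is an assumption, and the paper's evaluation code may handle that case differently.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flat 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 4-pixel masks that overlap in exactly one pixel.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

A reported score such as 86.49% corresponds to a value of 0.8649 on this 0-to-1 scale.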

[281] Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

Junjie Chen,Xuyang Liu,Zichen Wen,Yiyu Wang,Siteng Huang,Honggang Chen

Main category: cs.CV

TL;DR: This paper proposes V$^2$Drop, a new token compression method for LVLMs that improves computational efficiency while maintaining high performance in image and video understanding tasks.

Details

Motivation: Existing inner-LLM token compression methods suffer from positional bias and incompatibility with efficient operators, which limits their practical deployment for LVLM acceleration. Method: The paper proposes Variation-aware Vision Token Dropping (V$^2$Drop), which progressively removes visual tokens with minimal variation during LVLM inference. Result: Experiments show that V$^2$Drop maintains 94.0% and 98.6% of the original model performance for image and video understanding tasks, while reducing LLM generation latency by 31.5% and 74.2%. Conclusion: This paper concludes that V$^2$Drop is an effective token compression method that improves computational efficiency while maintaining high performance in image and video understanding tasks. Abstract: Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V$^2$Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. 
Extensive experiments across multiple models and benchmarks demonstrate that our V$^2$Drop is able to maintain 94.0% and 98.6% of the original model performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%. When combined with efficient operators, V$^2$Drop further reduces GPU peak memory usage.
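
The abstract says V$^2$Drop removes the visual tokens with minimal variation but does not define how variation is measured. The sketch below illustrates the general idea under a loud assumption: variation is taken to be the L2 distance between a token's hidden state at two consecutive layers, and a fixed fraction of tokens is kept. The function name, `keep_ratio`, and the toy hidden states are all hypothetical.

```python
import math


def select_tokens_to_keep(prev_states, curr_states, keep_ratio):
    """Return indices of the tokens whose hidden states changed the most."""
    # Assumed variation measure: L2 distance between consecutive layers.
    variation = [math.dist(p, c) for p, c in zip(prev_states, curr_states)]
    num_keep = max(1, round(len(variation) * keep_ratio))
    ranked = sorted(range(len(variation)), key=lambda i: variation[i], reverse=True)
    return sorted(ranked[:num_keep])  # restore the original token order


# Hypothetical 2-D hidden states for four visual tokens at two layers.
prev_states = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
curr_states = [(0.0, 0.1), (2.0, 2.0), (2.0, 2.0), (3.0, 5.0)]
kept = select_tokens_to_keep(prev_states, curr_states, keep_ratio=0.5)
```

A score of this kind depends only on hidden states rather than attention maps, which may be why the abstract describes the method as compatible with efficient operators; that reading is an inference, not a claim from the paper.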

[282] Unified Supervision For Vision-Language Modeling in 3D Computed Tomography

Hao-Chih Lee,Zelong Liu,Hamza Ahmed,Spencer Kim,Sean Huver,Vishwesh Nath,Zahi A. Fayad,Timothy Deyer,Xueyan Mei

Main category: cs.CV

TL;DR: Uniferum is a volumetric vision-language model (VLM) that unifies diverse supervision signals from classification labels and segmentation masks into a single training framework, achieving state-of-the-art results and robust generalization in 3D medical imaging.

Details

Motivation: General-purpose vision-language models (VLMs) lack discriminative precision required for reliable clinical use in diagnostic radiology due to scarcity and heterogeneity of public volumetric CT datasets. Method: Uniferum harmonizes three public 3D CT datasets with distinct annotations into a single training framework, unifying diverse supervision signals encoded in classification labels and segmentation masks. Result: Uniferum achieves state-of-the-art performance, improving AUROC on the CT-RATE benchmark by 7% compared to CLIP-based and conventional multi-label convolutional models. It also demonstrates robust out-of-distribution generalization with unexpected zero-shot performance on the RAD-CHEST and INSPECT datasets. Conclusion: Uniferum successfully integrates heterogeneous annotations and body segmentation to enhance model performance, setting a new direction for clinically reliable, data-efficient VLMs in 3D medical imaging. Abstract: General-purpose vision-language models (VLMs) have emerged as promising tools in radiology, offering zero-shot capabilities that mitigate the need for large labeled datasets. However, in high-stakes domains like diagnostic radiology, these models often lack the discriminative precision required for reliable clinical use. This challenge is compounded by the scarcity and heterogeneity of publicly available volumetric CT datasets, which vary widely in annotation formats and granularity. To address these limitations, we introduce Uniferum, a volumetric VLM that unifies diverse supervision signals, encoded in classification labels and segmentation masks, into a single training framework. By harmonizing three public 3D CT datasets with distinct annotations, Uniferum achieves state-of-the-art performance, improving AUROC on the CT-RATE benchmark by 7% compared to CLIP-based and conventional multi-label convolutional models. 
The model demonstrates robust out-of-distribution generalization, with observed evidence of unexpected zero-shot performance on the RAD-CHEST and INSPECT datasets. Our results highlight the effectiveness of integrating heterogeneous annotations and body segmentation to enhance model performance, setting a new direction for clinically reliable, data-efficient VLMs in 3D medical imaging.

[283] Acoustic Interference Suppression in Ultrasound images for Real-Time HIFU Monitoring Using an Image-Based Latent Diffusion Model

Dejia Cai,Yao Ran,Kun Yang,Xinwang Shi,Yingying Zhou,Kexian Wu,Yang Xu,Yi Hu,Xiaowei Zhou

Main category: cs.CV

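TL;DR: HIFU-ILDiff is a novel deep learning method that effectively suppresses HIFU interference and runs in real time, significantly improving the monitoring and precision of HIFU treatment.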

Details

Motivation: The success and safety of HIFU treatment depend on real-time monitoring, which is hindered by interference when ultrasound is used to guide the treatment. A new method is therefore needed to suppress HIFU-induced interference. Method: HIFU-ILDiff uses a Vector Quantized Variational Autoencoder (VQ-VAE) to encode noisy ultrasound images into a lower-dimensional latent space, applies a latent diffusion model to iteratively remove interference, and then decodes the result to reconstruct high-resolution, interference-free ultrasound images. Result: Experiments show that HIFU-ILDiff significantly outperforms the Notch Filter method, reaching an SSIM of 0.796 and a PSNR of 23.780 in in vitro scenarios, compared with an SSIM of 0.443 and a PSNR of 14.420 for the Notch Filter. HIFU-ILDiff also processes images in real time at 15 frames per second, markedly faster than the Notch Filter's 5 seconds per frame. Conclusion: HIFU-ILDiff is a novel deep learning method that uses a latent diffusion model to suppress HIFU interference, significantly outperforms the conventional Notch Filter method, and runs in real time. Abstract: High-Intensity Focused Ultrasound (HIFU) is a non-invasive therapeutic technique widely used for treating various diseases. However, the success and safety of HIFU treatments depend on real-time monitoring, which is often hindered by interference when using ultrasound to guide HIFU treatment. To address these challenges, we developed HIFU-ILDiff, a novel deep learning-based approach leveraging latent diffusion models to suppress HIFU-induced interference in ultrasound images. The HIFU-ILDiff model employs a Vector Quantized Variational Autoencoder (VQ-VAE) to encode noisy ultrasound images into a lower-dimensional latent space, followed by a latent diffusion model that iteratively removes interference. The denoised latent vectors are then decoded to reconstruct high-resolution, interference-free ultrasound images. We constructed a comprehensive dataset comprising 18,872 image pairs from in vitro phantoms, ex vivo tissues, and in vivo animal data across multiple imaging modalities and HIFU power levels to train and evaluate the model. Experimental results demonstrate that HIFU-ILDiff significantly outperforms the commonly used Notch Filter method, achieving a Structural Similarity Index (SSIM) of 0.796 and Peak Signal-to-Noise Ratio (PSNR) of 23.780 compared to SSIM of 0.443 and PSNR of 14.420 for the Notch Filter under in vitro scenarios. 
Additionally, HIFU-ILDiff achieves real-time processing at 15 frames per second, markedly faster than the Notch Filter's 5 seconds per frame. These findings indicate that HIFU-ILDiff is able to denoise HIFU interference in ultrasound guiding images for real-time monitoring during HIFU therapy, which will greatly improve the treatment precision in current clinical applications.
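
The PSNR figures quoted above follow the standard definition, 10 * log10(MAX^2 / MSE). A minimal sketch for flat lists of pixel intensities is shown below; the 8-bit peak value of 255 is an assumption, since the abstract does not state the dynamic range used.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2-pixel images: one pixel matches, one is maximally wrong.
value = psnr([0.0, 255.0], [0.0, 0.0])
```

Because the scale is logarithmic, the reported gap between 14.420 and 23.780 corresponds to roughly an order of magnitude lower mean squared error.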

[284] Kwai Keye-VL 1.5 Technical Report

Biao Yang,Bin Wen,Boyang Ding,Changyi Liu,Chenglong Chu,Chengru Song,Chongling Rao,Chuan Yi,Da Li,Dunju Zang,Fan Yang,Guorui Zhou,Guowang Zhang,Han Shen,Hao Peng,Haojie Ding,Hao Wang,Hengrui Ju,Jiaming Huang,Jiangxia Cao,Jiankang Chen,Jingyun Hua,Kaibing Chen,Kaiyu Jiang,Kaiyu Tang,Kun Gai,Muhao Wei,Qiang Wang,Ruitao Wang,Sen Na,Shengnan Zhang,Siyang Mao,Sui Huang,Tianke Zhang,Tingting Gao,Wei Chen,Wei Yuan,Xiangyu Wu,Xiao Hu,Xingyu Lu,Yi-Fan Zhang,Yiping Yang,Yulong Chen,Zeyi Lu,Zhenhua Wu,Zhixin Ling,Zhuoran Yang,Ziming Li,Di Xu,Haixuan Gao,Hang Li,Jing Wang,Lejian Ren,Qigen Hu,Qianqian Wang,Shiyao Wang,Xinchen Luo,Yan Li,Yuhang Hu,Zixing Zhang

Main category: cs.CV

TL;DR: Keye-VL-1.5 improves video understanding with innovative strategies in encoding, pre-training, and post-training, leading to better performance on challenging video tasks.

Details

Motivation: Video understanding remains challenging due to the dynamic and information-dense nature of videos, with existing models struggling to balance spatial resolution and temporal coverage. Method: Keye-VL-1.5 addresses challenges in video comprehension through three innovations: a Slow-Fast video encoding strategy, a progressive four-stage pre-training methodology, and a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment. Result: Keye-VL-1.5 shows significant improvements in video understanding tasks and maintains competitive performance on general multimodal benchmarks, as demonstrated through extensive evaluation and internal human assessment. Conclusion: Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks. Abstract: In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). 
Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
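
The Slow-Fast encoding described in the abstract routes frames by inter-frame similarity. The sketch below shows that routing decision only, under stated assumptions: similarity scores are supplied by the caller, a single fixed threshold separates the pathways, and the first frame is always treated as a key frame. None of these specifics come from the report.

```python
def assign_pathways(similarities, threshold=0.95):
    """Label frames 'slow' (high resolution) or 'fast' (low resolution).

    similarities[i] is the similarity of frame i + 1 to frame i.
    """
    pathways = ["slow"]  # assumption: the first frame is always a key frame
    for similarity in similarities:
        # Near-duplicate frames go to the cheaper Fast pathway.
        pathways.append("fast" if similarity >= threshold else "slow")
    return pathways


# Hypothetical similarity scores for a 4-frame clip.
labels = assign_pathways([0.99, 0.50, 0.97])
```

In this toy clip only the third frame shows a large visual change, so it joins the first frame on the Slow pathway while the other two are handled at lower resolution.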

[285] ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association

Ganlin Zhang,Shenhan Qian,Xi Wang,Daniel Cremers

Main category: cs.CV

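TL;DR: ViSTA-SLAM is an efficient real-time monocular visual SLAM system that achieves high-quality camera tracking and 3D reconstruction without requiring camera intrinsics.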

Details

Motivation: To improve the applicability and performance of visual SLAM systems, ViSTA-SLAM pairs a lightweight frontend model with an optimized backend pose graph, reducing model complexity while improving tracking and reconstruction quality. Method: The system uses a lightweight symmetric two-view association model as the frontend, which estimates relative camera poses and local pointmaps from only two RGB images, and builds a Sim(3) pose graph in the backend to optimize the trajectory. Result: Experiments show that ViSTA-SLAM outperforms existing methods in camera tracking accuracy and 3D reconstruction quality, with a frontend only 35% the size of comparable state-of-the-art methods. Conclusion: ViSTA-SLAM is a real-time monocular visual SLAM system that requires no camera intrinsics and delivers superior camera tracking and 3D reconstruction. Abstract: We present ViSTA-SLAM as a real-time monocular visual SLAM system that operates without requiring camera intrinsics, making it broadly applicable across diverse camera setups. At its core, the system employs a lightweight symmetric two-view association (STA) model as the frontend, which simultaneously estimates relative camera poses and regresses local pointmaps from only two RGB images. This design reduces model complexity significantly (the frontend is only 35% the size of comparable state-of-the-art methods) while enhancing the quality of two-view constraints used in the pipeline. In the backend, we construct a specially designed Sim(3) pose graph that incorporates loop closures to address accumulated drift. Extensive experiments demonstrate that our approach achieves superior performance in both camera tracking and dense 3D reconstruction quality compared to current methods. Github repository: https://github.com/zhangganlin/vista-slam

[286] O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing

Yuqing Chen,Junjie Wang,Lin Liu,Ruihang Chu,Xiaopeng Zhang,Qi Tian,Yujiu Yang

Main category: cs.CV

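TL;DR: O-DisCo-Edit is a unified video editing framework that introduces an object distortion control signal based on random and adaptive noise together with a "copy-form" preservation module, enabling efficient, high-fidelity editing and strong results across a variety of video editing tasks.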

Details

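Motivation: Diffusion models have advanced video editing, but controllable editing remains challenging because it requires precise manipulation of diverse object properties. Current methods need different control signals for different editing tasks, which complicates model design and demands significant training resources. Method: The authors propose O-DisCo-Edit, a unified framework that combines an object distortion control (O-DisCo) signal based on random and adaptive noise with a "copy-form" preservation module for preserving non-edited regions. Result: Extensive experiments and comprehensive human evaluations consistently show that O-DisCo-Edit surpasses existing specialized and multitask state-of-the-art methods across various video editing tasks. Conclusion: O-DisCo-Edit is a unified video editing framework that, through its object distortion control signal and "copy-form" preservation module, achieves efficient, high-fidelity editing and outperforms existing specialized and multitask methods across various video editing tasks. Abstract: Diffusion models have recently advanced video editing, yet controllable editing remains challenging due to the need for precise manipulation of diverse object properties. Current methods require different control signals for diverse editing tasks, which complicates model design and demands significant training resources. To address this, we propose O-DisCo-Edit, a unified framework that incorporates a novel object distortion control (O-DisCo). This signal, based on random and adaptive noise, flexibly encapsulates a wide range of editing cues within a single representation. Paired with a "copy-form" preservation module for preserving non-edited regions, O-DisCo-Edit enables efficient, high-fidelity editing through an effective training paradigm. Extensive experiments and comprehensive human evaluations consistently demonstrate that O-DisCo-Edit surpasses both specialized and multitask state-of-the-art methods across various video editing tasks. https://cyqii.github.io/O-DisCo-Edit.github.io/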

[287] TransForSeg: A Multitask Stereo ViT for Joint Stereo Segmentation and 3D Force Estimation in Catheterization

Pedram Fekri,Mehrdad Zadeh,Javad Dargahi

Main category: cs.CV

TL;DR: A new stereo Vision Transformer model improves multitask performance in catheter segmentation and 3D force estimation, surpassing existing approaches.

Details

Motivation: To enhance catheterization procedures by providing more accurate tactile and visual perception data through an improved multitask deep learning model. Method: A novel encoder-decoder Vision Transformer model that processes two input X-ray images as separate sequences to capture long-range dependencies and generate embeddings for segmentation and 3D force estimation. Result: The model successfully achieves simultaneous 3D force estimation and stereo segmentation of the catheter, outperforming state-of-the-art models in extensive experiments. Conclusion: The proposed stereo Vision Transformer model outperforms existing models in catheter segmentation and force estimation, setting a new state-of-the-art. Abstract: Recently, the emergence of multitask deep learning models has enhanced catheterization procedures by providing tactile and visual perception data through an end-to-end architecture. This information is derived from a segmentation and force estimation head, which localizes the catheter in X-ray images and estimates the applied pressure based on its deflection within the image. These stereo vision architectures incorporate a CNN-based encoder-decoder that captures the dependencies between X-ray images from two viewpoints, enabling simultaneous 3D force estimation and stereo segmentation of the catheter. With these tasks in mind, this work approaches the problem from a new perspective. We propose a novel encoder-decoder Vision Transformer model that processes two input X-ray images as separate sequences. Given sequences of X-ray patches from two perspectives, the transformer captures long-range dependencies without the need to gradually expand the receptive field for either image. The embeddings generated by both the encoder and decoder are fed into two shared segmentation heads, while a regression head employs the fused information from the decoder for 3D force estimation. 
The proposed model is a stereo Vision Transformer capable of simultaneously segmenting the catheter from two angles while estimating the generated forces at its tip in 3D. This model has undergone extensive experiments on synthetic X-ray images with various noise levels and has been compared against state-of-the-art pure segmentation models, vision-based catheter force estimation methods, and a multitask catheter segmentation and force estimation approach. It outperforms existing models, setting a new state-of-the-art in both catheter segmentation and force estimation.

[288] Improving Large Vision and Language Models by Learning from a Panel of Peers

Jefferson Hernandez,Jing Shi,Simon Jenni,Vicente Ordonez,Kushal Kafle

Main category: cs.CV

TL;DR: The paper introduces a novel Panel-of-Peers learning framework for aligning Large Vision and Language Models (LVLMs), leveraging peer evaluations and iterative self-improvement to overcome the limitations of traditional alignment methods.

Details

Motivation: Traditional alignment methods for LVLMs rely on costly human-curated preference data, limited-quality machine-generated data, and self-supervised data that may introduce hallucinations. This work aims to overcome these limitations. Method: The proposed Panel-of-Peers learning framework simulates a peer review system, where a panel of LVLMs evaluates and learns from collective outputs through an iterative self-improvement process. Result: Experiments demonstrate significant improvements across multiple benchmarks, with the average score on fifteen benchmarks increasing from 48% to 57%. Conclusion: The Panel-of-Peers learning framework is a scalable alternative to self-supervised alignment, significantly improving model performance across multiple benchmarks without extensive human-labeled datasets. Abstract: Traditional alignment methods for Large Vision and Language Models (LVLMs) primarily rely on human-curated preference data. Human-generated preference data is costly; machine-generated preference data is limited in quality; and self-supervised preference data often introduces hallucinations. To overcome these limitations, we propose a novel Panel-of-Peers learning framework inspired by collaborative learning among humans. This approach leverages a panel of LVLMs, each evaluating and learning from their collective outputs through an iterative self-improvement process. By simulating a peer review system, our models generate, assess, and refine outputs in response to a curated set of prompts, mimicking a classroom learning environment. We demonstrate that this methodology enhances model performance without requiring extensive human-labeled datasets. Our experiments show significant improvement across multiple benchmarks, demonstrating the potential of peer evaluations as a scalable alternative to self-supervised alignment. Notably, we show that Panel-of-Peers increases the average score on fifteen benchmarks from 48% to 57%.

[289] Q-Sched: Pushing the Boundaries of Few-Step Diffusion Models with Quantization-Aware Scheduling

Natalia Frumkin,Diana Marculescu

Main category: cs.CV

TL;DR: Q-Sched is a new quantization method for diffusion models that reduces model size and computational cost while maintaining high image generation quality, outperforming existing methods.

Details

Motivation: Text-to-image diffusion models are computationally intensive, requiring large model evaluations and limiting the effectiveness of existing quantization methods. A more efficient approach is needed for reduced computational cost. Method: Q-Sched modifies the diffusion model scheduler rather than model weights, using a quantization-aware pre-conditioning approach with the JAQ loss to optimize text-image compatibility and image quality. Result: Q-Sched achieves a 4x reduction in model size while maintaining full-precision accuracy, with a 15.5% FID improvement over FP16 4-step Latent Consistency Model and 16.6% improvement over FP16 8-step Phased Consistency Model. Conclusion: Q-Sched is an effective method for post-training quantization in diffusion models, significantly reducing model size while maintaining full-precision accuracy and achieving substantial performance improvements. Abstract: Text-to-image diffusion models are computationally intensive, often requiring dozens of forward passes through large transformer backbones. For instance, Stable Diffusion XL generates high-quality images with 50 evaluations of a 2.6B-parameter model, an expensive process even for a single batch. Few-step diffusion models reduce this cost to 2-8 denoising steps but still depend on large, uncompressed U-Net or diffusion transformer backbones, which are often too costly for full-precision inference without datacenter GPUs. These requirements also limit existing post-training quantization methods that rely on full-precision calibration. We introduce Q-Sched, a new paradigm for post-training quantization that modifies the diffusion model scheduler rather than model weights. By adjusting the few-step sampling trajectory, Q-Sched achieves full-precision accuracy with a 4x reduction in model size. 
To learn quantization-aware pre-conditioning coefficients, we propose the JAQ loss, which combines text-image compatibility with an image quality metric for fine-grained optimization. JAQ is reference-free and requires only a handful of calibration prompts, avoiding full-precision inference during calibration. Q-Sched delivers substantial gains: a 15.5% FID improvement over the FP16 4-step Latent Consistency Model and a 16.6% improvement over the FP16 8-step Phased Consistency Model, showing that quantization and few-step distillation are complementary for high-fidelity generation. A large-scale user study with more than 80,000 annotations further confirms Q-Sched's effectiveness on both FLUX.1[schnell] and SDXL-Turbo.
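
To put the reported 4x size reduction in concrete terms, the following back-of-the-envelope sketch computes weight storage for the 2.6B-parameter model mentioned in the abstract. The bit widths are assumptions for illustration only: the abstract does not state the baseline precision or the quantized bit width, so a 16-bit baseline and 4-bit weights are used here simply because they yield a 4x ratio.

```python
def weight_storage_gb(num_params: float, bits_per_weight: int) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes); ignores activations and metadata."""
    return num_params * bits_per_weight / 8 / 1e9


SDXL_PARAMS = 2.6e9          # parameter count quoted in the abstract
ASSUMED_BASELINE_BITS = 16   # assumption: FP16 baseline
ASSUMED_QUANTIZED_BITS = 4   # assumption: 4-bit weights, chosen to give a 4x ratio

baseline_gb = weight_storage_gb(SDXL_PARAMS, ASSUMED_BASELINE_BITS)
quantized_gb = weight_storage_gb(SDXL_PARAMS, ASSUMED_QUANTIZED_BITS)
reduction = baseline_gb / quantized_gb

print(f"baseline: {baseline_gb:.1f} GB, quantized: {quantized_gb:.1f} GB, "
      f"reduction: {reduction:.1f}x")
```

Under these assumptions the weights shrink from about 5.2 GB to about 1.3 GB, which is the kind of saving that matters when inference has to fit outside datacenter GPUs.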

[290] OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

Yanqing Liu,Xianhang Li,Letian Zhang,Zirui Wang,Zeyu Zheng,Yuyin Zhou,Cihang Xie

Main category: cs.CV

TL;DR: This paper introduces OpenVision 2, a simplified vision-language model that removes the text encoder and contrastive loss to improve training efficiency while maintaining performance on multimodal benchmarks.

Details

Motivation: The motivation is to enhance training efficiency in vision-language pretraining by simplifying the model architecture, inspired by prior works such as CapPa, AIMv2, and LLaVA. Method: The method involves simplifying OpenVision's architecture by removing the text encoder and contrastive loss, using only the captioning loss as a generative training signal, resulting in a more efficient model named OpenVision 2. Result: OpenVision 2 matches the original model's performance on multimodal benchmarks while significantly reducing training time and memory consumption. For example, with ViT-L/14, training time is reduced by about 1.5x and memory usage by about 1.8x, enabling larger batch sizes and scaling beyond 1 billion parameters. Conclusion: The paper concludes that the new OpenVision 2, by removing the text encoder and contrastive loss and using only the captioning loss, achieves competitive performance while significantly improving training efficiency, making it a promising direction for future vision encoder development in multimodal foundation models. Abstract: This paper provides a simplification on OpenVision's architecture and loss design for enhancing its training efficiency. Following the prior vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs like LLaVA, our changes are straightforward: we remove the text encoder (and therefore the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 competitively matches the original model's performance on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h), and memory usage by about 1.8x (from 24.5GB to 13.8GB, equivalently allowing the maximum batch size to grow from 2k to 8k). 
This superior training efficiency also allows us to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We hold a strong belief that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.
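
The efficiency figures quoted for ViT-L/14 can be checked directly from the raw numbers in the abstract. The short sketch below recomputes the ratios; the helper names are illustrative and not part of the paper.

```python
def ratio(before: float, after: float) -> float:
    """How many times larger `before` is than `after`."""
    return before / after


def percent_saved(before: float, after: float) -> float:
    """Relative saving, as a percentage of the original value."""
    return 100.0 * (before - after) / before


# Numbers taken from the abstract (ViT-L/14).
time_ratio = ratio(83, 57)          # training hours
memory_ratio = ratio(24.5, 13.8)    # memory in GB
batch_growth = 8_000 / 2_000        # maximum batch size, 2k -> 8k

print(f"training time: {time_ratio:.2f}x faster, {percent_saved(83, 57):.1f}% saved")
print(f"memory: {memory_ratio:.2f}x smaller, {percent_saved(24.5, 13.8):.1f}% saved")
print(f"max batch size: {batch_growth:.0f}x larger")
```

The raw ratios come out at roughly 1.46 and 1.78, which round to the "about 1.5x" and "about 1.8x" stated in the abstract.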

[291] GaussianGAN: Real-Time Photorealistic controllable Human Avatars

Mohamed Ilyes Lakhal,Richard Bowden

Main category: cs.CV

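TL;DR: GaussianGAN introduces a new Gaussian-point densification method for generating high-quality, photorealistic, controllable human avatars in real time, outperforming prior work in pixel fidelity and visual quality.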

Details

Motivation: Existing neural rendering techniques produce noticeable blurring when generating photorealistic human avatars, so a higher-quality, real-time, controllable generation method is needed. Method: GaussianGAN uses a Gaussian densification strategy to build Gaussian points from the surface of cylindrical structures around estimated skeletal limbs, renders an accurate semantic segmentation using the camera calibration, and then uses a UNet generator that combines the segmentation maps with the Gaussian point features to produce photorealistic human avatars. Result: GaussianGAN renders in real time at 79 FPS and achieves a pixel fidelity of 32.94 dB on the ZJU Mocap dataset and 33.39 dB on the Thuman4 dataset, outperforming existing methods. Conclusion: GaussianGAN generates high-quality, photorealistic, controllable human avatars in real time; its Gaussian densification strategy and novel view segmentation module improve both visual quality and rendering speed. Abstract: Photorealistic and controllable human avatars have gained popularity in the research community thanks to rapid advances in neural rendering, providing fast and realistic synthesis tools. However, a limitation of current solutions is the presence of noticeable blurring. To solve this problem, we propose GaussianGAN, an animatable avatar approach developed for photorealistic rendering of people in real-time. We introduce a novel Gaussian splatting densification strategy to build Gaussian points from the surface of cylindrical structures around estimated skeletal limbs. Given the camera calibration, we render an accurate semantic segmentation with our novel view segmentation module. Finally, a UNet generator uses the rendered Gaussian splatting features and the segmentation maps to create photorealistic digital avatars. Our method runs in real-time with a rendering speed of 79 FPS. It outperforms previous methods regarding visual perception and quality, achieving state-of-the-art results with a pixel fidelity of 32.94 dB on the ZJU Mocap dataset and 33.39 dB on the Thuman4 dataset.
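
To give the reported numbers some intuition, the sketch below converts between mean squared error and a decibel score and turns a frame rate into a per-frame time budget. It assumes the "pixel fidelity" figures are PSNR values computed on images scaled to [0, 1]; the abstract does not name the metric explicitly, so treat that as an assumption.

```python
import math


def psnr_db(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_from_psnr(psnr: float, max_value: float = 1.0) -> float:
    """Inverse of psnr_db: the mean squared error that yields a given PSNR."""
    return max_value ** 2 / 10 ** (psnr / 10.0)


def frame_time_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Assumption: the 32.94 dB and 33.39 dB figures are PSNR on [0, 1] images.
for dataset, score in [("ZJU Mocap", 32.94), ("Thuman4", 33.39)]:
    print(f"{dataset}: {score} dB corresponds to MSE of {mse_from_psnr(score):.2e}")

print(f"79 FPS leaves {frame_time_ms(79):.2f} ms per frame")
```

Because the scale is logarithmic, a gain of a fraction of a decibel still corresponds to a measurable drop in per-pixel error.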

[292] Examination of PCA Utilisation for Multilabel Classifier of Multispectral Images

Filip Karpowicz,Wiktor Kępiński,Bartosz Staszyński,Grzegorz Sarwas

Main category: cs.CV

TL;DR: This study examines how useful PCA is for multi-label classification of multispectral images with ResNet50 and DINOv2.

Details

Motivation: The high dimensionality of multispectral images, the processing challenges it brings, and the added complexity of multi-label classification motivate a study of PCA's utility. Method: Multi-label classification is performed with ResNet50 and DINOv2, with an optional PCA step that reduces the data to three dimensions beforehand. Result: The effect of PCA on multi-label multispectral image classification varies with the deep learning architecture and training strategy. Conclusion: PCA's effectiveness depends on the chosen deep learning architecture and training strategy, which opens avenues for research into self-supervised pre-training and alternative dimensionality reduction methods. Abstract: This paper investigates the utility of Principal Component Analysis (PCA) for multi-label classification of multispectral images using ResNet50 and DINOv2, acknowledging the high dimensionality of such data and the associated processing challenges. Multi-label classification, where each image may belong to multiple classes, adds further complexity to feature extraction. Our pipeline includes an optional PCA step that reduces the data to three dimensions before feeding it into a three-layer classifier. The findings demonstrate that the effectiveness of PCA for multi-label multispectral image classification depends strongly on the chosen deep learning architecture and training strategy, opening avenues for future research into self-supervised pre-training and alternative dimensionality reduction approaches.
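
The PCA step can be illustrated with a small dependency-free sketch. It assumes the reduction is applied per pixel across spectral bands, which the abstract does not spell out, and it uses power iteration with deflation in place of the SVD-based routine a real pipeline would more likely call. The function name and toy data are hypothetical.

```python
import math
import random


def _deflate(w, basis):
    """Remove from w its projections onto the orthonormal vectors in basis."""
    for u in basis:
        dot = sum(wi * ui for wi, ui in zip(w, u))
        w = [wi - dot * ui for wi, ui in zip(w, u)]
    return w


def _norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


def pca_reduce(pixels, k=3, iters=200, seed=0):
    """Project each band vector onto the top-k principal components (requires k <= bands)."""
    n, b = len(pixels), len(pixels[0])
    mean = [sum(p[j] for p in pixels) / n for j in range(b)]
    centered = [[p[j] - mean[j] for j in range(b)] for p in pixels]
    # Sample covariance matrix of the bands (b x b).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(b)]
           for i in range(b)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        # Start from a random unit vector orthogonal to the components found so far.
        v = _deflate([rng.gauss(0.0, 1.0) for _ in range(b)], components)
        v = [vi / _norm(v) for vi in v]
        for _ in range(iters):
            w = _deflate([sum(cov[i][j] * v[j] for j in range(b)) for i in range(b)],
                         components)
            norm = _norm(w)
            if norm < 1e-12:  # no variance left in the remaining directions
                break
            v = [wi / norm for wi in w]
        components.append(v)
    return [[sum(ci * ui for ci, ui in zip(c, u)) for u in components] for c in centered]


# Toy "multispectral" pixels with 4 perfectly correlated bands (hypothetical data).
pixels = [[t, 2.0 * t, 3.0 * t, 4.0 * t] for t in range(10)]
reduced = pca_reduce(pixels, k=3)
print(len(reduced), "pixels,", len(reduced[0]), "channels each")
```

In this toy case all variance lies along one direction, so the first output channel carries the signal and the other two are numerically zero. Real multispectral data would spread variance across the three retained channels, which can then be fed to a backbone that expects three-channel input.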

[293] Deep Learning-Based Rock Particulate Classification Using Attention-Enhanced ConvNeXt

Anthony Amankwah,Chris Aldrich

Main category: cs.CV

TL;DR: The paper proposes CNSCA, a ConvNeXt-based model whose added attention mechanisms improve the accuracy and robustness of rock size classification.

Details

Motivation: Accurate rock size classification is vital to operational efficiency and safety in geotechnical engineering, mining, and resource management, but conventional methods struggle to capture both fine-grained local patterns and broad contextual relationships in rock imagery. Method: Self-attention is added to the ConvNeXt architecture to capture long-range spatial dependencies, and channel attention is added to emphasize important feature channels. Result: Experiments on a rock size classification dataset show that adding attention mechanisms significantly improves performance on fine-grained classification of natural textures. Conclusion: By combining self-attention and channel attention, CNSCA significantly improves classification accuracy and robustness on rock size classification. Abstract: Accurate classification of rock sizes is a vital component in geotechnical engineering, mining, and resource management, where precise estimation influences operational efficiency and safety. In this paper, we propose an enhanced deep learning model based on the ConvNeXt architecture, augmented with both self-attention and channel attention mechanisms. Building upon the foundation of ConvNeXt, our proposed model, termed CNSCA, introduces self-attention to capture long-range spatial dependencies and channel attention to emphasize informative feature channels. This hybrid design enables the model to effectively capture both fine-grained local patterns and broader contextual relationships within rock imagery, leading to improved classification accuracy and robustness. We evaluate our model on a rock size classification dataset and compare it against three strong baselines. The results demonstrate that the incorporation of attention mechanisms significantly enhances the model's capability for fine-grained classification tasks involving natural textures like rocks.
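
As a rough illustration of what "channel attention" means here, the sketch below pools each channel to a single number, maps the pooled vector to per-channel gates in (0, 1), and rescales the channels. This is a simplified stand-in with a single linear layer; the abstract does not describe the exact form used in CNSCA, and the weights below are hypothetical.

```python
import math


def channel_attention(feature_map, weights, biases):
    """Rescale channels by learned gates.

    feature_map: list of C channels, each an H x W list of floats.
    weights:     C x C matrix mapping pooled channel means to gate logits.
    biases:      length-C list of gate biases.
    """
    # Squeeze: global average pooling over the spatial dimensions.
    pooled = [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
              for channel in feature_map]
    # Excite: one linear layer followed by a sigmoid gives a gate per channel.
    gates = []
    for w_row, b in zip(weights, biases):
        z = sum(w * p for w, p in zip(w_row, pooled)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * x for x in row] for row in channel]
              for channel, g in zip(feature_map, gates)]
    return scaled, gates


features = [
    [[1.0, 2.0], [3.0, 4.0]],   # an "active" channel
    [[0.0, 0.0], [0.0, 0.0]],   # a silent channel
]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights
scaled, gates = channel_attention(features, identity, [0.0, 0.0])
print("gates:", [round(g, 3) for g in gates])
```

With these toy weights the active channel receives a gate well above 0.5 while the silent one stays at 0.5, which is the "emphasize informative feature channels" behaviour in miniature.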

[294] Clinical Metadata Guided Limited-Angle CT Image Reconstruction

Yu Shi,Shuyi Fan,Changsheng Fang,Shuo Han,Haodong Li,Li Zhou,Bahareh Morovati,Dayang Wang,Hengyong Yu

Main category: cs.CV

TL;DR: This paper introduces a two-stage diffusion framework using clinical metadata to enhance limited-angle computed tomography (LACT) reconstruction, resulting in improved image quality and reduced artifacts, especially under severe angular truncation.

Details

Motivation: The motivation stems from the limitations of LACT, such as severe artifacts caused by truncated projections, which make reconstruction challenging. The authors aim to address this by leveraging clinical metadata to improve reconstruction fidelity and reduce artifacts. Method: The proposed method involves a two-stage diffusion framework: the first stage uses a transformer-based diffusion model conditioned on clinical metadata to generate coarse anatomical priors, while the second stage refines these priors with metadata and enforces physics-based data consistency using an Alternating Direction Method of Multipliers (ADMM) module. Result: Extensive experiments on synthetic and real cardiac CT datasets demonstrated that the proposed method outperforms metadata-free baselines in terms of SSIM, PSNR, nMI, and PCC metrics. Ablation studies also showed that different types of metadata provide complementary benefits, particularly in limited-angle conditions. Conclusion: The study concludes that incorporating clinical metadata into a two-stage diffusion framework significantly enhances the quality and efficiency of limited-angle computed tomography (LACT) reconstruction, particularly under severe angular truncation. Abstract: Limited-angle computed tomography (LACT) offers improved temporal resolution and reduced radiation dose for cardiac imaging, but suffers from severe artifacts due to truncated projections. To address the ill-posedness of LACT reconstruction, we propose a two-stage diffusion framework guided by structured clinical metadata. In the first stage, a transformer-based diffusion model conditioned exclusively on metadata, including acquisition parameters, patient demographics, and diagnostic impressions, generates coarse anatomical priors from noise. The second stage further refines the images by integrating both the coarse prior and metadata to produce high-fidelity results. 
Physics-based data consistency is enforced at each sampling step in both stages using an Alternating Direction Method of Multipliers module, ensuring alignment with the measured projections. Extensive experiments on both synthetic and real cardiac CT datasets demonstrate that incorporating metadata significantly improves reconstruction fidelity, particularly under severe angular truncation. Compared to existing metadata-free baselines, our method achieves superior performance in SSIM, PSNR, nMI, and PCC. Ablation studies confirm that different types of metadata contribute complementary benefits, particularly diagnostic and demographic priors under limited-angle conditions. These findings highlight the dual role of clinical metadata in improving both reconstruction quality and efficiency, supporting their integration into future metadata-guided medical imaging frameworks.

[295] TransMatch: A Transfer-Learning Framework for Defect Detection in Laser Powder Bed Fusion Additive Manufacturing

Mohsen Asghari Ilani,Yaser Mike Banad

Main category: cs.CV

TL;DR: TransMatch improves defect detection in additive manufacturing using transfer and semi-supervised learning, achieving high accuracy with limited labeled data.

Details

Motivation: Surface defects in Laser Powder Bed Fusion pose risks to structural integrity, and there is a scarcity of labeled data for defect detection. Method: TransMatch combines transfer learning and semi-supervised few-shot learning to utilize both labeled and unlabeled images effectively. Result: TransMatch achieved 98.91% accuracy with minimal loss and high precision, recall, and F1-scores across multiple defect classes. Conclusion: TransMatch represents a significant advancement in additive manufacturing defect detection, offering a practical and scalable solution for quality assurance. Abstract: Surface defects in Laser Powder Bed Fusion (LPBF) pose significant risks to the structural integrity of additively manufactured components. This paper introduces TransMatch, a novel framework that merges transfer learning and semi-supervised few-shot learning to address the scarcity of labeled AM defect data. By effectively leveraging both labeled and unlabeled novel-class images, TransMatch circumvents the limitations of previous meta-learning approaches. Experimental evaluations on a Surface Defects dataset of 8,284 images demonstrate the efficacy of TransMatch, achieving 98.91% accuracy with minimal loss, alongside high precision, recall, and F1-scores for multiple defect classes. These findings underscore its robustness in accurately identifying diverse defects, such as cracks, pinholes, holes, and spatter. TransMatch thus represents a significant leap forward in additive manufacturing defect detection, offering a practical and scalable solution for quality assurance and reliability across a wide range of industrial applications.
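
For reference, the per-class metrics named in the abstract follow directly from confusion counts. The sketch below shows the standard definitions; the counts are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard per-class precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one defect class (for example "crack").
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Reporting these per class alongside overall accuracy matters when some defect types are much rarer than others.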

[296] Mixture of Balanced Information Bottlenecks for Long-Tailed Visual Recognition

Yifan Lan,Xin Cai,Jun Cheng,Shan Tan

Main category: cs.CV

TL;DR: The paper introduces BIB and MBIB methods that improve the performance of DNNs on long-tailed visual recognition tasks by integrating loss function re-balancing and self-distillation techniques into the original IB network.

Details

Motivation: The motivation is to overcome the challenges of efficient training and deployment of DNNs in real-world visual recognition where data are usually long-tailed. Method: The paper proposes a balanced information bottleneck (BIB) approach and a novel structure of mixture of multiple balanced information bottlenecks (MBIB) integrating loss function re-balancing and self-distillation techniques. Result: Experiments on long-tailed datasets (CIFAR100-LT, ImageNet-LT, and iNaturalist 2018) show that both BIB and MBIB achieve state-of-the-art performance in long-tailed visual recognition. Conclusion: BIB and MBIB reach state-of-the-art performance for long-tailed visual recognition. Abstract: Deep neural networks (DNNs) have achieved significant success in various applications with large-scale and balanced data. However, data in real-world visual recognition are usually long-tailed, bringing challenges to efficient training and deployment of DNNs. Information bottleneck (IB) is an elegant approach for representation learning. In this paper, we propose a balanced information bottleneck (BIB) approach, in which loss function re-balancing and self-distillation techniques are integrated into the original IB network. BIB is thus capable of learning a sufficient representation with essential label-related information fully preserved for long-tailed visual recognition. To further enhance the representation learning capability, we also propose a novel structure of mixture of multiple balanced information bottlenecks (MBIB), where different BIBs are responsible for combining knowledge from different network layers. MBIB facilitates an end-to-end learning strategy that trains representation and classification simultaneously from an information theory perspective. We conduct experiments on commonly used long-tailed datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018. Both BIB and MBIB reach state-of-the-art performance for long-tailed visual recognition.
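
For readers unfamiliar with the information bottleneck, the classical objective that BIB builds on can be written as below. Here X is the input, Y the label, Z the learned representation, I(·;·) mutual information, and β a trade-off weight. The abstract does not give the re-balanced loss used in BIB, so only the standard form is shown.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

The first term compresses the input, and the second keeps label-relevant information, which is the "sufficient representation" the abstract refers to.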

[297] PractiLight: Practical Light Control Using Foundational Diffusion Models

Yotam Erel,Rishabh Dabral,Vladislav Golyanik,Amit H. Bermano,Christian Theobalt

Main category: cs.CV

TL;DR: PractiLight enables effective and generalizable light control in image generation by leveraging foundational knowledge of generative models and self-attention mechanisms.

Details

Motivation: Controlling light in generated images is challenging and existing methods are limited by domain-specific datasets, prompting the need for a more general and practical solution. Method: PractiLight trains a lightweight LoRA regressor to generate direct irradiance maps, leveraging insights from self-attention mechanisms and utilizing Classifier Guidance for relighting. Result: PractiLight achieves superior quality and control in image relighting with high parameter and data efficiency, outperforming leading approaches across various scene types. Conclusion: PractiLight offers a practical and efficient approach for light control in image generation, demonstrating state-of-the-art performance and generalization across diverse conditions. Abstract: Light control in generated images is a difficult task, posing specific challenges, spanning over the entire image and frequency spectrum. Most approaches tackle this problem by training on extensive yet domain-specific datasets, limiting the inherent generalization and applicability of the foundational backbones used. Instead, PractiLight is a practical approach, effectively leveraging foundational understanding of recent generative models for the task. Our key insight is that lighting relationships in an image are similar in nature to token interaction in self-attention layers, and hence are best represented there. Based on this and other analyses regarding the importance of early diffusion iterations, PractiLight trains a lightweight LoRA regressor to produce the direct irradiance map for a given image, using a small set of training images. We then employ this regressor to incorporate the desired lighting into the generation process of another image using Classifier Guidance. This careful design generalizes well to diverse conditions and image domains. 
We demonstrate state-of-the-art performance in terms of quality and control with proven parameter and data efficiency compared to leading works over a wide variety of scene types. We hope this work affirms that image lighting can feasibly be controlled by tapping into foundational knowledge, enabling practical and general relighting.

[298] Latent Gene Diffusion for Spatial Transcriptomics Completion

Paula Cárdenas,Leonardo Manrique,Daniela Vega,Daniela Ruiz,Pablo Arbeláez

Main category: cs.CV

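TL;DR: LGDiST is a reference-free method for completing spatial transcriptomics data; its latent gene diffusion model addresses data dropout and clearly outperforms existing methods.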

Details

Motivation: Current methods that rely on single-cell RNA sequencing references are limited by alignment quality, dependence on external datasets, and data dropout, so a more reliable method for completing spatial transcriptomics data is needed. Method: LGDiST uses a new diffusion model architecture that does not depend on single-cell RNA sequencing data and addresses dropout through latent gene diffusion. Result: Across 26 datasets, LGDiST's average mean squared error is 18% lower than the previous state of the art, and completing data with LGDiST improves gene expression prediction performance by up to 10%. Conclusion: By using genes previously considered uninformative, LGDiST builds a rich, biologically meaningful genetic latent space and significantly outperforms the existing state of the art. Abstract: Computer Vision has proven to be a powerful tool for analyzing Spatial Transcriptomics (ST) data. However, current models that predict spatially resolved gene expression from histopathology images suffer from significant limitations due to data dropout. Most existing approaches rely on single-cell RNA sequencing references, making them dependent on alignment quality and external datasets while also risking batch effects and inherited dropout. In this paper, we address these limitations by introducing LGDiST, the first reference-free latent gene diffusion model for ST data dropout. We show that LGDiST outperforms the previous state-of-the-art in gene expression completion, with an average Mean Squared Error that is 18% lower across 26 datasets. Furthermore, we demonstrate that completing ST data with LGDiST improves the gene expression prediction performance of six state-of-the-art methods by up to 10% in MSE. A key innovation of LGDiST is using context genes previously considered uninformative to build a rich and biologically meaningful genetic latent space. Our experiments show that removing key components of LGDiST, such as the context genes, the ST latent space, and the neighbor conditioning, leads to considerable drops in performance. These findings underscore that the full architecture of LGDiST achieves substantially better performance than any of its isolated components.

[299] Enabling Federated Object Detection for Connected Autonomous Vehicles: A Deployment-Oriented Evaluation

Komala Subramanyam Cherukuri,Kewei Sha,Zhenhua Huang

Main category: cs.CV

TL;DR: This paper evaluates FL-based object detection for CAVs, analyzing performance, efficiency, and robustness trade-offs to enable practical deployment.

Details

Motivation: Object detection is essential for CAVs to understand their surroundings, but centralized training has limitations in scalability, adaptability, and privacy. Federated learning offers a promising alternative but faces challenges in deployment due to computational demands and diverse operating conditions. Method: The authors use state-of-the-art object detection models (YOLOv5, YOLOv8, YOLOv11, and Deformable DETR) and evaluate them on three well-known datasets (KITTI, BDD100K, and nuScenes). They assess performance under various resolutions, batch sizes, weather and lighting conditions, and dynamic client participation. Result: The work presents a holistic, deployment-oriented evaluation of FL-based object detection in CAVs, offering insights into accuracy-computation-resource trade-offs and strategies to handle data heterogeneity, hardware constraints, and environmental variability. Conclusion: The paper provides a comprehensive evaluation of FL-based object detection in CAVs, analyzing trade-offs between detection accuracy, computational cost, and resource usage while addressing challenges like heterogeneity, constrained hardware, and environmental variability. Abstract: Object detection is crucial for Connected Autonomous Vehicles (CAVs) to perceive their surroundings and make safe driving decisions. Centralized training of object detection models often achieves promising accuracy, fast convergence, and simplified training process, but it falls short in scalability, adaptability, and privacy-preservation. Federated learning (FL), by contrast, enables collaborative, privacy-preserving, and continuous training across naturally distributed CAV fleets. However, deploying FL in real-world CAVs remains challenging due to the substantial computational demands of training and inference, coupled with highly diverse operating conditions. 
Practical deployment must address three critical factors: (i) heterogeneity from non-IID data distributions, (ii) constrained onboard computing hardware, and (iii) environmental variability such as lighting and weather, alongside systematic evaluation to ensure reliable performance. This work introduces the first holistic deployment-oriented evaluation of FL-based object detection in CAVs, integrating model performance, system-level resource profiling, and environmental robustness. Using state-of-the-art detectors, YOLOv5, YOLOv8, YOLOv11, and Deformable DETR, evaluated on the KITTI, BDD100K, and nuScenes datasets, we analyze trade-offs between detection accuracy, computational cost, and resource usage under diverse resolutions, batch sizes, weather and lighting conditions, and dynamic client participation, paving the way for robust FL deployment in CAVs.
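
As background for the federated setting, the sketch below shows weighted parameter averaging in the style of FedAvg, the canonical aggregation rule in federated learning. The abstract does not say which aggregation rule the evaluation uses, so this is an illustrative assumption; the flat parameter lists and client sizes are hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    client_weights: list of equal-length parameter lists, one per participating client.
    client_sizes:   number of local training samples for each of those clients.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one size per participating client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical vehicles; the second has three times as much local data.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

Dynamic client participation, one of the conditions studied in the paper, corresponds here to passing only the clients that reported in a given round, so the weighting shifts from round to round.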

[300] Doctoral Thesis: Geometric Deep Learning For Camera Pose Prediction, Registration, Depth Estimation, and 3D Reconstruction

Xueyang Kang

Main category: cs.CV

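TL;DR: This thesis studies how to combine traditional geometric techniques with deep learning to address core challenges in 3D vision, and develops geometry-aware deep learning models that improve performance in digital cultural heritage preservation and immersive VR/AR environments.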

Details

Motivation: The high dimensionality of 3D data and the scarcity of labeled datasets make it hard to train deep learning models directly on 3D data, while traditional 3D mapping techniques perform poorly in unstructured environments and struggle to produce detailed geometric representations suitable for downstream tasks. Method: Geometric priors or constraints (such as depth information, surface normals, and equivariance) are integrated into deep learning models, and key components of 3D vision are studied systematically: camera pose estimation, point cloud registration, depth estimation, and high-fidelity 3D reconstruction. Result: The thesis develops geometry-aware deep learning models that improve the accuracy and robustness of geometric representations and demonstrates their effectiveness in real-world applications. Conclusion: By combining traditional geometric techniques with deep learning, the thesis develops geometric deep learning methods for fundamental challenges in 3D vision and shows their effectiveness in digital cultural heritage preservation and immersive VR/AR environments. Abstract: Modern deep learning developments create new opportunities for 3D mapping technology, scene reconstruction pipelines, and virtual reality development. Despite advances in 3D deep learning technology, direct training of deep learning models on 3D data faces challenges due to the high dimensionality inherent in 3D data and the scarcity of labeled datasets. Structure-from-motion (SfM) and Simultaneous Localization and Mapping (SLAM) exhibit robust performance when applied to structured indoor environments but often struggle with ambiguous features in unstructured environments. These techniques often struggle to generate detailed geometric representations effective for downstream tasks such as rendering and semantic analysis. Current limitations require the development of 3D representation methods that combine traditional geometric techniques with deep learning capabilities to generate robust geometry-aware deep learning models. The dissertation provides solutions to the fundamental challenges in 3D vision by developing geometric deep learning methods tailored for essential tasks such as camera pose estimation, point cloud registration, depth prediction, and 3D reconstruction. The integration of geometric priors or constraints, such as depth information, surface normals, and equivariance, into deep learning models enhances both the accuracy and robustness of geometric representations.
This study systematically investigates key components of 3D vision, including camera pose estimation, point cloud registration, depth estimation, and high-fidelity 3D reconstruction, demonstrating their effectiveness across real-world applications such as digital cultural heritage preservation and immersive VR/AR environments.

[301] HydroVision: Predicting Optically Active Parameters in Surface Water Using Computer Vision

Shubham Laxmikant Deshmukh,Matthew Wilchek,Feras A. Batarseh

Main category: cs.CV

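TL;DR: HydroVision is a deep learning-based scene classification framework that estimates optically active water quality parameters from standard RGB images for real-time water quality monitoring.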

Details

Motivation: Deep learning offers a non-contact way to assess water quality and detect contamination, which is critical for disaster response and public health protection. Method: The model is trained on more than 500,000 seasonally varied RGB images; five state-of-the-art architectures (convolutional neural networks and a Vision Transformer) are evaluated through transfer learning, and DenseNet121 is selected as the best. Result: DenseNet121 reaches an R2 score of 0.89 when predicting CDOM, showing the framework's potential for practical water quality monitoring under diverse conditions. Conclusion: HydroVision's success supports the feasibility of deep learning for water monitoring; future work will improve robustness under low-light and occluded conditions. Abstract: Ongoing advancements in computer vision, particularly in pattern recognition and scene classification, have enabled new applications in environmental monitoring. Deep learning now offers non-contact methods for assessing water quality and detecting contamination, both critical for disaster response and public health protection. This work introduces HydroVision, a deep learning-based scene classification framework that estimates optically active water quality parameters including Chlorophyll-Alpha, Chlorophylls, Colored Dissolved Organic Matter (CDOM), Phycocyanins, Suspended Sediments, and Turbidity from standard Red-Green-Blue (RGB) images of surface water. HydroVision supports early detection of contamination trends and strengthens monitoring by regulatory agencies during external environmental stressors, industrial activities, and force majeure events. The model is trained on more than 500,000 seasonally varied images collected from the United States Geological Survey Hydrologic Imagery Visualization and Information System between 2022 and 2024. This approach leverages widely available RGB imagery as a scalable, cost-effective alternative to traditional multispectral and hyperspectral remote sensing. Four state-of-the-art convolutional neural networks (VGG-16, ResNet50, MobileNetV2, DenseNet121) and a Vision Transformer are evaluated through transfer learning to identify the best-performing architecture. DenseNet121 achieves the highest validation performance, with an R2 score of 0.89 in predicting CDOM, demonstrating the framework's promise for real-world water quality monitoring across diverse conditions.
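
The R2 score reported for CDOM is the coefficient of determination. A minimal sketch of how it is computed is shown below, using hypothetical measurements rather than any data from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. predicted values for one water quality parameter.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r2_score(measured, predicted)
print(f"R2 = {score:.2f}")
```

A score of 1 means perfect prediction and 0 means no better than always predicting the mean, so 0.89 indicates that the model explains most of the variance in the measured values.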
While the current model is optimized for well-lit imagery, future work will focus on improving robustness under low-light and obstructed scenarios to expand its operational utility.
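For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a coefficient of determination (R²) is computed for a regression target such as CDOM. The function name and the sample values are illustrative assumptions and are not taken from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after using the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean of the targets.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical CDOM measurements and predictions, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r2_score(measured, predicted)
```

A score of 1.0 means the predictions match the measurements exactly, while 0.0 means the model does no better than predicting the mean.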

[302] Automated Wildfire Damage Assessment from Multi view Ground level Imagery Via Vision Language Models

Miguel Esparza,Archit Gupta,Ali Mostafavi,Kai Yin,Yiming Xiao

Main category: cs.CV

TL;DR: This study proposes a zero-shot framework using pre-trained vision language models (VLMs) for rapid wildfire damage assessment, showing that multi-view analysis significantly improves accuracy without requiring labeled data or extensive training.

Details

Motivation: The increasing intensity and frequency of wildfires necessitate faster and more accurate property damage assessment methods. Traditional methods are time-consuming, and current computer vision techniques require extensive labeled datasets, which are not always available immediately after a disaster. This research aims to address these challenges by proposing a zero-shot framework using pre-trained models. Method: The research introduced two pipelines using pre-trained vision language models (VLMs) to assess wildfire damage from ground-level imagery. Pipeline A used only a VLM, while Pipeline B integrated an LLM to enhance prompting. The models were evaluated based on their performance in classifying damage from the 2025 Eaton and Palisades fires in California, using both single-view and multi-view analyses. Result: The results showed that single-view assessments had low classification accuracy (F1 scores from 0.225 to 0.511), while multi-view analysis significantly improved performance (F1 scores from 0.857 to 0.947). The McNemar test confirmed that multi-view assessment leads to statistically significant improvements. However, no significant difference was observed between Pipeline A (VLM-only) and Pipeline B (VLM + LLM). Conclusion: The study concludes that leveraging pre-trained vision language models (VLMs) through a zero-shot framework can provide an immediately deployable, flexible, and interpretable workflow for wildfire damage assessment without the need for supervised training. Multi-view analysis significantly improves classification accuracy, although combining VLM with a large language model (LLM) did not yield statistically significant improvements. Abstract: The escalating intensity and frequency of wildfires demand innovative computational methods for rapid and accurate property damage assessment. 
Traditional methods are often time-consuming, while modern computer vision approaches typically require extensive labeled datasets, hindering immediate post-disaster deployment. This research introduces a novel, zero-shot framework leveraging pre-trained vision language models (VLMs) to classify damage from ground-level imagery. We propose and evaluate two pipelines applied to the 2025 Eaton and Palisades fires in California, a VLM (Pipeline A) and a VLM + large language model (LLM) approach (Pipeline B), that integrate structured prompts based on specific wildfire damage indicators. A primary scientific contribution of this study is demonstrating the VLMs' efficacy in synthesizing information from multiple perspectives to identify nuanced damage, a critical limitation in existing literature. Our findings reveal that while single-view assessments struggled to classify affected structures (F1 scores ranging from 0.225 to 0.511), the multi-view analysis yielded dramatic improvements (F1 scores ranging from 0.857 to 0.947). Moreover, the McNemar test confirmed that pipelines with a multi-view image assessment yield statistically significant classification improvements; however, the improvements this research observed between Pipeline A and B were not statistically significant. Thus, future research can explore the potential of LLM prompting in damage assessment. The practical contribution is an immediately deployable, flexible, and interpretable workflow that bypasses the need for supervised training, significantly accelerating triage and prioritization for disaster response practitioners.
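The McNemar test mentioned above compares two classifiers on the same set of items by looking only at the cases where they disagree. The sketch below implements the exact (binomial) two-sided variant in plain Python; the abstract does not say which variant the authors used, and the function names and example labels are assumptions made for illustration.

```python
from math import comb


def discordant_counts(y_true, pred_a, pred_b):
    """Count items that only method A or only method B classifies correctly."""
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the methods never disagree, so there is no evidence of a difference
    k = min(only_a, only_b)
    # Under the null hypothesis each discordant item favours either method with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical labels: 1 = damaged, 0 = not damaged.
truth = [1, 1, 1, 0, 0, 1, 0, 1]
single_view = [0, 1, 0, 0, 1, 0, 0, 0]
multi_view = [1, 1, 1, 0, 0, 1, 0, 1]
p_value = mcnemar_exact(*discordant_counts(truth, single_view, multi_view))
```

A small p-value indicates that one method wins the disagreements more often than chance would explain. With only a handful of items, as in this toy example, even a one-sided split of the disagreements may not reach conventional significance thresholds.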

[303] DroneSR: Rethinking Few-shot Thermal Image Super-Resolution from Drone-based Perspective

Zhipeng Weng,Xiaopeng Liu,Ce Liu,Xingyuan Guo,Yukai Shi,Liang Lin

Main category: cs.CV

TL;DR: This paper proposes a Gaussian quantization representation learning method to address overfitting in large-scale diffusion models for drone-based infrared image super-resolution.

Details

Motivation: Overfitting in large-scale diffusion models hampers generalization, especially in few-sample scenarios like drone-captured infrared image reconstruction. Method: A new Gaussian quantization representation learning method combined with an overfitting monitoring mechanism during training. Result: The method significantly reduces overfitting while maintaining model complexity, validated through experiments on a newly constructed multi-source drone infrared image benchmark dataset. Conclusion: The proposed method effectively mitigates overfitting in large-scale architectures for drone-based infrared image super-resolution tasks, demonstrating superior performance over existing approaches. Abstract: Although large-scale models achieve significant improvements in performance, the overfitting challenge still frequently undermines their generalization ability. In image super-resolution tasks, diffusion models, as representative generative models, typically adopt large-scale architectures. However, few-shot drone-captured infrared training data frequently induces severe overfitting in large-scale architectures. To address this key challenge, we propose a new Gaussian quantization representation learning method oriented to diffusion models that alleviates overfitting and enhances robustness. At the same time, an effective monitoring mechanism tracks large-scale architectures during training to detect signs of overfitting. By introducing Gaussian quantization representation learning, our method effectively reduces overfitting while maintaining architecture complexity. On this basis, we construct a multi-source drone-based infrared image benchmark dataset for detection and use it to emphasize overfitting issues of large-scale architectures in few-sample, diverse drone-based image reconstruction scenarios. To verify the efficacy of the method in mitigating overfitting, experiments are conducted on the constructed benchmark.
Experimental results demonstrate that our method outperforms existing super resolution approaches and significantly mitigates overfitting of large scale architectures under complex conditions. The code and DroneSR dataset will be available at: https://github.com/wengzp1/GARLSR.

[304] Towards Interpretable Geo-localization: a Concept-Aware Global Image-GPS Alignment Framework

Furong Jia,Lanxin Liu,Ce Hou,Fan Zhang,Xinyan Liu,Yu Liu

Main category: cs.CV

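TL;DR: This paper proposes a new geo-localization framework that introduces concept bottlenecks and a Concept-Aware Alignment Module, improving both geo-localization accuracy and model interpretability.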

Details

Motivation: Although geo-localization models such as GeoCLIP have advanced through contrastive learning, their interpretability remains insufficient, and existing concept-based interpretability methods fail to align effectively with geo-aligned image-location embedding objectives. Method: The method jointly projects image and location embeddings onto a shared bank of geographic concepts and minimizes a concept-level loss to strengthen alignment in a concept-specific subspace. Result: Experiments show that the method surpasses GeoCLIP in geo-localization accuracy and improves performance across diverse geospatial prediction tasks, revealing richer semantic insights into geographic decision-making. Conclusion: The study proposes a new framework that combines global geo-localization with concept bottlenecks; inserting a Concept-Aware Alignment Module improves both geo-localization accuracy and interpretability. Abstract: Worldwide geo-localization involves determining the exact geographic location of images captured globally, typically guided by geographic cues such as climate, landmarks, and architectural styles. Despite advancements in geo-localization models like GeoCLIP, which leverages images and location alignment via contrastive learning for accurate predictions, the interpretability of these models remains insufficiently explored. Current concept-based interpretability methods fail to align effectively with Geo-alignment image-location embedding objectives, resulting in suboptimal interpretability and performance. To address this gap, we propose a novel framework integrating global geo-localization with concept bottlenecks. Our method inserts a Concept-Aware Alignment Module that jointly projects image and location embeddings onto a shared bank of geographic concepts (e.g., tropical climate, mountain, cathedral) and minimizes a concept-level loss, enhancing alignment in a concept-specific subspace and enabling robust interpretability. To our knowledge, this is the first work to introduce interpretability into geo-localization. Extensive experiments demonstrate that our approach surpasses GeoCLIP in geo-localization accuracy and boosts performance across diverse geospatial prediction tasks, revealing richer semantic insights into geographic decision-making processes.

[305] A Diffusion-Based Framework for Configurable and Realistic Multi-Storage Trace Generation

Seohyun Kim,Junyoung Lee,Jongho Park,Jinhyung Koo,Sungjin Lee,Yeseong Kim

Main category: cs.CV

TL;DR: DiTTO is a diffusion-based framework for generating realistic, configurable, and diverse multi-device storage traces with high fidelity and minimal error.

Details

Motivation: The motivation is to address the need for generating realistic, configurable, and diverse multi-device storage traces for better simulation and analysis in storage systems. Method: DiTTO uses a diffusion-based framework to synthesize realistic storage traces, capturing temporal dynamics and inter-device dependencies through advanced diffusion techniques. Result: Experimental results show that DiTTO can generate high-fidelity traces with only 8% errors, aligning closely with user-defined configurations while preserving diversity. Conclusion: DiTTO is able to generate high-fidelity, diverse, and precisely configurable storage traces, proving its effectiveness in aligning with user-defined configurations while maintaining quality and diversity. Abstract: We propose DiTTO, a novel diffusion-based framework for generating realistic, precisely configurable, and diverse multi-device storage traces. Leveraging advanced diffusion techniques, DiTTO enables the synthesis of high-fidelity continuous traces that capture temporal dynamics and inter-device dependencies with user-defined configurations. Our experimental results demonstrate that DiTTO can generate traces with high fidelity and diversity while aligning closely with guided configurations with only 8% errors.

[306] Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models

Hiroshi Sasaki

Main category: cs.CV

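TL;DR: This paper proposes a new training paradigm that uses contrastive learning tailored to diagram properties, together with two specialised loss functions, to improve vision-language models' understanding of diagrams.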

Details

Motivation: Existing multimodal models such as CLIP perform well on natural image and text tasks but are limited when interpreting diagrams, which carry structured, symbolic information. Method: The author proposes a contrastive learning training method designed specifically for diagram understanding, combining two loss functions that exploit the structural properties of diagrams. Result: Experiments on a flowchart dataset show substantial improvements over standard CLIP and conventional contrastive learning methods on both image-text matching and visual question answering. Conclusion: Training strategies tailored to the task are essential for improving vision-language models' understanding of complex diagrammatic content. Abstract: Multimodal models, such as the Contrastive Language-Image Pre-training (CLIP) model, have demonstrated remarkable success in aligning visual and linguistic representations. However, these models exhibit limitations when applied to specialised visual domains, such as diagrams, which encode structured, symbolic information distinct from that of natural imagery. In this paper, we introduce a novel training paradigm explicitly designed to enhance the comprehension of diagrammatic images within vision-language models. Our approach uses "hard" samples in a contrastive learning scheme that incorporates two specialised loss functions that leverage the inherent structural properties of diagrams. By integrating these objectives into model training, our method enables models to develop a more structured and semantically coherent understanding of diagrammatic content. We empirically validate our approach on a benchmark dataset of flowcharts, as a representative class of diagrammatic imagery, demonstrating substantial improvements over standard CLIP and conventional hard negative CLIP learning paradigms for both image-text matching and visual question answering tasks. Our findings underscore the significance of tailored training strategies for specialised tasks and contribute to advancing diagrammatic understanding within the broader landscape of vision-language integration.

[307] 2D Gaussian Splatting with Semantic Alignment for Image Inpainting

Hongyu Li,Chaofeng Chen,Xiaoming Li,Guangming Lu

Main category: cs.CV

TL;DR: This paper proposes an image inpainting framework using 2D Gaussian Splatting to encode incomplete images and reconstruct them with a differentiable rasterization process, achieving promising results in both quantitative metrics and perceptual quality.

Details

Motivation: The motivation is to explore the untapped potential of Gaussian Splatting for image inpainting, which requires both locally coherent pixel synthesis and globally consistent semantic restoration. Method: The method involves encoding incomplete images into a continuous field of 2D Gaussian splat coefficients and reconstructing the final image via a differentiable rasterization process. A patch-wise rasterization strategy is introduced to improve efficiency and scalability, while features from a pretrained DINO model are used to ensure global semantic consistency. Result: Extensive experiments on standard benchmarks demonstrate that the proposed method achieves competitive performance in both quantitative metrics and perceptual quality. Conclusion: The paper concludes that their proposed method using 2D Gaussian Splatting for image inpainting achieves competitive performance in both quantitative metrics and perceptual quality, establishing a new direction for applying Gaussian Splatting to 2D image processing. Abstract: Gaussian Splatting (GS), a recent technique for converting discrete points into continuous spatial representations, has shown promising results in 3D scene modeling and 2D image super-resolution. In this paper, we explore its untapped potential for image inpainting, which demands both locally coherent pixel synthesis and globally consistent semantic restoration. We propose the first image inpainting framework based on 2D Gaussian Splatting, which encodes incomplete images into a continuous field of 2D Gaussian splat coefficients and reconstructs the final image via a differentiable rasterization process. The continuous rendering paradigm of GS inherently promotes pixel-level coherence in the inpainted results. To improve efficiency and scalability, we introduce a patch-wise rasterization strategy that reduces memory overhead and accelerates inference. For global semantic consistency, we incorporate features from a pretrained DINO model. 
We observe that DINO's global features are naturally robust to small missing regions and can be effectively adapted to guide semantic alignment in large-mask scenarios, ensuring that the inpainted content remains contextually consistent with the surrounding scene. Extensive experiments on standard benchmarks demonstrate that our method achieves competitive performance in both quantitative metrics and perceptual quality, establishing a new direction for applying Gaussian Splatting to 2D image processing.

[308] Ensemble-Based Event Camera Place Recognition Under Varying Illumination

Therese Joseph,Tobias Fischer,Michael Milford

Main category: cs.CV

TL;DR: The paper introduces an ensemble-based approach for event camera place recognition, combining multiple reconstruction methods and feature extractors to improve robustness under varying lighting conditions, achieving significant performance gains during day-night transitions.

Details

Motivation: The motivation is to address the challenge of developing robust visual place recognition frameworks for event cameras under severe illumination changes, leveraging their high dynamic range and low latency. Method: The method involves an ensemble approach combining sequence-matched results from multiple event-to-frame reconstructions, VPR feature extractors, and temporal resolutions, with a focus on enhancing performance at longer sequence lengths. Result: The result shows a 57% relative improvement in Recall@1 across day-night transitions, demonstrating enhanced robustness under challenging lighting conditions. Conclusion: The paper concludes that their ensemble-based approach for event camera place recognition significantly improves robustness under varied lighting conditions, especially achieving better performance during day-night transitions. Abstract: Compared to conventional cameras, event cameras provide a high dynamic range and low latency, offering greater robustness to rapid motion and challenging lighting conditions. Although the potential of event cameras for visual place recognition (VPR) has been established, developing robust VPR frameworks under severe illumination changes remains an open research problem. In this paper, we introduce an ensemble-based approach to event camera place recognition that combines sequence-matched results from multiple event-to-frame reconstructions, VPR feature extractors, and temporal resolutions. Unlike previous event-based ensemble methods, which only utilise temporal resolution, our broader fusion strategy delivers significantly improved robustness under varied lighting conditions (e.g., afternoon, sunset, night), achieving a 57% relative improvement in Recall@1 across day-night transitions. We evaluate our approach on two long-term driving datasets (with 8 km per traverse) without metric subsampling, thereby preserving natural variations in speed and stop duration that influence event density. 
We also conduct a comprehensive analysis of key design choices, including binning strategies, polarity handling, reconstruction methods, and feature extractors, to identify the most critical components for robust performance. Additionally, we propose a modification to the standard sequence matching framework that enhances performance at longer sequence lengths. To facilitate future research, we will release our codebase and benchmarking framework.

[309] MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement

Dong She,Siming Fu,Mushui Liu,Qiaoqiao Jin,Hualiang Wang,Mu Liu,Jidong Jiang

Main category: cs.CV

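TL;DR: MOSAIC addresses identity fidelity and semantic coherence in multi-subject generation through explicit semantic correspondence and orthogonal feature disentanglement, achieving state-of-the-art performance on multiple benchmarks.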

Details

Motivation: Existing multi-subject generation methods often suffer from identity blending and attribute leakage when modeling how different subjects interact within shared representation spaces, so a new approach is needed. Method: MOSAIC introduces the SemAlign-MS dataset and proposes a semantic correspondence attention loss and a multi-reference disentanglement loss to achieve precise point-to-point semantic alignment and prevent feature interference. Result: MOSAIC achieves state-of-the-art performance on multiple benchmarks and, in particular, maintains high fidelity with 4 or more reference subjects, whereas existing methods typically degrade beyond 3 subjects. Conclusion: MOSAIC is a representation-centric framework that resolves identity fidelity and semantic coherence problems in multi-subject personalized generation through explicit semantic correspondence and orthogonal feature disentanglement. Abstract: Multi-subject personalized generation presents unique challenges in maintaining identity fidelity and semantic coherence when synthesizing images conditioned on multiple reference subjects. Existing methods often suffer from identity blending and attribute leakage due to inadequate modeling of how different subjects should interact within shared representation spaces. We present MOSAIC, a representation-centric framework that rethinks multi-subject generation through explicit semantic correspondence and orthogonal feature disentanglement. Our key insight is that multi-subject generation requires precise semantic alignment at the representation level - knowing exactly which regions in the generated image should attend to which parts of each reference. To enable this, we introduce SemAlign-MS, a meticulously annotated dataset providing fine-grained semantic correspondences between multiple reference subjects and target images, previously unavailable in this domain. Building on this foundation, we propose the semantic correspondence attention loss to enforce precise point-to-point semantic alignment, ensuring high consistency from each reference to its designated regions. Furthermore, we develop the multi-reference disentanglement loss to push different subjects into orthogonal attention subspaces, preventing feature interference while preserving individual identity characteristics. Extensive experiments demonstrate that MOSAIC achieves state-of-the-art performance on multiple benchmarks.
Notably, while existing methods typically degrade beyond 3 subjects, MOSAIC maintains high fidelity with 4+ reference subjects, opening new possibilities for complex multi-subject synthesis applications.
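To make the idea of pushing subjects into orthogonal attention subspaces more concrete, here is a rough sketch of one possible penalty: the mean squared cosine similarity between per-subject attention vectors. This is an illustrative assumption, not the multi-reference disentanglement loss defined in the paper, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(subject_maps):
    """Mean squared cosine similarity over all pairs of subject attention vectors.

    The value is 0 when every pair is orthogonal and 1 when all vectors are parallel.
    """
    n = len(subject_maps)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0  # a single subject has nothing to interfere with
    total = sum(cosine(subject_maps[i], subject_maps[j]) ** 2 for i, j in pairs)
    return total / len(pairs)


# Hypothetical flattened attention maps for three reference subjects.
separated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
overlapping = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [2.0, 2.0, 0.0]]
low = orthogonality_penalty(separated)
high = orthogonality_penalty(overlapping)
```

Minimizing a term of this kind would discourage two subjects from attending to the same regions, which is the intuition behind avoiding identity blending.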

[310] Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing

Quan Dao,Xiaoxiao He,Ligong Han,Ngan Hoai Nguyen,Amin Heyrani Nobar,Faez Ahmed,Han Zhang,Viet Anh Nguyen,Dimitris Metaxas

Main category: cs.CV

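TL;DR: This paper proposes VARIN, a noise inversion-based text-guided image editing technique designed for visual autoregressive (VAR) models, which uses Location-aware Argmax Inversion (LAI) to generate inverse Gumbel noises that enable precise reconstruction and controllable editing of images.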

Details

Motivation: Although conditional generation has been widely studied, the ability to perform prompt-guided image editing without additional training is equally important because it supports many practical applications. The paper explores the text-to-image editing capabilities of VAR models. Method: The paper introduces VARIN, a noise inversion-based editing technique that uses LAI to generate inverse Gumbel noises, enabling precise reconstruction of the source image and editing according to text prompts. Result: Experiments show that VARIN effectively modifies source images according to the specified text prompts while largely preserving the original background and structural details. Conclusion: VARIN is validated as an effective text-guided image editing method that is particularly suited to VAR models and has broad potential for real-world applications. Abstract: Visual autoregressive models (VAR) have recently emerged as a promising class of generative models, achieving performance comparable to diffusion models in text-to-image generation tasks. While conditional generation has been widely explored, the ability to perform prompt-guided image editing without additional training is equally critical, as it supports numerous practical real-world applications. This paper investigates the text-to-image editing capabilities of VAR by introducing Visual AutoRegressive Inverse Noise (VARIN), the first noise inversion-based editing technique designed explicitly for VAR models. VARIN leverages a novel pseudo-inverse function for argmax sampling, named Location-aware Argmax Inversion (LAI), to generate inverse Gumbel noises. These inverse noises enable precise reconstruction of the source image and facilitate targeted, controllable edits aligned with textual prompts. Extensive experiments demonstrate that VARIN effectively modifies source images according to specified prompts while significantly preserving the original background and structural details, thus validating its efficacy as a practical editing approach.
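As background for the argmax sampling and noise inversion described above, the sketch below shows the Gumbel-max trick (a token is chosen as the argmax of logits plus Gumbel noise) and one naive way to construct noise that reproduces an already observed token. The inversion rule here is a simple stand-in written for illustration; it is not the Location-aware Argmax Inversion (LAI) from the paper, and all names, the margin value, and the example logits are assumptions.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample via the inverse CDF."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def gumbel_max_sample(logits, noise):
    """Gumbel-max trick: the sampled token is argmax(logits + noise)."""
    return argmax([l + g for l, g in zip(logits, noise)])


def naive_inverse_noise(logits, token, rng, margin=1e-3):
    """Return noise under which gumbel_max_sample picks `token` (needs >= 2 logits).

    Noise is drawn freely for every position; if the target token does not
    already win, its noise is raised just enough to beat the best competitor.
    """
    noise = [sample_gumbel(rng) for _ in logits]
    best_other = max(l + g for i, (l, g) in enumerate(zip(logits, noise)) if i != token)
    if logits[token] + noise[token] <= best_other:
        noise[token] = best_other - logits[token] + margin
    return noise


rng = random.Random(0)
logits = [2.0, -1.0, 0.5]  # hypothetical scores over a 3-token codebook
recovered = [gumbel_max_sample(logits, naive_inverse_noise(logits, k, rng)) for k in range(3)]
```

Re-running the sampler with the recovered noise and the original logits reproduces the source tokens, while running it with logits from an edited prompt can change tokens where the new logits differ strongly. That is the general mechanism that noise inversion-based editing relies on.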

[311] Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination

Ziyun Zeng,Junhao Zhang,Wei Li,Mike Zheng Shou

Main category: cs.CV

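TL;DR: This study proposes the DIM dataset and model, which substantially improve image editing by redistributing responsibilities between the understanding module and the generation module.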

Details

Motivation: Current multimodal models perform poorly on image editing tasks, mainly because responsibilities are divided unevenly between the understanding module and the generation module. Method: The authors introduce the Draw-In-Mind (DIM) dataset and train a model that connects Qwen2.5-VL-3B and SANA1.5-1.6B through a lightweight MLP. Result: DIM-4.6B-Edit performs strongly on the ImgEdit and GEdit-Bench benchmarks, outperforming larger models such as UniWorld-V1 and Step1X-Edit. Conclusion: By explicitly assigning the design responsibility to the understanding module, DIM-4.6B-Edit gains a significant advantage in image editing, which demonstrates the effectiveness of this division of labor. Abstract: In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit.
These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models will be available at https://github.com/showlab/DIM.

[312] Explaining What Machines See: XAI Strategies in Deep Object Detection Models

FatemehSadat Seyedmomeni,Mohammad Ali Keyvanrad

Main category: cs.CV

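TL;DR: This paper reviews explainable artificial intelligence (XAI) techniques applied to object detection models, analyzing their taxonomy, applicability, and evaluation metrics, and emphasizing the importance of model explanation in critical domains.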

Details

Motivation: Deep learning has achieved great success in computer vision tasks, but its black-box nature and high complexity pose major challenges for interpretability, especially in critical domains such as autonomous driving, medical imaging, and security systems. Explainable artificial intelligence (XAI) therefore aims to provide tools and methods that make model decisions more transparent, interpretable, and trustworthy. Method: The paper categorizes existing XAI techniques into perturbation-based, gradient-based, backpropagation-based, and graph-based methods and analyzes state-of-the-art explainability methods comprehensively. It also analyzes publication trends from 2022 to mid-2025 statistically, examines common datasets and evaluation metrics, and assesses existing methods critically. Result: The paper discusses XAI techniques built on these mechanisms in detail and investigates their applicability to object detection architectures including YOLO, SSD, Faster R-CNN, and EfficientDet. It also shows accelerating growth in publications from 2022 to mid-2025, indicating the increasing importance of explainable object detection. Conclusion: The paper concludes that XAI is becoming increasingly important in object detection and that many effective explanation methods have emerged, such as D-RISE, BODEM, D-CLOSE, and FSOD. By categorizing and evaluating existing methods, it guides researchers and practitioners in selecting suitable explainability techniques and fosters the development of more interpretable AI systems. Abstract: In recent years, deep learning has achieved unprecedented success in various computer vision tasks, particularly in object detection. However, the black-box nature and high complexity of deep neural networks pose significant challenges for interpretability, especially in critical domains such as autonomous driving, medical imaging, and security systems. Explainable Artificial Intelligence (XAI) aims to address this challenge by providing tools and methods to make model decisions more transparent, interpretable, and trustworthy for humans. This review provides a comprehensive analysis of state-of-the-art explainability methods specifically applied to object detection models. The paper begins by categorizing existing XAI techniques based on their underlying mechanisms: perturbation-based, gradient-based, backpropagation-based, and graph-based methods. Notable methods such as D-RISE, BODEM, D-CLOSE, and FSOD are discussed in detail. Furthermore, the paper investigates their applicability to various object detection architectures, including YOLO, SSD, Faster R-CNN, and EfficientDet. Statistical analysis of publication trends from 2022 to mid-2025 shows an accelerating interest in explainable object detection, indicating its increasing importance. The study also explores common datasets and evaluation metrics, and highlights the major challenges associated with model interpretability.
By providing a structured taxonomy and a critical assessment of existing methods, this review aims to guide researchers and practitioners in selecting suitable explainability techniques for object detection applications and to foster the development of more interpretable AI systems.

[313] Palette Aligned Image Diffusion

Elad Aharoni,Noy Porat,Dani Lischinski,Ariel Shamir

Main category: cs.CV

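TL;DR: Palette-Adapter conditions text-to-image diffusion models on a user-specified color palette by interpreting the palette as a sparse histogram and introducing two scalar control parameters.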

Details

Motivation: Palettes are a compact and intuitive tool widely used in creative workflows, but they introduce significant ambiguity and instability when used to condition image generation. Method: Palettes are interpreted as sparse histograms, and two scalar control parameters are introduced: histogram entropy and palette-to-histogram distance. A negative histogram mechanism is also introduced to suppress specific undesired hues. Result: The method achieves stable, semantically coherent generation across a wide range of palettes and prompts, and outperforms existing methods in qualitative, quantitative, and user study evaluations. Conclusion: Palette-Adapter provides a flexible and effective way to strengthen palette control in text-to-image generation models. Abstract: We introduce the Palette-Adapter, a novel method for conditioning text-to-image diffusion models on a user-specified color palette. While palettes are a compact and intuitive tool widely used in creative workflows, they introduce significant ambiguity and instability when used for conditioning image generation. Our approach addresses this challenge by interpreting palettes as sparse histograms and introducing two scalar control parameters: histogram entropy and palette-to-histogram distance, which allow flexible control over the degree of palette adherence and color variation. We further introduce a negative histogram mechanism that allows users to suppress specific undesired hues, improving adherence to the intended palette under the standard classifier-free guidance mechanism. To ensure broad generalization across the color space, we train on a carefully curated dataset with balanced coverage of rare and common colors. Our method enables stable, semantically coherent generation across a wide range of palettes and prompts. We evaluate our method qualitatively, quantitatively, and through a user study, and show that it consistently outperforms existing approaches in achieving both strong palette adherence and high image quality.
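The notion of a palette as a sparse histogram, and of histogram entropy as a measure of color variation, can be illustrated with a small sketch. The quantization into bins, the function names, and the example values are assumptions chosen for illustration; the abstract does not specify the color space or binning used by Palette-Adapter, and the palette-to-histogram distance is omitted because it is not defined there.

```python
import math


def sparse_histogram(palette_bins, num_bins):
    """Place equal mass on each palette color's bin and normalise to sum to 1."""
    hist = [0.0] * num_bins
    for b in palette_bins:
        hist[b] += 1.0
    total = sum(hist)
    return [h / total for h in hist]


def histogram_entropy(hist):
    """Shannon entropy in nats; empty bins contribute nothing."""
    return sum(-p * math.log(p) for p in hist if p > 0.0)


# Hypothetical setup: colors quantised into 16 bins.
four_color_palette = sparse_histogram([0, 5, 9, 14], num_bins=16)
single_color_palette = sparse_histogram([3], num_bins=16)
varied = histogram_entropy(four_color_palette)
uniform_tone = histogram_entropy(single_color_palette)
```

Under these assumptions a palette concentrated on one bin has zero entropy, while a palette spread evenly over four bins has entropy log 4, so a higher entropy value corresponds to more color variation in the target image.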

[314] Vision-Based Embedded System for Noncontact Monitoring of Preterm Infant Behavior in Low-Resource Care Settings

Stanley Mugisha,Rashid Kisitu,Francis Komakech,Excellence Favor

Main category: cs.CV

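TL;DR: This paper presents an embedded monitoring system based on a quantized MobileNet model that performs real-time neonatal behavioral state detection on a Raspberry Pi, offering a scalable, low-cost, and clinically actionable NICU monitoring system.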

Details

Motivation: Continuous monitoring of preterm infants is critical, but current approaches rely on manual observation or invasive sensors, which are prone to error, impractical, and can cause skin damage. Method: The authors introduce an embedded monitoring system that uses a quantized MobileNet model deployed on a Raspberry Pi for real-time behavioral state detection. Result: The system achieves state-of-the-art accuracy (91.8% for sleep detection and 97.7% for crying/normal classification) while maintaining computational efficiency suitable for edge deployment. Conclusion: Lightweight, optimized models such as MobileNet offer the most viable foundation for scalable, low-cost, and clinically actionable NICU monitoring systems, paving the way for improved preterm care in resource-constrained environments. Abstract: Preterm birth remains a leading cause of neonatal mortality, disproportionately affecting low-resource settings with limited access to advanced neonatal intensive care units (NICUs). Continuous monitoring of infant behavior, such as sleep/awake states and crying episodes, is critical but relies on manual observation or invasive sensors, which are prone to error, impractical, and can cause skin damage. This paper presents a novel, noninvasive, and automated vision-based framework to address this gap. We introduce an embedded monitoring system that utilizes a quantized MobileNet model deployed on a Raspberry Pi for real-time behavioral state detection. When trained and evaluated on public neonatal image datasets, our system achieves state-of-the-art accuracy (91.8% for sleep detection and 97.7% for crying/normal classification) while maintaining computational efficiency suitable for edge deployment. Through comparative benchmarking, we provide a critical analysis of the trade-offs between model size, inference latency, and diagnostic accuracy. Our findings demonstrate that while larger architectures (e.g., ResNet152, VGG19) offer marginal gains in accuracy, their computational cost is prohibitive for real-time edge use. The proposed framework integrates three key innovations: model quantization for memory-efficient inference (68% reduction in size), Raspberry Pi-optimized vision pipelines, and secure IoT communication for clinical alerts.
This work conclusively shows that lightweight, optimized models such as the MobileNet offer the most viable foundation for scalable, low-cost, and clinically actionable NICU monitoring systems, paving the way for improved preterm care in resource-constrained environments.

[315] Unsupervised Training of Vision Transformers with Synthetic Negatives

Nikolaos Giakoumoglou,Andreas Floros,Kleanthis Marios Papadopoulos,Tania Stathaki

Main category: cs.CV

TL;DR: By integrating synthetic hard negatives into vision transformers, this paper improves self-supervised learning.

Details

Motivation: Previous work has rarely explored the potential of synthetic hard negatives in vision transformers. Method: Synthetic hard negatives are integrated into vision transformer models for self-supervised learning. Result: The technique improves the performance of both the DeiT-S and Swin-T architectures. Conclusion: Synthetic hard negatives notably improve the representation learning of vision transformer models. Abstract: This paper does not introduce a novel method per se. Instead, we address the neglected potential of hard negative samples in self-supervised learning. Previous works explored synthetic hard negatives but rarely in the context of vision transformers. We build on this observation and integrate synthetic hard negatives to improve vision transformer representation learning. This simple yet effective technique notably improves the discriminative power of learned representations. Our experiments show performance improvements for both DeiT-S and Swin-T architectures.

[316] See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems

Halima Bouzidi,Haoyu Liu,Mohammad Al Faruque

Main category: cs.CV

TL;DR: This paper identifies security vulnerabilities in Referring Multi-Object Tracking (RMOT) systems and introduces the VEIL adversarial framework to exploit these weaknesses, highlighting the need for more secure RMOT designs.

Details

Motivation: Despite advances in RMOT systems, their reliability and robustness remain underexplored, particularly regarding security implications and adversarial vulnerabilities. Method: The authors propose VEIL, an adversarial framework designed to disrupt RMOT models by introducing digital and physical perturbations that affect tracking logic reliability. Result: Experiments on the Refer-KITTI dataset demonstrate that VEIL effectively induces track ID switches and terminations, revealing vulnerabilities in spatial-temporal reasoning and memory mechanisms of RMOT models. Conclusion: The paper concludes that RMOT systems are vulnerable to adversarial attacks, particularly through the linguistic-visual referring and track-object matching components, and highlights the urgent need for security-aware RMOT designs. Abstract: Language-vision understanding has driven the development of advanced perception systems, most notably the emerging paradigm of Referring Multi-Object Tracking (RMOT). By leveraging natural-language queries, RMOT systems can selectively track objects that satisfy a given semantic description, guided through Transformer-based spatial-temporal reasoning modules. End-to-End (E2E) RMOT models further unify feature extraction, temporal memory, and spatial reasoning within a Transformer backbone, enabling long-range spatial-temporal modeling over fused textual-visual representations. Despite these advances, the reliability and robustness of RMOT remain underexplored. In this paper, we examine the security implications of RMOT systems from a design-logic perspective, identifying adversarial vulnerabilities that compromise both the linguistic-visual referring and track-object matching components. Additionally, we uncover a novel vulnerability in advanced RMOT models employing FIFO-based memory, whereby targeted and consistent attacks on their spatial-temporal reasoning introduce errors that persist within the history buffer over multiple subsequent frames. 
We present VEIL, a novel adversarial framework designed to disrupt the unified referring-matching mechanisms of RMOT models. We show that carefully crafted digital and physical perturbations can undermine the reliability of the tracking logic, inducing track ID switches and terminations. We conduct comprehensive evaluations using the Refer-KITTI dataset to validate the effectiveness of VEIL and demonstrate the urgent need for security-aware RMOT designs for critical large-scale applications.

[317] Fake & Square: Training Self-Supervised Vision Transformers with Synthetic Data and Synthetic Hard Negatives

Nikolaos Giakoumoglou,Andreas Floros,Kleanthis Marios Papadopoulos,Tania Stathaki

Main category: cs.CV

TL;DR: This paper explores the use of synthetic data and synthetic hard negatives in self-supervised learning for vision transformers, combining both methods in a framework called Syn2Co.

Details

Motivation: The motivation is to find alternatives to contrastive self-supervised learning, which usually requires large amounts of real-world data and carefully curated hard negatives. Method: The paper builds on existing self-supervised learning approaches, exploring the use of synthetic data and synthetic hard negatives in vision transformers. It introduces a framework called Syn2Co that combines both methods. Result: The framework Syn2Co is evaluated on DeiT-S and Swin-T architectures to determine if synthetically enhanced training can lead to more robust and transferable visual representations. Conclusion: The paper concludes that synthetic data shows promise but also has limitations in self-supervised learning, providing insights for future research. Abstract: This paper does not introduce a new method per se. Instead, we build on existing self-supervised learning approaches for vision, drawing inspiration from the adage "fake it till you make it". While contrastive self-supervised learning has achieved remarkable success, it typically relies on vast amounts of real-world data and carefully curated hard negatives. To explore alternatives to these requirements, we investigate two forms of "faking it" in vision transformers. First, we study the potential of generative models for unsupervised representation learning, leveraging synthetic data to augment sample diversity. Second, we examine the feasibility of generating synthetic hard negatives in the representation space, creating diverse and challenging contrasts. Our framework - dubbed Syn2Co - combines both approaches and evaluates whether synthetically enhanced training can lead to more robust and transferable visual representations on DeiT-S and Swin-T architectures. Our findings highlight the promise and limitations of synthetic data in self-supervised learning, offering insights for future work in this direction.

[318] ContextFusion and Bootstrap: An Effective Approach to Improve Slot Attention-Based Object-Centric Learning

Pinzhuo Tian,Shengjie Yang,Hang Yu,Alex C. Kot

Main category: cs.CV

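TL;DR: This paper proposes a new object-centric learning approach that adds a ContextFusion stage and a Bootstrap Branch to address the limitations of existing slot attention methods in high-level semantic information and encoder fine-tuning, and it reports significant performance gains in experiments.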

Details

Motivation: Existing slot attention methods assign image regions to slots mainly from low-level features such as color and texture, which makes the model overly sensitive to those features and limits its understanding of object contours, shapes, or other semantic characteristics. In addition, existing methods require a stable feature space throughout training, which restricts the flexibility needed for effective object-centric learning. Method: Existing slot attention models are extended with a ContextFusion stage and a Bootstrap Branch. ContextFusion exploits semantic information from the foreground and background, while the Bootstrap Branch uses a bootstrap strategy to train a feature-adaptive mechanism. Result: Experiments show that the proposed method significantly improves the performance of different SOTA slot attention models on both simulated and real-world datasets. Conclusion: The proposed ContextFusion stage and Bootstrap Branch effectively address the limitations of existing slot attention methods in high-level semantic information and encoder fine-tuning. Abstract: A key human ability is to decompose a scene into distinct objects and use their relationships to understand the environment. Object-centric learning aims to mimic this process in an unsupervised manner. Recently, the slot attention-based framework has emerged as a leading approach in this area and has been widely used in various downstream tasks. However, existing slot attention methods face two key limitations: (1) A lack of high-level semantic information. In current methods, image areas are assigned to slots based on low-level features such as color and texture. This makes the model overly sensitive to low-level features and limits its understanding of object contours, shapes, or other semantic characteristics. (2) The inability to fine-tune the encoder. Current methods require a stable feature space throughout training to enable reconstruction from slots, which restricts the flexibility needed for effective object-centric learning. To address these limitations, we propose a novel ContextFusion stage and a Bootstrap Branch, both of which can be seamlessly integrated into existing slot attention models. In the ContextFusion stage, we exploit semantic information from the foreground and background, incorporating an auxiliary indicator that provides additional contextual cues about them to enrich the semantic content beyond low-level features. In the Bootstrap Branch, we decouple feature adaptation from the original reconstruction phase and introduce a bootstrap strategy to train a feature-adaptive mechanism, allowing for more flexible adaptation. Experimental results show that our method significantly improves the performance of different SOTA slot attention models on both simulated and real-world datasets.

[319] A Data-Centric Approach to Pedestrian Attribute Recognition: Synthetic Augmentation via Prompt-driven Diffusion Models

Alejandro Alonso,Sawaiz A. Chaudhry,Juan C. SanMiguel,Álvaro García-Martín,Pablo Ayuso-Albizu,Pablo Carballeira

Main category: cs.CV

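TL;DR: This paper proposes synthetic data augmentation guided by textual descriptions to improve pedestrian attribute recognition (PAR). It generates synthetic pedestrian images that are consistent with the datasets and seamlessly incorporates them into the training data, which improves recognition of underrepresented attributes and overall model performance while strengthening zero-shot generalization.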

Details

Motivation: Pedestrian attribute recognition is challenging on real-world data, and traditional approaches are constrained by training dataset limitations, particularly the under-representation of certain attributes. Method: The authors define a protocol to identify weakly recognized attributes across multiple datasets, propose a prompt-driven pipeline that uses diffusion models to generate synthetic pedestrian images, and develop a strategy to seamlessly incorporate synthetic samples into the training data, which considers prompt-based annotation rules and modifies the loss function. Result: Results on popular PAR datasets show that the approach not only improves recognition of underrepresented attributes but also raises overall model performance and strengthens zero-shot generalization. Conclusion: Without changing the model architecture, the approach offers an efficient and scalable solution for improving pedestrian attribute recognition in the real world. Abstract: Pedestrian Attribute Recognition (PAR) is a challenging task as models are required to generalize across numerous attributes in real-world data. Traditional approaches focus on complex methods, yet recognition performance is often constrained by training dataset limitations, particularly the under-representation of certain attributes. In this paper, we propose a data-centric approach to improve PAR by synthetic data augmentation guided by textual descriptions. First, we define a protocol to identify weakly recognized attributes across multiple datasets. Second, we propose a prompt-driven pipeline that leverages diffusion models to generate synthetic pedestrian images while preserving the consistency of PAR datasets. Finally, we derive a strategy to seamlessly incorporate synthetic samples into training data, which considers prompt-based annotation rules and modifies the loss function. Results on popular PAR datasets demonstrate that our approach not only boosts recognition of underrepresented attributes but also improves overall model performance beyond the targeted attributes. Notably, this approach strengthens zero-shot generalization without requiring architectural changes of the model, presenting an efficient and scalable solution to improve the recognition of attributes of pedestrians in the real world.

[320] SALAD -- Semantics-Aware Logical Anomaly Detection

Matic Fučka,Vitjan Zavrtanik,Danijel Skočaj

Main category: cs.CV

TL;DR: SALAD is a novel method for logical anomaly detection that improves performance by modeling semantic relationships in object composition maps.

Details

Motivation: Existing logical anomaly detection methods discard spatial and semantic information, leading to suboptimal performance, which SALAD aims to address. Method: SALAD uses a semantics-aware discriminative approach with a new composition branch to model object composition maps and introduces a novel procedure for extracting these maps without hand-made labels or category-specific information. Result: SALAD achieves an impressive image-level AUROC of 96.1% on the MVTec LOCO benchmark, outperforming state-of-the-art methods. Conclusion: SALAD significantly improves logical anomaly detection performance on the MVTec LOCO benchmark, achieving an image-level AUROC of 96.1%. Abstract: Recent surface anomaly detection methods excel at identifying structural anomalies, such as dents and scratches, but struggle with logical anomalies, such as irregular or missing object components. The best-performing logical anomaly detection approaches rely on aggregated pretrained features or handcrafted descriptors (most often derived from composition maps), which discard spatial and semantic information, leading to suboptimal performance. We propose SALAD, a semantics-aware discriminative logical anomaly detection method that incorporates a newly proposed composition branch to explicitly model the distribution of object composition maps, consequently learning important semantic relationships. Additionally, we introduce a novel procedure for extracting composition maps that requires no hand-made labels or category-specific information, in contrast to previous methods. By effectively modelling the composition map distribution, SALAD significantly improves upon state-of-the-art methods on the standard benchmark for logical anomaly detection, MVTec LOCO, achieving an impressive image-level AUROC of 96.1%. Code: https://github.com/MaticFuc/SALAD

[321] NOOUGAT: Towards Unified Online and Offline Multi-Object Tracking

Benjamin Missaoui,Orcun Cetintas,Guillem Brasó,Tim Meinhardt,Laura Leal-Taixé

Main category: cs.CV

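TL;DR: NOOUGAT is a new tracker that operates with arbitrary temporal horizons and achieves state-of-the-art performance in both online and offline tracking regimes.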

Details

Motivation: Existing online trackers rely on frame-by-frame hand-crafted association strategies and struggle with long-term occlusions, whereas offline approaches can cover larger time gaps but still rely on heuristic stitching. Method: NOOUGAT uses a unified Graph Neural Network (GNN) framework that processes non-overlapping subclips and fuses them through a novel Autoregressive Long-term Tracking (ALT) layer. Result: NOOUGAT improves online AssA by +2.3 on DanceTrack, +9.2 on SportsMOT, and +5.0 on MOT20, with even greater gains in offline mode. Conclusion: NOOUGAT is a tracker that operates with arbitrary temporal horizons and achieves state-of-the-art performance in both online and offline tracking regimes. Abstract: The long-standing division between online and offline Multi-Object Tracking (MOT) has led to fragmented solutions that fail to address the flexible temporal requirements of real-world deployment scenarios. Current online trackers rely on frame-by-frame hand-crafted association strategies and struggle with long-term occlusions, whereas offline approaches can cover larger time gaps, but still rely on heuristic stitching for arbitrarily long sequences. In this paper, we introduce NOOUGAT, the first tracker designed to operate with arbitrary temporal horizons. NOOUGAT leverages a unified Graph Neural Network (GNN) framework that processes non-overlapping subclips, and fuses them through a novel Autoregressive Long-term Tracking (ALT) layer. The subclip size controls the trade-off between latency and temporal context, enabling a wide range of deployment scenarios, from frame-by-frame to batch processing. NOOUGAT achieves state-of-the-art performance across both tracking regimes, improving online AssA by +2.3 on DanceTrack, +9.2 on SportsMOT, and +5.0 on MOT20, with even greater gains in offline mode.
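
The role of the subclip size can be illustrated with a small sketch that only partitions frame indices; the GNN and the ALT layer are not modeled here. The function name `split_into_subclips` is hypothetical and chosen for this example.

```python
def split_into_subclips(num_frames, subclip_size):
    """Partition frame indices 0..num_frames-1 into non-overlapping subclips.

    subclip_size == 1 corresponds to frame-by-frame (online) processing,
    subclip_size >= num_frames to a single batch (offline) pass.
    """
    if subclip_size < 1:
        raise ValueError("subclip_size must be at least 1")
    return [
        list(range(start, min(start + subclip_size, num_frames)))
        for start in range(0, num_frames, subclip_size)
    ]


online = split_into_subclips(7, 1)    # seven one-frame subclips
windowed = split_into_subclips(7, 3)  # [[0, 1, 2], [3, 4, 5], [6]]
offline = split_into_subclips(7, 7)   # one subclip with all frames
print(windowed)
```

A larger subclip gives the tracker more temporal context per step but delays output until the subclip is complete, which is the latency trade-off the abstract describes.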

[322] SegFormer Fine-Tuning with Dropout: Advancing Hair Artifact Removal in Skin Lesion Analysis

Asif Mohammed Saad,Umme Niraj Mahi

Main category: cs.CV

TL;DR: This paper proposes a SegformerWithDropout model for precise hair artifact segmentation in dermoscopic images, showing strong performance with potential benefits for skin cancer detection tasks.

Details

Motivation: Hair artifacts in dermoscopic images pose challenges for accurate skin lesion analysis, potentially obscuring critical diagnostic features in dermatological assessments. Method: A fine-tuned SegFormer model with dropout regularization was used for hair mask segmentation. The architecture employed the MiT-B2 encoder pretrained on ImageNet, with 3 input channels and 2 output classes. Dropout regularization with a probability of 0.3 was applied to prevent overfitting. Training was conducted using 10-fold cross-validation, AdamW optimization, cross-entropy loss, and early stopping based on validation loss. Result: The model achieved an average Dice coefficient of approximately 0.96, IoU values of 0.93, PSNR around 34 dB, SSIM of 0.97, and low LPIPS of 0.06, indicating high effectiveness in hair artifact segmentation. Conclusion: The proposed SegformerWithDropout model demonstrates robust performance in hair artifact segmentation, showing potential to enhance preprocessing for skin cancer detection tasks. Abstract: Hair artifacts in dermoscopic images present significant challenges for accurate skin lesion analysis, potentially obscuring critical diagnostic features in dermatological assessments. This work introduces a fine-tuned SegFormer model augmented with dropout regularization to achieve precise hair mask segmentation. The proposed SegformerWithDropout architecture leverages the MiT-B2 encoder, pretrained on ImageNet, with an in-channel count of 3 and 2 output classes, incorporating a dropout probability of 0.3 in the segmentation head to prevent overfitting. Training is conducted on a specialized dataset of 500 dermoscopic skin lesion images with fine-grained hair mask annotations, employing 10-fold cross-validation, AdamW optimization with a learning rate of 0.001, and cross-entropy loss. Early stopping is applied based on validation loss, with a patience of 3 epochs and a maximum of 20 epochs per fold. 
Performance is evaluated using a comprehensive suite of metrics, including Intersection over Union (IoU), Dice coefficient, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). Experimental results from the cross-validation demonstrate robust performance, with average Dice coefficients reaching approximately 0.96 and IoU values of 0.93, alongside favorable PSNR (around 34 dB), SSIM (0.97), and low LPIPS (0.06), highlighting the model's effectiveness in accurate hair artifact segmentation and its potential to enhance preprocessing for downstream skin cancer detection tasks.
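
As a reading aid for the two overlap metrics reported above, here is a minimal sketch of Dice and IoU on flattened binary masks. The helper name and the toy masks are assumptions for illustration; the paper's evaluation code is not shown in the abstract.

```python
def dice_and_iou(pred, target):
    """Compute (Dice, IoU) for two flat binary masks given as sequences of 0/1."""
    intersection = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    pred_pos = sum(pred)
    target_pos = sum(target)
    union = pred_pos + target_pos - intersection
    if union == 0:
        # Both masks are empty: treat as a perfect match.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_pos + target_pos)
    iou = intersection / union
    return dice, iou


# Toy 8-pixel hair masks (hypothetical data).
pred = [1, 1, 1, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 0, 1, 1]
dice, iou = dice_and_iou(pred, target)
print(dice, iou)  # 0.75 0.6
```

For a single mask the two metrics are tied by Dice = 2·IoU / (1 + IoU), so an IoU of 0.93 corresponds to a Dice of about 0.964, in line with the reported "approximately 0.96". Averages taken over many images need not satisfy the identity exactly.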

[323] Enhancing Zero-Shot Pedestrian Attribute Recognition with Synthetic Data Generation: A Comparative Study with Image-To-Image Diffusion Models

Pablo Ayuso-Albizu,Juan C. SanMiguel,Pablo Carballeira

Main category: cs.CV

TL;DR: This paper explores the potential of diffusion models in addressing the issue of scarce annotated datasets in Pedestrian Attribute Recognition (PAR), identifying key parameters of img2img diffusion-based data expansion and demonstrating that optimal selection can lead to a 4.5% improvement in PAR recognition performance.

Details

Motivation: The scarcity of large-scale annotated datasets hinders the generalization of PAR models, especially in complex scenarios involving occlusions, varying poses, and diverse environments. Recent advances in diffusion models have shown promise for generating diverse and realistic synthetic images, making it possible to expand the size and variability of training data. Method: The paper investigates the effectiveness of diffusion models in generating synthetic pedestrian images tailored to PAR tasks and identifies key parameters of img2img diffusion-based data expansion, including text prompts, image properties, and the latest enhancements in diffusion-based data augmentation. Result: Experimental results show that prompt alignment and image properties are critical factors in image generation, with optimal selection leading to a 4.5% improvement in PAR recognition performance. Conclusion: Prompt alignment and image properties are critical factors in image generation, with optimal selection leading to a 4.5% improvement in PAR recognition performance. Abstract: Pedestrian Attribute Recognition (PAR) involves identifying various human attributes from images with applications in intelligent monitoring systems. The scarcity of large-scale annotated datasets hinders the generalization of PAR models, especially in complex scenarios involving occlusions, varying poses, and diverse environments. Recent advances in diffusion models have shown promise for generating diverse and realistic synthetic images, making it possible to expand the size and variability of training data. However, the potential of diffusion-based data expansion for generating PAR-like images remains underexplored. Such expansion may enhance the robustness and adaptability of PAR models in real-world scenarios. This paper investigates the effectiveness of diffusion models in generating synthetic pedestrian images tailored to PAR tasks. We identify key parameters of img2img diffusion-based data expansion, including text prompts, image properties, and the latest enhancements in diffusion-based data augmentation, and examine their impact on the quality of generated images for PAR. Furthermore, we employ the best-performing expansion approach to generate synthetic images for training PAR models, by enriching the zero-shot datasets. Experimental results show that prompt alignment and image properties are critical factors in image generation, with optimal selection leading to a 4.5% improvement in PAR recognition performance.

[324] Omnidirectional Spatial Modeling from Correlated Panoramas

Xinshen Zhang,Tongxi Fu,Xu Zheng

Main category: cs.CV

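TL;DR: This paper proposes CFpano, a new dataset for panoramic scene understanding, and the model \methodname, which achieves state-of-the-art performance on cross-frame correlated panoramic scenes.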

Details

Motivation: Panoramic scene understanding is vital for many downstream applications, but current methods neglect cross-frame correlations, which limits scene understanding. Method: The authors propose the CFpano dataset and the \methodname model, which uses Group Relative Policy Optimization (GRPO) and tailored reward functions for multimodal reasoning. Result: \methodname performs well on both multiple-choice and open-ended VQA tasks, improving overall performance by 5.37%. Conclusion: CFpano and \methodname provide a new benchmark for panoramic scene understanding and achieve state-of-the-art performance across multiple tasks. Abstract: Omnidirectional scene understanding is vital for various downstream applications, such as embodied AI, autonomous driving, and immersive environments, yet remains challenging due to geometric distortion and complex spatial relations in 360° imagery. Existing omnidirectional methods achieve scene understanding within a single frame while neglecting cross-frame correlated panoramas. To bridge this gap, we introduce CFpano, the first benchmark dataset dedicated to visual question answering over cross-frame correlated panoramas in holistic 360° scenes. CFpano consists of over 2700 images together with over 8000 question-answer pairs, and the question types include both multiple choice and open-ended VQA. Building upon our CFpano, we further present \methodname, a multi-modal large language model (MLLM) fine-tuned with Group Relative Policy Optimization (GRPO) and a set of tailored reward functions for robust and consistent reasoning with cross-frame correlated panoramas. Benchmark experiments with existing MLLMs are conducted with our CFpano. The experimental results demonstrate that \methodname achieves state-of-the-art performance across both multiple-choice and open-ended VQA tasks, outperforming strong baselines on all major reasoning categories (+5.37% in overall performance). Our analyses validate the effectiveness of GRPO and establish a new benchmark for panoramic scene understanding.
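
The core of GRPO is that each sampled answer is scored relative to the other answers drawn for the same prompt. A minimal sketch of that group-relative advantage follows; the reward values are hypothetical, the paper's tailored reward functions are not reproduced, and the use of the population standard deviation is a choice made for this example.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)  # population variance
    std = math.sqrt(variance)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one panoramic VQA question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that score above the group mean receive a positive advantage and are reinforced, those below are discouraged, and a group with identical rewards contributes no learning signal.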

[325] ADVMEM: Adversarial Memory Initialization for Realistic Test-Time Adaptation via Tracklet-Based Benchmarking

Shyma Alhuwaider,Motasem Alfarra,Juan C. Perez,Merey Ramazanova,Bernard Ghanem

Main category: cs.CV

TL;DR: This paper introduces a new tracklet-based dataset for benchmarking TTA methods, aiming to mimic real-world scenarios with temporal dependencies. It highlights the limitations of current TTA methods and proposes a novel strategy to improve their performance.

Details

Motivation: The motivation is to address the shortcomings of current TTA benchmarks, which fail to represent realistic scenarios with temporal dependencies, such as consecutive video frames showing the same object over time. Method: The authors proposed a new TTA benchmark called the 'Inherent Temporal Dependencies' (ITD) dataset, which mimics real-world scenarios with temporal dependencies. They conducted a thorough experimental analysis of current TTA methods and proposed a novel adversarial memory initialization strategy to enhance memory-based TTA methods. Result: The results show that the proposed adversarial memory initialization strategy substantially boosts the performance of various TTA methods on the new ITD dataset. Conclusion: The paper concludes that current TTA methods have limitations when facing temporal dependencies, and their proposed adversarial memory initialization strategy significantly improves performance on their new benchmark dataset. Abstract: We introduce a novel tracklet-based dataset for benchmarking test-time adaptation (TTA) methods. The aim of this dataset is to mimic the intricate challenges encountered in real-world environments such as images captured by hand-held cameras, self-driving cars, etc. The current benchmarks for TTA focus on how models face distribution shifts, when deployed, and on violations to the customary independent-and-identically-distributed (i.i.d.) assumption in machine learning. Yet, these benchmarks fail to faithfully represent realistic scenarios that naturally display temporal dependencies, such as how consecutive frames from a video stream likely show the same object across time. We address this shortcoming of current datasets by proposing a novel TTA benchmark we call the "Inherent Temporal Dependencies" (ITD) dataset. 
We ensure the instances in ITD naturally embody temporal dependencies by collecting them from tracklets, i.e., sequences of object-centric images that we compile from the bounding boxes of an object-tracking dataset. We use ITD to conduct a thorough experimental analysis of current TTA methods, and shed light on the limitations of these methods when faced with the challenges of temporal dependencies. Moreover, we build upon these insights and propose a novel adversarial memory initialization strategy to improve memory-based TTA methods. We find this strategy substantially boosts the performance of various methods on our challenging benchmark.

[326] Palmistry-Informed Feature Extraction and Analysis using Machine Learning

Shweta Patil

Main category: cs.CV

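TL;DR: This paper develops a machine learning framework for analyzing palm features that can identify complex patterns in palm data, offering a new approach for digital anthropometry and personalized user analytics.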

Details

Motivation: The paper aims to move beyond traditional subjective interpretation by providing a data-driven, quantitative framework for studying correlations between palmar morphology and externally validated traits or conditions. Method: The authors present a computer vision pipeline that extracts key characteristics from palm images, such as principal line structures, texture, and shape metrics, and use these features to train predictive models on a novel dataset. Result: The methodology demonstrates feasibility for applications in digital anthropometry and personalized user analytics, with potential for deployment on mobile platforms. Conclusion: Machine learning models can identify complex patterns in palm data, opening avenues for research at the intersection of cultural practices and computational analysis. Abstract: This paper explores the automated analysis of palmar features using machine learning techniques. We present a computer vision pipeline that extracts key characteristics from palm images, such as principal line structures, texture, and shape metrics. These features are used to train predictive models on a novel dataset curated from annotated palm images. Our approach moves beyond traditional subjective interpretation by providing a data-driven, quantitative framework for studying the correlations between palmar morphology and externally validated traits or conditions. The methodology demonstrates feasibility for applications in digital anthropometry and personalized user analytics, with potential for deployment on mobile platforms. Results indicate that machine learning models can identify complex patterns in palm data, opening avenues for research that intersects cultural practices with computational analysis.

[327] A Multimodal Cross-View Model for Predicting Postoperative Neck Pain in Cervical Spondylosis Patients

Jingyang Shan,Qishuai Yu,Jiacen Liu,Shaolin Zhang,Wen Shen,Yanxiao Zhao,Tianyi Wang,Xiaolin Qin,Yiheng Yin

Main category: cs.CV

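TL;DR: This study proposes a new deep learning model that effectively predicts postoperative neck pain recovery in cervical spondylosis patients and improves prediction accuracy.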

Details

Motivation: Neck pain is the primary symptom of cervical spondylosis, but its underlying mechanisms remain unclear, which leads to uncertain treatment outcomes and calls for a more effective prediction model. Method: The paper proposes an Adaptive Bidirectional Pyramid Difference Convolution (ABPDC) module and a Feature Pyramid Registration Auxiliary Network (FPRAN) to address the difficulty of multimodal feature fusion caused by imaging differences and spatial mismatches. Result: Experiments on the MMCSD dataset show that the model predicts postoperative neck pain recovery with high accuracy, and ablation studies further confirm its effectiveness. Conclusion: The proposed ABPDC module and FPRAN perform well in predicting postoperative neck pain recovery in cervical spondylosis and outperform existing methods. Abstract: Neck pain is the primary symptom of cervical spondylosis, yet its underlying mechanisms remain unclear, leading to uncertain treatment outcomes. To address the challenges of multimodal feature fusion caused by imaging differences and spatial mismatches, this paper proposes an Adaptive Bidirectional Pyramid Difference Convolution (ABPDC) module that facilitates multimodal integration by exploiting the advantages of difference convolution in texture extraction and grayscale invariance, and a Feature Pyramid Registration Auxiliary Network (FPRAN) to mitigate structural misalignment. Experiments on the MMCSD dataset demonstrate that the proposed model achieves superior prediction accuracy of postoperative neck pain recovery compared with existing methods, and ablation studies further confirm its effectiveness.

[328] DSGC-Net: A Dual-Stream Graph Convolutional Network for Crowd Counting via Feature Correlation Mining

Yihong Wu,Jinqiao Wei,Xionghui Zhao,Yidi Li,Shaoyi Du,Bin Ren,Nicu Sebe

Main category: cs.CV

TL;DR: DSGC-Net, a Dual-Stream Graph Convolutional Network, improves crowd counting accuracy by addressing challenges in density distribution and individual representation inconsistencies, achieving superior performance on benchmark datasets.

Details

Motivation: Existing deep learning-based crowd counting models face challenges in adapting to significant density distribution differences between regions and inconsistencies in individual representations due to viewpoint changes and body posture differences. Method: The proposed DSGC-Net uses a Dual-Stream Graph Convolutional Network with a Density Approximation (DA) branch and a Representation Approximation (RA) branch. It captures feature correlations in density variations and representation distributions through semantic graphs, enhancing the model's adaptability and accuracy. Result: Extensive experiments on three widely used datasets show that DSGC-Net outperforms current state-of-the-art methods, achieving an MAE of 48.9 and 5.9 on the ShanghaiTech Part A and Part B datasets, respectively. Conclusion: DSGC-Net demonstrates superior performance in crowd counting by addressing challenges related to density distribution differences and inconsistency of individual representations in complex crowd scenarios. Abstract: Deep learning-based crowd counting methods have achieved remarkable progress in recent years. However, in complex crowd scenarios, existing models still face challenges when adapting to significant density distribution differences between regions. Additionally, the inconsistency of individual representations caused by viewpoint changes and body posture differences further limits the counting accuracy of the models. To address these challenges, we propose DSGC-Net, a Dual-Stream Graph Convolutional Network based on feature correlation mining. DSGC-Net introduces a Density Approximation (DA) branch and a Representation Approximation (RA) branch. By modeling two semantic graphs, it captures the potential feature correlations in density variations and representation distributions. The DA branch incorporates a density prediction module that generates the density distribution map, and constructs a density-driven semantic graph based on density similarity. 
The RA branch establishes a representation-driven semantic graph by computing global representation similarity. Then, graph convolutional networks are applied to the two semantic graphs separately to model the latent semantic relationships, which enhances the model's ability to adapt to density variations and improves counting accuracy in multi-view and multi-pose scenarios. Extensive experiments on three widely used datasets demonstrate that DSGC-Net outperforms current state-of-the-art methods. In particular, we achieve an MAE of 48.9 and 5.9 on the ShanghaiTech Part A and Part B datasets, respectively. The released code is available at: https://github.com/Wu-eon/CrowdCounting-DSGCNet.
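
To make the idea of a density-driven semantic graph concrete, here is a rough sketch that links each region to the regions with the most similar density and then runs one parameter-free propagation step. The similarity kernel, the top-k neighbor rule, and all names are assumptions for illustration; the released repository is the reference for the actual DA and RA branches, and a real graph convolution layer would also apply learned weights and a nonlinearity.

```python
import math


def density_similarity_graph(densities, k=2):
    """Row-normalized adjacency: each region keeps a self-loop plus its k most
    density-similar regions (similarity assumed to be exp(-|d_i - d_j|))."""
    n = len(densities)
    adjacency = []
    for i in range(n):
        sims = [(math.exp(-abs(densities[i] - densities[j])), j) for j in range(n) if j != i]
        sims.sort(reverse=True)
        row = [0.0] * n
        row[i] = 1.0  # self-loop
        for sim, j in sims[:k]:
            row[j] = sim
        total = sum(row)
        adjacency.append([value / total for value in row])
    return adjacency


def propagate(adjacency, features):
    """One propagation step: every node takes a weighted average of its neighbors."""
    channels = len(features[0])
    return [
        [sum(row[j] * features[j][c] for j in range(len(features))) for c in range(channels)]
        for row in adjacency
    ]


# Four image regions: two sparse, two dense (hypothetical density values).
densities = [0.1, 0.2, 5.0, 5.1]
features = [[1.0], [1.0], [0.0], [0.0]]
adjacency = density_similarity_graph(densities, k=1)
smoothed = propagate(adjacency, features)
print(smoothed)
```

With `k=1` the sparse regions exchange information only with each other, and likewise the dense ones, so features are shared among regions of similar crowd density rather than among spatial neighbors.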

[329] RS-OOD: A Vision-Language Augmented Framework for Out-of-Distribution Detection in Remote Sensing

Yingrui Ji,Jiansheng Chen,Jingbo Chen,Anzhi Yue,Chenhao Wang,Kai Li,Yao Zhu

Main category: cs.CV

TL;DR: RS-OOD: A novel framework for robust few-shot OOD detection in remote sensing imagery using spatial-semantic integration.

Details

Motivation: Out-of-distribution detection is a critical challenge in remote sensing applications due to data scarcity, complex multi-scale scene structures, and pronounced distribution shifts. Method: The RS-OOD framework incorporates spatial feature enhancement, a dual-prompt alignment mechanism, and a confidence-guided self-training loop. Result: RS-OOD consistently outperforms existing methods across multiple remote sensing benchmarks and enables efficient adaptation with minimal labeled data. Conclusion: RS-OOD demonstrates that spatial-semantic integration is crucial for effective OOD detection in remote sensing applications. Abstract: Out-of-distribution (OOD) detection represents a critical challenge in remote sensing applications, where reliable identification of novel or anomalous patterns is essential for autonomous monitoring, disaster response, and environmental assessment. Despite remarkable progress in OOD detection for natural images, existing methods and benchmarks remain poorly suited to remote sensing imagery due to data scarcity, complex multi-scale scene structures, and pronounced distribution shifts. To this end, we propose RS-OOD, a novel framework that leverages remote sensing-specific vision-language modeling to enable robust few-shot OOD detection. Our approach introduces three key innovations: spatial feature enhancement that improves scene discrimination, a dual-prompt alignment mechanism that cross-verifies scene context against fine-grained semantics for spatial-semantic consistency, and a confidence-guided self-training loop that dynamically mines pseudo-labels to expand training data without manual annotation. RS-OOD consistently outperforms existing methods across multiple remote sensing benchmarks and enables efficient adaptation with minimal labeled data, demonstrating the critical value of spatial-semantic integration.
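
The confidence-guided pseudo-label mining mentioned in the abstract can be pictured with a simple thresholding sketch: unlabeled samples whose top softmax probability clears a threshold are kept with their predicted class. The fixed threshold of 0.9 and the function names are assumptions; the abstract describes the mining as dynamic and does not give the rule.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mine_pseudo_labels(batch_logits, threshold=0.9):
    """Return (sample_index, predicted_class, confidence) for confident samples."""
    selected = []
    for index, logits in enumerate(batch_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((index, probs.index(confidence), confidence))
    return selected


# Hypothetical classifier logits for three unlabeled remote sensing scenes.
batch_logits = [[4.0, 0.0, 0.0], [0.5, 0.4, 0.3], [0.0, 0.0, 5.0]]
pseudo_labels = mine_pseudo_labels(batch_logits)
print([(index, label) for index, label, _ in pseudo_labels])  # [(0, 0), (2, 2)]
```

The ambiguous second sample is left out, so only confident predictions are added to the training pool in each self-training round.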

[330] SynthGenNet: a self-supervised approach for test-time generalization using synthetic multi-source domain mixing of street view images

Pushpendra Dhakara,Prachi Chachodhia,Vaibhav Kumar

Main category: cs.CV

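TL;DR: SynthGenNet combines multi-source synthetic data with self-supervised learning strategies to achieve strong domain generalization in complex urban environments.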

Details

Motivation: Unstructured urban environments, with their complex and diverse layouts, pose unique challenges for scene understanding and generalization. Method: Introduces the ClassMix++ algorithm, Grounded Mask Consistency Loss (GMC), and the Pseudo-Label Guided Contrastive Learning (PLGCL) mechanism, and uses a self-supervised student-teacher architecture for knowledge distillation. Result: Achieves 50% Mean Intersection-over-Union (mIoU) on real-world datasets such as the Indian Driving Dataset (IDD), surpassing state-of-the-art methods that rely on a single source. Conclusion: SynthGenNet effectively improves domain generalization in complex urban environments and reduces reliance on labeled target data. Abstract: Unstructured urban environments present unique challenges for scene understanding and generalization due to their complex and diverse layouts. We introduce SynthGenNet, a self-supervised student-teacher architecture designed to enable robust test-time domain generalization using synthetic multi-source imagery. Our contributions include the novel ClassMix++ algorithm, which blends labeled data from various synthetic sources while maintaining semantic integrity, enhancing model adaptability. We further employ Grounded Mask Consistency Loss (GMC), which leverages source ground truth to improve cross-domain prediction consistency and feature alignment. The Pseudo-Label Guided Contrastive Learning (PLGCL) mechanism is integrated into the student network to facilitate domain-invariant feature learning through iterative knowledge distillation from the teacher network. This self-supervised strategy improves prediction accuracy, addresses real-world variability, bridges the sim-to-real domain gap, and reduces reliance on labeled target data, even in complex urban areas. Outcomes show our model outperforms the state-of-the-art (relying on single source) by achieving 50% Mean Intersection-Over-Union (mIoU) value on real-world datasets like Indian Driving Dataset (IDD).
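
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how Mean Intersection-over-Union is commonly computed from flat per-pixel label lists. The function name `mean_iou` and the toy labels are illustrative assumptions, not taken from the SynthGenNet paper, and evaluation protocols for specific datasets may differ (for example in how ignored labels are handled).

```python
def mean_iou(pred, target, num_classes):
    """Average IoU over classes that occur in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # classes absent from both are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels (hypothetical data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```

In this toy case the per-class IoU values average to 0.5, which is what a reported mIoU of 50% means.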

[331] Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image Generation

Sapir Esther Yiflach,Yuval Atzmon,Gal Chechik

Main category: cs.CV

TL;DR: Learn-to-Steer improves spatial reasoning in text-to-image models by learning from internal representations, significantly boosting accuracy on challenging benchmarks.

Details

Motivation: Text-to-image diffusion models struggle with spatial reasoning tasks that are simple for humans, such as positioning objects correctly based on textual descriptions. Method: The method involves training a lightweight classifier to decode spatial relationships from the model's cross-attention maps and using it as a learned loss function during inference, combined with a dual-inversion strategy to enforce geometric understanding. Result: Spatial accuracy improved from 0.20 to 0.61 on FLUX.1-dev and from 0.07 to 0.54 on SD2.1 across standard benchmarks, showing strong generalization to multiple spatial relations. Conclusion: The proposed Learn-to-Steer framework significantly improves the spatial accuracy of text-to-image diffusion models by learning data-driven objectives for test-time optimization, outperforming existing methods on FLUX.1-dev and SD2.1 benchmarks. Abstract: Text-to-image diffusion models can generate stunning visuals, yet they often fail at tasks children find trivial--like placing a dog to the right of a teddy bear rather than to the left. When combinations get more unusual--a giraffe above an airplane--these failures become even more pronounced. Existing methods attempt to fix these spatial reasoning failures through model fine-tuning or test-time optimization with handcrafted losses that are suboptimal. Rather than imposing our assumptions about spatial encoding, we propose learning these objectives directly from the model's internal representations. We introduce Learn-to-Steer, a novel framework that learns data-driven objectives for test-time optimization rather than handcrafting them. Our key insight is to train a lightweight classifier that decodes spatial relationships from the diffusion model's cross-attention maps, then deploy this classifier as a learned loss function during inference. 
Training such classifiers poses a surprising challenge: they can take shortcuts by detecting linguistic traces rather than learning true spatial patterns. We solve this with a dual-inversion strategy that enforces geometric understanding. Our method dramatically improves spatial accuracy: from 0.20 to 0.61 on FLUX.1-dev and from 0.07 to 0.54 on SD2.1 across standard benchmarks. Moreover, our approach generalizes to multiple relations and significantly improves accuracy.
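
The idea of using a classifier as a loss during inference can be illustrated with a deliberately tiny stand-in. In the sketch below, the "latent" is just two horizontal object positions and the "classifier" is a fixed logistic function of their difference; both are hypothetical simplifications and are not the cross-attention classifier or the diffusion latents described in the abstract. The loop performs gradient descent on the negative log-probability that object A is to the right of object B.

```python
import math


def right_of_prob(xa, xb, w=4.0):
    """Toy classifier: probability that A is to the right of B (assumed form)."""
    return 1.0 / (1.0 + math.exp(-w * (xa - xb)))


def steer(xa, xb, steps=50, lr=0.1, w=4.0):
    """Gradient descent on loss = -log p(right-of) with respect to the positions."""
    for _ in range(steps):
        p = right_of_prob(xa, xb, w)
        # d(-log p)/d(xa) = -w * (1 - p) and d(-log p)/d(xb) = +w * (1 - p)
        g = w * (1.0 - p)
        xa += lr * g
        xb -= lr * g
    return xa, xb


# A starts to the left of B; steering moves it to the right.
xa, xb = steer(0.2, 0.8)
```

The design point this mirrors is that the objective comes from a trained model's output rather than from a handcrafted rule; everything else about the real method (what is optimized, and how the classifier is trained) is outside this sketch.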

[332] Hues and Cues: Human vs. CLIP

Nuria Alabau-Bosque,Jorge Vila-Tomás,Paula Daudén-Oliver,Pablo Hernández-Cámara,Jose Manuel Jaén-Lorites,Valero Laparra,Jesús Malo

Main category: cs.CV

TL;DR: Evaluates how well CLIP aligns with human characteristics using the board game Hues & Cues, revealing cultural biases and certain deficiencies.

Details

Motivation: Games are often designed to challenge human characteristics, yet such tasks are usually overlooked when evaluating artificial models. Method: Tests CLIP's color perception and color naming abilities by having it play the board game Hues & Cues, and assesses its alignment with humans. Result: Experiments show that CLIP is generally well aligned with human observers, but the approach reveals cultural biases and inconsistencies when CLIP deals with different abstraction levels. Conclusion: Model similarity to human characteristics can be assessed with different tasks such as board games, which can expose certain deficiencies in the models. Abstract: Playing games is inherently human, and a lot of games are created to challenge different human characteristics. However, these tasks are often left out when evaluating the human-like nature of artificial models. The objective of this work is to propose a new approach to evaluate artificial models via board games. To this effect, we test the color perception and color naming capabilities of CLIP by playing the board game Hues & Cues and assess its alignment with humans. Our experiments show that CLIP is generally well aligned with human observers, but our approach brings to light certain cultural biases and inconsistencies when dealing with different abstraction levels that are hard to identify with other testing strategies. Our findings indicate that assessing models with different tasks like board games can make certain deficiencies in the models stand out in ways that are difficult to test with the commonly used benchmarks.

[333] OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

Longrong Yang,Zhixiong Zeng,Yufeng Zhong,Jing Huang,Liming Zheng,Lei Chen,Haibo Qiu,Zequn Qin,Lin Ma,Xi Li

Main category: cs.CV

TL;DR: Proposes OmniActor, an agent trained on mixed GUI and embodied data that resolves the data conflict arising in multimodal tasks.

Details

Motivation: Multimodal agents often need to interact with both 2D virtual worlds and 3D real worlds, but these two types of data conflict. Method: Proposes Layer-heterogeneity MoE and unifies the action spaces of GUI and embodied tasks. Result: OmniActor outperforms agents trained only on GUI or embodied data on both GUI and embodied tasks. Conclusion: OmniActor is an efficient generalist agent that resolves the conflict between GUI and embodied data and exploits their synergy by separating deep-layer parameters and sharing shallow-layer parameters. Abstract: Multimodal large language models are evolving toward multimodal agents capable of proactively executing tasks. Most agent research focuses on GUI or embodied scenarios, which correspond to agents interacting with 2D virtual worlds or 3D real worlds, respectively. However, many complex tasks typically require agents to interleave interactions with these two types of environment. We initially mix GUI and embodied data to train, but find that the data conflict degrades performance. Further analysis reveals that GUI and embodied data exhibit synergy and conflict at the shallow and deep layers, respectively, which resembles the cerebrum-cerebellum mechanism in the human brain. To this end, we propose a high-performance generalist agent OmniActor, designed from both structural and data perspectives. First, we propose Layer-heterogeneity MoE to eliminate the conflict between GUI and embodied data by separating deep-layer parameters, while leveraging their synergy by sharing shallow-layer parameters. By successfully leveraging the synergy and eliminating the conflict, OmniActor outperforms agents trained only on GUI or embodied data in GUI or embodied tasks. Furthermore, we unify the action spaces of GUI and embodied tasks, and collect large-scale GUI and embodied data from various sources for training. This significantly improves OmniActor under different scenarios, especially in GUI tasks. The code will be publicly available.

[334] Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels

Alireza Sedighi Moghaddam,Mohammad Reza Mohammadi

Main category: cs.CV

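TL;DR: This study develops ORDinal Adaptive Correction (ORDAC), a novel data-centric method for detecting and correcting label noise in ordinal image classification tasks.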

Details

Motivation: Labeled data is essential for training supervised deep learning models, but the labeling process is prone to error and noise, especially in ordinal image classification, where class boundaries are often ambiguous. Such label noise can significantly degrade the performance and reliability of machine learning models. Method: Proposes a novel data-centric method, ORDinal Adaptive Correction (ORDAC), for adaptive correction of noisy labels. ORDAC dynamically adjusts the mean and standard deviation of each sample's label distribution. Result: Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) bring significant improvements in model performance. For example, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49. Conclusion: This study shows that adaptive label correction using label distributions is an effective strategy for enhancing the robustness and accuracy of ordinal classification models in the presence of noisy data. Abstract: Labeled data is a fundamental component in training supervised deep learning models for computer vision tasks. However, the labeling process, especially for ordinal image classification where class boundaries are often ambiguous, is prone to error and noise. Such label noise can significantly degrade the performance and reliability of machine learning models. This paper addresses the problem of detecting and correcting label noise in ordinal image classification tasks. To this end, a novel data-centric method called ORDinal Adaptive Correction (ORDAC) is proposed for adaptive correction of noisy labels. The proposed approach leverages the capabilities of Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty present in ordinal labels. During training, ORDAC dynamically adjusts the mean and standard deviation of the label distribution for each sample. Rather than discarding potentially noisy samples, this approach aims to correct them and make optimal use of the entire training dataset. The effectiveness of the proposed method is evaluated on benchmark datasets for age estimation (Adience) and disease severity detection (Diabetic Retinopathy) under various asymmetric Gaussian noise scenarios. Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) lead to significant improvements in model performance. For instance, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49.
The method also demonstrated its effectiveness in correcting intrinsic noise present in the original datasets. This research indicates that adaptive label correction using label distributions is an effective strategy to enhance the robustness and accuracy of ordinal classification models in the presence of noisy data.
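
To make the notion of a per-sample label distribution concrete, here is a minimal sketch assuming a discretised Gaussian over the ordinal classes, parameterised by a mean and a standard deviation. The `adjust` rule is purely hypothetical (a simple interpolation toward the model's predicted mean); the abstract does not specify ORDAC's actual update, so this should not be read as the authors' algorithm.

```python
import math


def label_distribution(mean, std, num_classes):
    """Discretised Gaussian over classes 0..num_classes-1, normalised to sum to 1."""
    weights = [math.exp(-0.5 * ((k - mean) / std) ** 2) for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def adjust(mean, std, predicted_mean, rate=0.5, min_std=0.3):
    """Hypothetical correction step: shift the mean toward the model's prediction
    and set the spread from the current disagreement (not the paper's rule)."""
    disagreement = abs(predicted_mean - mean)
    new_mean = mean + rate * (predicted_mean - mean)
    new_std = max(min_std, (1.0 - rate) * std + rate * disagreement)
    return new_mean, new_std


# A sample labelled class 2 whose model prediction is centred on class 4.
mean, std = adjust(2.0, 1.0, predicted_mean=4.0)
soft_label = label_distribution(mean, std, num_classes=8)
```

A soft target of this kind lets a possibly mislabelled sample keep contributing to training, rather than being discarded, which matches the motivation stated in the abstract.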

[335] Category-Aware 3D Object Composition with Disentangled Texture and Shape Multi-view Diffusion

Zeren Xiong,Zikun Chen,Zedong Zhang,Xiang Li,Ying Tai,Jian Yang,Jun Li

Main category: cs.CV

TL;DR: The paper proposes a new method called C33D for 3D object synthesis that integrates multi-view images and text descriptions to generate novel and structurally coherent 3D models, overcoming the limitations of existing methods.

Details

Motivation: The motivation is to address the challenges faced by existing methods in effectively integrating multiple content sources for 3D object synthesis, which often result in inconsistent textures and inaccurate shapes. Method: The method involves rendering multi-view images and normal maps from an input 3D model, generating a novel 2D object using adaptive text-image harmony (ATIH), refining textures and shapes through multi-view diffusion processes, and reconstructing a complete 3D model. Result: The result of the study shows that the proposed C33D method can generate impressive 3D creations with structural coherence, as demonstrated by the experimental results. Conclusion: The paper concludes that the proposed C33D method is effective in generating novel and structurally coherent 3D models by integrating multi-view images and text descriptions from different object categories. Abstract: In this paper, we tackle a new task of 3D object synthesis, where a 3D model is composited with another object category to create a novel 3D model. However, most existing text/image/3D-to-3D methods struggle to effectively integrate multiple content sources, often resulting in inconsistent textures and inaccurate shapes. To overcome these challenges, we propose a straightforward yet powerful approach, category+3D-to-3D (C33D), for generating novel and structurally coherent 3D models. Our method begins by rendering multi-view images and normal maps from the input 3D model, then generating a novel 2D object using adaptive text-image harmony (ATIH) with the front-view image and a text description from another object category as inputs. To ensure texture consistency, we introduce texture multi-view diffusion, which refines the textures of the remaining multi-view RGB images based on the novel 2D object. 
For enhanced shape accuracy, we propose shape multi-view diffusion to improve the 2D shapes of both the multi-view RGB images and the normal maps, also conditioned on the novel 2D object. Finally, these outputs are used to reconstruct a complete and novel 3D model. Extensive experiments demonstrate the effectiveness of our method, yielding impressive 3D creations, such as shark(3D)-crocodile(text) in the first row of Fig. 1. A project page is available at: https://xzr52.github.io/C33D/

[336] Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture

Wanyue Zhang,Yibin Huang,Yangbin Xu,JingJing Huang,Helu Zhi,Shuo Ren,Wang Xu,Jiajun Zhang

Main category: cs.CV

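TL;DR: The paper systematically analyzes the limitations of multimodal large language models in spatial understanding and proposes directions for improvement through data scaling and architectural optimization.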

Details

Motivation: Although MLLMs have made some progress in spatial understanding, existing research lacks a comprehensive and systematic evaluation of their limitations; the paper aims to fill this gap. Method: Working from both data and architectural perspectives, the paper proposes the MulSeT benchmark and designs a series of experiments to analyze the spatial reasoning capabilities of MLLMs. Result: Experiments show that spatial understanding performance converges quickly as training data increases, but the upper bound is low, especially on tasks requiring spatial imagination; in addition, spatial understanding relies more heavily on the positional encoding within the visual encoder. Conclusion: The paper identifies limitations of current multimodal large language models (MLLMs) in spatial understanding and, through analysis from both data and architectural perspectives, offers directions for improvement. Abstract: Spatial understanding is essential for Multimodal Large Language Models (MLLMs) to support perception, reasoning, and planning in embodied environments. Despite recent progress, existing studies reveal that MLLMs still struggle with spatial understanding. However, existing research lacks a comprehensive and systematic evaluation of these limitations, often restricted to isolated scenarios, such as single-view or video. In this work, we present a systematic analysis of spatial understanding from both data and architectural perspectives across three representative scenarios: single-view, multi-view, and video. We propose a benchmark named MulSeT (Multi-view Spatial Understanding Tasks), and design a series of experiments to analyze the spatial reasoning capabilities of MLLMs. From the data perspective, the performance of spatial understanding converges quickly as the training data increases, and the upper bound is relatively low, especially for tasks that require spatial imagination. This indicates that merely expanding training data is insufficient to achieve satisfactory performance. From the architectural perspective, we find that spatial understanding relies more heavily on the positional encoding within the visual encoder than within the language model, in both cascaded and native MLLMs. Moreover, we explore reasoning injection and envision future improvements through architectural design to optimize spatial understanding. These insights shed light on the limitations of current MLLMs and suggest new directions for improving spatial reasoning capabilities through data scaling and architectural tuning.

[337] MedDINOv3: How to adapt vision foundation models for medical image segmentation?

Yuheng Li,Yizhou Wu,Yuxiang Lai,Mingzhe Hu,Xiaofeng Yang

Main category: cs.CV

TL;DR: MedDINOv3 is a framework that adapts DINOv3, a vision foundation model pretrained on natural images, to medical image segmentation. It addresses the challenges of ViT backbone underperformance and the domain gap between natural and medical images by using a simple ViT architecture with multi-scale token aggregation and domain-adaptive pretraining on a large CT dataset. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks.

Details

Motivation: The motivation is to overcome the limitations of task-specific deep learning models in medical imaging, particularly their lack of generalizability across modalities and institutions, by adapting vision foundation models pretrained on natural images to medical imaging tasks. Method: MedDINOv3 uses a plain ViT architecture with multi-scale token aggregation and domain-adaptive pretraining on a large CT dataset (CT-3M) using a multi-stage DINOv3 recipe to learn robust dense features. Result: MedDINOv3 matches or exceeds state-of-the-art performance on four medical image segmentation benchmarks, showing the effectiveness of adapting vision foundation models to medical imaging tasks. Conclusion: MedDINOv3 demonstrates the potential of vision foundation models as unified backbones for medical image segmentation, matching or exceeding state-of-the-art performance across four segmentation benchmarks. Abstract: Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific, lacking generalizability across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbone of most foundation models still underperforms specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple and effective architecture with multi-scale token aggregation.
Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.

[338] Decoupling Bidirectional Geometric Representations of 4D cost volume with 2D convolution

Xiaobao Wei,Changyong Shu,Zhaokun Yue,Chang Huang,Weiwei Liu,Shuai Yang,Lirong Yang,Peng Gao,Wenbin Zhang,Gaochao Zhu,Chengxiang Wang

Main category: cs.CV

TL;DR: DBStereo is a real-time stereo matching method using 2D convolutions that achieves high accuracy and efficiency, making it suitable for mobile deployment.

Details

Motivation: Existing high-performance stereo matching methods using 3D regularization are unsuitable for mobile devices, while 2D-based methods struggle in ill-posed regions. This work aims to develop an efficient and accurate solution for real-time deployment. Method: DBStereo uses pure 2D convolutions and introduces a lightweight bidirectional geometry aggregation block to separately capture spatial and disparity representations, enabling decoupled learning. Result: DBStereo achieves superior accuracy and real-time performance, outperforming existing aggregation-based methods and even surpassing iterative-based methods like IGEV-Stereo in efficiency and precision. Conclusion: DBStereo provides a deployment-friendly solution for real-time stereo matching, breaking the empirical design of 3D convolutions for 4D cost volume and establishing a strong baseline for further research. Abstract: High-performance real-time stereo matching methods invariably rely on 3D regularization of the cost volume, which is unfriendly to mobile devices, while 2D-regularization-based methods struggle in ill-posed regions. In this paper, we present a deployment-friendly 4D cost aggregation network, DBStereo, which is based on pure 2D convolutions. Specifically, we first provide a thorough analysis of the decoupling characteristics of the 4D cost volume and then design a lightweight bidirectional geometry aggregation block to capture spatial and disparity representations, respectively. Through decoupled learning, our approach achieves real-time performance and impressive accuracy simultaneously. Extensive experiments demonstrate that our proposed DBStereo outperforms all existing aggregation-based methods in both inference time and accuracy, even surpassing the iterative-based method IGEV-Stereo. Our study breaks the empirical design of using 3D convolutions for the 4D cost volume and provides a simple yet strong baseline of the proposed decoupled aggregation paradigm for further study.
Code will be available at https://github.com/happydummy/DBStereo soon.

[339] From Noisy Labels to Intrinsic Structure: A Geometric-Structural Dual-Guided Framework for Noise-Robust Medical Image Segmentation

Tao Wang,Zhenxuan Zhang,Yuanbo Zhou,Xinlin Zhang,Yuanbin Chen,Tao Tan,Guang Yang,Tong Tong

Main category: cs.CV

TL;DR: This paper proposes GSD-Net, a network that uses geometric and structural cues to enhance robustness against noisy annotations in medical image segmentation, achieving state-of-the-art results.

Details

Motivation: The study aims to address the challenges posed by costly, time-consuming annotations and the inevitable noise in expert-labeled datasets, which affect model performance in medical image segmentation. Method: A Geometric Distance-Aware module, a Structure-Guided Label Refinement module, and a Knowledge Transfer module are introduced to dynamically adjust weights, refine labels, and enrich supervision, respectively. Result: GSD-Net demonstrated significant improvements in performance under noisy annotations, with specific gains of 2.52% on Kvasir, 22.76% on Shenzhen, 8.87% on BU-SUC, and 4.59% on BraTS2020 under SR simulated noise. Conclusion: GSD-Net achieves state-of-the-art performance under noisy annotations, showing significant improvements on multiple datasets. Abstract: The effectiveness of convolutional neural networks in medical image segmentation relies on large-scale, high-quality annotations, which are costly and time-consuming to obtain. Even expert-labeled datasets inevitably contain noise arising from subjectivity and coarse delineations, which disrupt feature learning and adversely impact model performance. To address these challenges, this study proposes a Geometric-Structural Dual-Guided Network (GSD-Net), which integrates geometric and structural cues to improve robustness against noisy annotations. It incorporates a Geometric Distance-Aware module that dynamically adjusts pixel-level weights using geometric features, thereby strengthening supervision in reliable regions while suppressing noise. A Structure-Guided Label Refinement module further refines labels with structural priors, and a Knowledge Transfer module enriches supervision and improves sensitivity to local details. To comprehensively assess its effectiveness, we evaluated GSD-Net on six publicly available datasets: four containing three types of simulated label noise, and two with multi-expert annotations that reflect real-world subjectivity and labeling inconsistencies.
Experimental results demonstrate that GSD-Net achieves state-of-the-art performance under noisy annotations, achieving improvements of 2.52% on Kvasir, 22.76% on Shenzhen, 8.87% on BU-SUC, and 4.59% on BraTS2020 under SR simulated noise. The codes of this study are available at https://github.com/ortonwang/GSD-Net.

[340] Faster and Better: Reinforced Collaborative Distillation and Self-Learning for Infrared-Visible Image Fusion

Yuhao Wang,Lingjuan Miao,Zhiqiang Zhou,Yajun Qiao,Lei Zhang

Main category: cs.CV

TL;DR: This paper proposes a reinforcement learning-based framework for infrared and visible image fusion, enabling lightweight models to learn effectively through collaborative distillation and self-learning, leading to improved fusion quality.

Details

Motivation: To address the challenge of achieving high-quality infrared and visible image fusion using lightweight models. Method: A reinforcement learning agent explores and identifies suitable training strategies for the student model by dynamically adjusting the teacher's guidance strength and generating challenging samples for self-learning. Result: Experimental results show that the method improves student performance and produces superior fusion results. Conclusion: The proposed collaborative distillation and self-learning framework driven by reinforcement learning significantly improves student model performance and achieves better image fusion results compared to existing techniques. Abstract: Infrared and visible image fusion plays a critical role in enhancing scene perception by combining complementary information from different modalities. Despite recent advances, achieving high-quality image fusion with lightweight models remains a significant challenge. To bridge this gap, we propose a novel collaborative distillation and self-learning framework for image fusion driven by reinforcement learning. Unlike conventional distillation, this approach not only enables the student model to absorb image fusion knowledge from the teacher model, but more importantly, allows the student to perform self-learning on more challenging samples to enhance its capabilities. Particularly, in our framework, a reinforcement learning agent explores and identifies a more suitable training strategy for the student. The agent takes both the student's performance and the teacher-student gap as inputs, which leads to the generation of challenging samples to facilitate the student's self-learning. Simultaneously, it dynamically adjusts the teacher's guidance strength based on the student's state to optimize the knowledge transfer.
Experimental results demonstrate that our method can significantly improve student performance and achieve better fusion results compared to existing techniques.

[341] Towards High-Fidelity, Identity-Preserving Real-Time Makeup Transfer: Decoupling Style Generation

Lydia Kin Ching Chau,Zhi Yu,Ruo Wei Jiang

Main category: cs.CV

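TL;DR: This paper presents a novel framework for real-time virtual makeup try-on that achieves high-fidelity, identity-preserving cosmetic transfer with robust temporal consistency.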

Details

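Motivation: In real-time makeup transfer applications, it is critical to synthesize temporally consistent results that accurately replicate fine-grained makeup while preserving the user's identity. However, existing methods often struggle to separate semitransparent cosmetics from skin tone and other identity features, causing identity shifts and raising fairness concerns. In addition, current methods lack real-time capability and fail to maintain temporal consistency, limiting practical adoption. Method: The method decouples makeup transfer into two steps: transparent makeup mask extraction and graphics-based mask rendering. After the makeup extraction step, makeup rendering can run in real time, enabling live makeup try-on. The makeup extraction model is trained on pseudo-ground-truth data generated via two complementary methods: a graphics-based rendering pipeline and an unsupervised k-means clustering approach. To further enhance transparency estimation and color fidelity, the authors propose specialized training objectives, including alpha-weighted reconstruction and lip color losses. Result: The method achieves robust makeup transfer across diverse poses, expressions, and skin tones while preserving temporal smoothness. Extensive experiments show that it outperforms existing baselines in capturing fine details, maintaining temporal stability, and preserving identity integrity. Conclusion: The paper presents an effective solution to the key challenges existing methods face in real-time makeup transfer applications, delivering a high-fidelity real-time makeup try-on experience. Abstract: We present a novel framework for real-time virtual makeup try-on that achieves high-fidelity, identity-preserving cosmetic transfer with robust temporal consistency. In live makeup transfer applications, it is critical to synthesize temporally coherent results that accurately replicate fine-grained makeup and preserve the user's identity. However, existing methods often struggle to disentangle semitransparent cosmetics from skin tones and other identity features, causing identity shifts and raising fairness concerns. Furthermore, current methods lack real-time capabilities and fail to maintain temporal consistency, limiting practical adoption. To address these challenges, we decouple makeup transfer into two steps: transparent makeup mask extraction and graphics-based mask rendering. After the makeup extraction step, the makeup rendering can be performed in real time, enabling live makeup try-on. Our makeup extraction model is trained on pseudo-ground-truth data generated via two complementary methods: a graphics-based rendering pipeline and an unsupervised k-means clustering approach. To further enhance transparency estimation and color fidelity, we propose specialized training objectives, including alpha-weighted reconstruction and lip color losses. Our method achieves robust makeup transfer across diverse poses, expressions, and skin tones while preserving temporal smoothness.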
Extensive experiments demonstrate that our approach outperforms existing baselines in capturing fine details, maintaining temporal stability, and preserving identity integrity.
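
One plausible reading of "graphics-based mask rendering" is standard alpha compositing of a semitransparent makeup layer over the skin pixel. The sketch below assumes that reading; the function name `composite` and the example colors are illustrative, and the paper's actual renderer may differ.

```python
def composite(skin, makeup, alpha):
    """Blend one RGB makeup color over one RGB skin pixel.

    alpha is the makeup opacity in [0, 1]: 0 leaves the skin untouched,
    1 replaces it with the makeup color (standard "over" blending).
    """
    return tuple(alpha * m + (1.0 - alpha) * s for s, m in zip(skin, makeup))


# Hypothetical pixel: a half-transparent lipstick shade over a skin tone.
skin = (200, 150, 120)
lipstick = (180, 30, 60)
blended = composite(skin, lipstick, alpha=0.5)
```

Because this blend is a per-pixel arithmetic operation, it is cheap enough to run per frame once the mask and opacity are known, which is consistent with the abstract's claim that rendering can happen in real time after extraction.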

[342] RiverScope: High-Resolution River Masking Dataset

Rangel Daroya,Taylor Rowley,Jonathan Flores,Elisa Friedmann,Fiona Bennitt,Heejin An,Travis Simmons,Marissa Jean Hughes,Camryn L Kluetmeier,Solomon Kica,J. Daniel Vélez,Sarah E. Esenther,Thomas E. Howard,Yanqi Ye,Audrey Turcotte,Colin Gleason,Subhransu Maji

Main category: cs.CV

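TL;DR: This paper introduces RiverScope, a high-resolution dataset for river and surface water monitoring, developed through collaboration between computer science and hydrology experts; it contains 1,145 high-resolution images and establishes the first global high-resolution benchmark for river width estimation.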

Details

Motivation: Monitoring rivers and surface water at fine spatial and temporal scales remains challenging, especially for narrow or sediment-rich rivers that traditional low-resolution satellite data capture poorly. Method: The RiverScope dataset was developed through interdisciplinary collaboration and includes high-resolution images with expert-labeled water masks, co-registered with Sentinel-2, SWOT, and the SWOT River Database; the performance of multiple deep learning models was also evaluated. Result: The RiverScope dataset contains 1,145 images covering 2,577 square kilometers, with annotation taking over 100 hours; it also establishes the first global high-resolution benchmark for river width estimation, with a median error of 7.2 meters. Conclusion: RiverScope provides a valuable resource for fine-scale and multi-sensor hydrological modeling, supporting climate adaptation and sustainable water management. Abstract: Surface water dynamics play a critical role in Earth's climate system, influencing ecosystems, agriculture, disaster resilience, and sustainable development. Yet monitoring rivers and surface water at fine spatial and temporal scales remains challenging -- especially for narrow or sediment-rich rivers that are poorly captured by low-resolution satellite data. To address this, we introduce RiverScope, a high-resolution dataset developed through collaboration between computer science and hydrology experts. RiverScope comprises 1,145 high-resolution images (covering 2,577 square kilometers) with expert-labeled river and surface water masks, requiring over 100 hours of manual annotation. Each image is co-registered with Sentinel-2, SWOT, and the SWOT River Database (SWORD), enabling the evaluation of cost-accuracy trade-offs across sensors -- a key consideration for operational water monitoring. We also establish the first global, high-resolution benchmark for river width estimation, achieving a median error of 7.2 meters -- significantly outperforming existing satellite-derived methods. We extensively evaluate deep networks across multiple architectures (e.g., CNNs and transformers), pretraining strategies (e.g., supervised and self-supervised), and training datasets (e.g., ImageNet and satellite imagery). Our best-performing models combine the benefits of transfer learning with the use of all the multispectral PlanetScope channels via learned adaptors. RiverScope provides a valuable resource for fine-scale and multi-sensor hydrological modeling, supporting climate adaptation and sustainable water management.

[343] GenCompositor: Generative Video Compositing with Diffusion Transformer

Shuzhou Yang,Xiaoyu Li,Xiaodong Cun,Guangzhi Wang,Lingen Li,Ying Shan,Jian Zhang

Main category: cs.CV

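TL;DR: This work proposes an automated generative video compositing method that uses an improved Diffusion Transformer to blend user-customized dynamic elements, improving the quality and efficiency of video compositing.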

Details

Motivation: Traditional video compositing requires intensive labor and expert collaboration, leading to long production cycles and high manpower costs, so an automated method is needed to improve efficiency. Method: Designs a novel pipeline based on the Diffusion Transformer (DiT), comprising a lightweight DiT branch for background preservation, a DiT fusion block using full self-attention, a foreground augmentation strategy, and a new position embedding (Extended Rotary Position Embedding, ERoPE) for fusing background and foreground videos under user control. Result: Experiments show that the method effectively realizes generative video compositing and outperforms existing possible solutions in fidelity and consistency. Conclusion: The paper proposes a generative video compositing method based on the Diffusion Transformer (DiT) and builds VideoComp, a dataset of 61K sets of videos; experiments show the method outperforms existing solutions in fidelity and consistency. Abstract: Video compositing combines live-action footage to create video production, serving as a crucial technique in video creation and film production. Traditional pipelines require intensive labor efforts and expert collaboration, resulting in lengthy production cycles and high manpower costs. To address this issue, we automate this process with generative models, called generative video compositing. This new task strives to adaptively inject identity and motion information of foreground video to the target video in an interactive manner, allowing users to customize the size, motion trajectory, and other attributes of the dynamic elements added to the final video. Specifically, we designed a novel Diffusion Transformer (DiT) pipeline based on its intrinsic properties. To maintain consistency of the target video before and after editing, we revised a light-weight DiT-based background preservation branch with masked token injection. To inherit dynamic elements from other sources, a DiT fusion block is proposed using full self-attention, along with a simple yet effective foreground augmentation for training. Besides, for fusing background and foreground videos with different layouts based on user control, we developed a novel position embedding, named Extended Rotary Position Embedding (ERoPE). Finally, we curated a dataset comprising 61K sets of videos for our new task, called VideoComp. This data includes complete dynamic elements and high-quality target videos.
Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing possible solutions in fidelity and consistency.

[344] TeRA: Rethinking Text-driven Realistic 3D Avatar Generation

Yanwen Wang,Yiyu Zhuang,Jiawei Zhang,Li Wang,Yifei Zeng,Xun Cao,Xinxin Zuo,Hao Zhu

Main category: cs.CV

TL;DR: TeRA is a two-stage framework for text-to-3D avatar generation that improves efficiency, effectiveness, and enables text-based customization, outperforming prior models.

Details

Motivation: The motivation is to improve upon SDS-based and large 3D generative models by creating a more efficient, effective, and customizable solution for text-to-avatar generation. Method: The method involves a two-stage training strategy: first, distilling a decoder to create a structured latent space from a large human reconstruction model, and second, training a text-controlled latent diffusion model to generate 3D avatars within this space. Result: The experiments showed that TeRA outperforms existing text-to-avatar generative models in both subjective and objective evaluations. Conclusion: TeRA provides an efficient and effective framework for text-to-avatar generation, surpassing previous models in performance and customization. Abstract: In this paper, we rethink text-to-avatar generative models by proposing TeRA, a more efficient and effective framework than the previous SDS-based models and general large 3D generative models. Our approach employs a two-stage training strategy for learning a native 3D avatar generative model. Initially, we distill a decoder to derive a structured latent space from a large human reconstruction model. Subsequently, a text-controlled latent diffusion model is trained to generate photorealistic 3D human avatars within this latent space. TeRA enhances the model performance by eliminating slow iterative optimization and enables text-based partial customization through a structured 3D human representation. Experiments have proven our approach's superiority over previous text-to-avatar generative models in subjective and objective evaluation.

[345] Anisotropic Fourier Features for Positional Encoding in Medical Imaging

Nabil Jabareen,Dongsheng Yuan,Dingming Liu,Foo-Wei Ten,Sören Lukassen

Main category: cs.CV

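TL;DR: This study proposes AFPE, a new positional encoding suited to complex shapes and anisotropic data in medical images; systematic benchmarks show it outperforms existing methods.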

Details

Motivation: Sinusoidal Positional Encodings (SPEs) and Isotropic Fourier Feature Positional Encodings (IFPEs) have limitations in medical imaging: the former struggle to preserve Euclidean distances in higher-dimensional spaces, and the latter cannot account for anisotropy in images. Method: The authors propose Anisotropic Fourier Feature Positional Encoding (AFPE) and systematically benchmark it against commonly used positional encodings on chest X-rays, organ classification in CT images, and ejection fraction regression in echocardiography. Result: Choosing the correct positional encoding can significantly improve model performance, and the proposed AFPE significantly outperforms state-of-the-art positional encodings in all tested anisotropic settings. Conclusion: For anisotropic medical images and videos, it is essential to choose an anisotropic positional encoding that fits the data and the shape of interest. Abstract: The adoption of Transformer-based architectures in the medical domain is growing rapidly. In medical imaging, the analysis of complex shapes - such as organs, tissues, or other anatomical structures - combined with the often anisotropic nature of high-dimensional images complicates these adaptations. In this study, we critically examine the role of Positional Encodings (PEs), arguing that commonly used approaches may be suboptimal for the specific challenges of medical imaging. Sinusoidal Positional Encodings (SPEs) have proven effective in vision tasks, but they struggle to preserve Euclidean distances in higher-dimensional spaces. Isotropic Fourier Feature Positional Encodings (IFPEs) have been proposed to better preserve Euclidean distances, but they lack the ability to account for anisotropy in images. To address these limitations, we propose Anisotropic Fourier Feature Positional Encoding (AFPE), a generalization of IFPE that incorporates anisotropic, class-specific, and domain-specific spatial dependencies. We systematically benchmark AFPE against commonly used PEs on multi-label classification in chest X-rays, organ classification in CT images, and ejection fraction regression in echocardiography. Our results demonstrate that choosing the correct PE can significantly improve model performance. We show that the optimal PE depends on the shape of the structure of interest and the anisotropy of the data. Finally, our proposed AFPE significantly outperforms state-of-the-art PEs in all tested anisotropic settings. We conclude that, in anisotropic medical images and videos, it is of paramount importance to choose an anisotropic PE that fits the data and the shape of interest.
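
The abstract does not say how AFPE parameterizes anisotropy, so the following is only a minimal sketch of the general idea, assuming Gaussian random Fourier features whose frequency spread differs per axis. The function names and the `sigmas` values are hypothetical and are not taken from the paper.

```python
import math
import random

def sample_frequencies(num_freqs, sigmas, seed=0):
    """Draw frequency vectors from a zero-mean Gaussian with one std per axis.

    Equal sigmas give an isotropic encoding; unequal sigmas give an
    anisotropic one (assumption: this is how anisotropy is modeled here).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, s) for s in sigmas] for _ in range(num_freqs)]

def fourier_features(coord, freqs):
    """Map a coordinate x to [cos(2*pi*b.x), sin(2*pi*b.x)] for each frequency b."""
    feats = []
    for b in freqs:
        phase = 2.0 * math.pi * sum(bi * xi for bi, xi in zip(b, coord))
        feats.append(math.cos(phase))
        feats.append(math.sin(phase))
    return feats

# Hypothetical 3D volume with a smaller frequency spread along the first axis.
freqs = sample_frequencies(num_freqs=8, sigmas=(0.5, 4.0, 4.0))
encoding = fourier_features((0.25, 0.5, 0.75), freqs)
```

Setting all entries of `sigmas` to the same value recovers an isotropic encoding in this sketch, which matches the abstract's description of AFPE as a generalization of IFPE.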

[346] Enhancing Fitness Movement Recognition with Attention Mechanism and Pre-Trained Feature Extractors

Shanjid Hasan Nishat,Srabonti Deb,Mohiuddin Ahmed

Main category: cs.CV

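TL;DR: Proposes a lightweight fitness movement recognition framework that combines 2D CNNs with an attention-enhanced LSTM, achieving high accuracy and suiting real-time applications.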

Details

Motivation: Existing deep learning approaches rely on computationally intensive 3D models, which limits their feasibility in real-time or resource-constrained settings, so a lightweight yet effective solution is needed. Method: Pre-trained 2D CNN models are combined with an LSTM network enhanced by spatial attention to extract spatial features and capture temporal dependencies. Result: The framework achieves a peak accuracy of 93.34% on a subset of the UCF101 dataset and outperforms several state-of-the-art HAR systems. Conclusion: The framework strikes a good balance between computational efficiency and accuracy and is suitable for real-time fitness activity recognition. Abstract: Fitness movement recognition, a focused subdomain of human activity recognition (HAR), plays a vital role in health monitoring, rehabilitation, and personalized fitness training by enabling automated exercise classification from video data. However, many existing deep learning approaches rely on computationally intensive 3D models, limiting their feasibility in real-time or resource-constrained settings. In this paper, we present a lightweight and effective framework that integrates pre-trained 2D Convolutional Neural Networks (CNNs) such as ResNet50, EfficientNet, and Vision Transformers (ViT) with a Long Short-Term Memory (LSTM) network enhanced by spatial attention. These models efficiently extract spatial features while the LSTM captures temporal dependencies, and the attention mechanism emphasizes informative segments. We evaluate the framework on a curated subset of the UCF101 dataset, achieving a peak accuracy of 93.34% with the ResNet50-based configuration. Comparative results demonstrate the superiority of our approach over several state-of-the-art HAR systems. The proposed method offers a scalable and real-time-capable solution for fitness activity recognition with broader applications in vision-based health and activity monitoring.
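
To illustrate how an attention mechanism can emphasize informative segments of a sequence, here is a minimal sketch of softmax attention pooling over per-frame feature vectors. It is a generic illustration and not the paper's architecture: the `query` vector stands in for learned attention parameters, and the per-frame vectors stand in for LSTM outputs; both are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Weight each time step by softmax(h_t . query) and return the weighted sum."""
    scores = [sum(h * q for h, q in zip(h_t, query)) for h_t in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(w * h_t[d] for w, h_t in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return pooled, weights

# Three hypothetical per-frame feature vectors and a hypothetical query.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(frames, query=[1.0, 1.0])
```

In this toy input the third frame scores highest against the query, so it receives the largest weight and dominates the pooled vector that a classifier would consume.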

Guyue Hu,Siyuan Song,Jingpeng Sun,Zhe Jin,Chenglong Li,Jin Tang

Main category: cs.CV

TL;DR: This paper introduces a new paradigm, mix-modal federated learning (MixMFL), and proposes a modality decoupling and memorizing framework (MDM-MixMFL) to address challenges in non-centralized MRI image segmentation.

Details

Motivation: Current MRI segmentation methods rely on centralized paradigms, which are unsuitable for non-centralized mixed-modal medical scenarios. There is a need to address extensive client-wise modality and data heterogeneity in distributed settings. Method: A novel modality decoupling strategy and a modality memorizing mechanism are proposed to address the heterogeneity in distributed clients' data and modalities during federated learning. Result: Extensive experiments on two public datasets demonstrate the effectiveness and superiority of the proposed MDM-MixMFL framework in MRI image segmentation. Conclusion: The proposed MDM-MixMFL framework demonstrates effectiveness in handling non-centralized mix-modal MRI image segmentation by addressing modality and data heterogeneity through modality decoupling and memorizing mechanisms. Abstract: Magnetic resonance imaging (MRI) image segmentation is crucial in diagnosing and treating many diseases, such as brain tumors. Existing MRI image segmentation methods mainly fall into a centralized multimodal paradigm, which is inapplicable in engineering non-centralized mix-modal medical scenarios. In this situation, each distributed client (hospital) processes multiple mixed MRI modalities, and the modality set and image data for each client are diverse, suffering from extensive client-wise modality heterogeneity and data heterogeneity. In this paper, we first formulate non-centralized mix-modal MRI image segmentation as a new paradigm for federated learning (FL) that involves multiple modalities, called mix-modal federated learning (MixMFL). It differs from the existing multimodal federated learning (MulMFL) and cross-modal federated learning (CroMFL) paradigms. Then, we propose a novel modality decoupling and memorizing mix-modal federated learning framework (MDM-MixMFL) for MRI image segmentation, which is characterized by a modality decoupling strategy and a modality memorizing mechanism. Specifically, the modality decoupling strategy disentangles each modality into modality-tailored and modality-shared information. During mix-modal federated updating, corresponding modality encoders undergo tailored and shared updating, respectively. It facilitates stable and adaptive federated aggregation of heterogeneous data and modalities from distributed clients. Besides, the modality memorizing mechanism stores client-shared modality prototypes dynamically refreshed from every modality-tailored encoder to compensate for incomplete modalities in each local client. It further benefits modality aggregation and fusion processes during mix-modal federated learning. Extensive experiments on two public datasets for MRI image segmentation demonstrate the effectiveness and superiority of our methods.
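
The abstract does not spell out the aggregation rule, so the following is only a sketch of one plausible reading of modality-wise updating: the server averages each modality's encoder weights over just the clients that hold that modality. The function name, the flat weight lists, and the modality labels `T1` and `T2` are illustrative assumptions, not details from the paper.

```python
def aggregate_by_modality(client_updates):
    """Average encoder weights per modality over the clients that own it.

    client_updates: list of dicts mapping modality name -> list of float weights.
    Clients lacking a modality simply do not contribute to its average.
    """
    totals, counts = {}, {}
    for update in client_updates:
        for modality, weights in update.items():
            if modality not in totals:
                totals[modality] = [0.0] * len(weights)
                counts[modality] = 0
            totals[modality] = [t + w for t, w in zip(totals[modality], weights)]
            counts[modality] += 1
    return {m: [t / counts[m] for t in totals[m]] for m in totals}

# Two hypothetical hospitals: only the first one has T2 scans.
clients = [
    {"T1": [1.0, 2.0], "T2": [0.0, 4.0]},
    {"T1": [3.0, 4.0]},
]
global_encoders = aggregate_by_modality(clients)
```

Averaging per modality rather than per client keeps a hospital with a missing modality from diluting that modality's encoder, which is the kind of heterogeneity the abstract describes.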

[348] Motion-Refined DINOSAUR for Unsupervised Multi-Object Discovery

Xinrui Gong,Oliver Hahn,Christoph Reich,Krishnakant Singh,Simone Schaub-Meyer,Daniel Cremers,Stefan Roth

Main category: cs.CV

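TL;DR: MR-DINOSAUR is a fully unsupervised multi-object discovery method that uses motion cues from video and the self-supervised pre-trained model DINOSAUR to achieve strong performance.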

Details

Motivation: Existing unsupervised multi-object discovery methods rely on supervision to generate the pseudo labels used to train the OCL model, which limits their applicability. Method: High-quality unsupervised pseudo labels are generated by retrieving video frames without camera motion; DINOSAUR's slot representations are refined with these labels, and a slot deactivation module is trained to assign slots to foreground and background. Result: MR-DINOSAUR achieves strong multi-object discovery results on the TRI-PD and KITTI datasets without any human supervision. Conclusion: MR-DINOSAUR is a fully unsupervised multi-object discovery method that achieves strong results on the TRI-PD and KITTI datasets and outperforms the previous state of the art. Abstract: Unsupervised multi-object discovery (MOD) aims to detect and localize distinct object instances in visual scenes without any form of human supervision. Recent approaches leverage object-centric learning (OCL) and motion cues from video to identify individual objects. However, these approaches use supervision to generate pseudo labels to train the OCL model. We address this limitation with MR-DINOSAUR -- Motion-Refined DINOSAUR -- a minimalistic unsupervised approach that extends the self-supervised pre-trained OCL model, DINOSAUR, to the task of unsupervised multi-object discovery. We generate high-quality unsupervised pseudo labels by retrieving video frames without camera motion for which we perform motion segmentation of unsupervised optical flow. We refine DINOSAUR's slot representations using these pseudo labels and train a slot deactivation module to assign slots to foreground and background. Despite its conceptual simplicity, MR-DINOSAUR achieves strong multi-object discovery results on the TRI-PD and KITTI datasets, outperforming the previous state of the art despite being fully unsupervised.

[349] FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

You Shen,Zhipeng Zhang,Yansong Qu,Liujuan Cao

Main category: cs.CV

TL;DR: This paper introduces FastVGGT, a method that accelerates the 3D visual geometry model VGGT by leveraging token merging, achieving significant speed improvements and reduced error accumulation in long-sequence scenarios.

Details

Motivation: The motivation stems from the challenge of scaling foundation models for 3D vision to long-sequence image inputs due to inference-time inefficiency, and the observation of a token collapse phenomenon in attention maps. Method: The authors analyzed VGGT, a state-of-the-art feed-forward visual geometry model, identified its bottlenecks, and proposed FastVGGT, which uses a training-free token merging mechanism tailored to 3D architectures and tasks. Result: FastVGGT achieved a 4x speedup over VGGT with 1000 input images while mitigating error accumulation in long-sequence scenarios, as validated by extensive experiments on multiple 3D geometry benchmarks. Conclusion: The study concludes that token merging can serve as a principled solution for scalable 3D vision systems, as demonstrated by the effectiveness of FastVGGT in accelerating VGGT without compromising its reconstruction capacity. Abstract: Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence image inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectural and task-specific properties of 3D models, directly applying existing merging techniques proves challenging. To this end, we propose FastVGGT, which, for the first time, leverages token merging in the 3D domain through a training-free mechanism for accelerating VGGT. We devise a unique token partitioning strategy tailored to 3D architectures and tasks, effectively eliminating redundant computation while preserving VGGT's powerful reconstruction capacity. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, with 1000 input images, FastVGGT achieves a 4x speedup over VGGT while mitigating error accumulation in long-sequence scenarios. These findings underscore the potential of token merging as a principled solution for scalable 3D vision systems. Code is available at: https://mystorm16.github.io/fastvggt/.
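
For readers unfamiliar with token merging, the following minimal sketch shows a generic similarity-based bipartite merge: tokens are split into two sets, each source token is matched to its most similar destination token, and the `r` best-matching pairs are averaged. This is not FastVGGT's 3D-specific partitioning strategy, which the abstract does not describe; the alternating split and all names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def merge_tokens(tokens, r):
    """Merge the r most redundant source tokens into their nearest destination.

    Expects at least two tokens. Output order is not preserved: surviving
    source tokens come first, followed by the (possibly averaged) destinations.
    """
    src = tokens[0::2]  # even positions: candidates to be merged away
    dst = tokens[1::2]  # odd positions: tokens that are always kept
    matches = []
    for i, s in enumerate(src):
        sims = [cosine(s, d) for d in dst]
        j = max(range(len(dst)), key=lambda k: sims[k])
        matches.append((sims[j], i, j))
    matches.sort(reverse=True)  # most similar pairs first
    merged_away = {i: j for _, i, j in matches[:r]}
    sums = [list(d) for d in dst]
    counts = [1] * len(dst)
    for i, j in merged_away.items():
        sums[j] = [a + b for a, b in zip(sums[j], src[i])]
        counts[j] += 1
    kept_src = [s for i, s in enumerate(src) if i not in merged_away]
    merged_dst = [[a / c for a in vec] for vec, c in zip(sums, counts)]
    return kept_src + merged_dst

tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
reduced = merge_tokens(tokens, r=2)
```

Each merge removes one token from the sequence, so attention cost falls as `r` grows; the trade-off is that averaged tokens lose some detail, which is why only the most similar pairs are merged.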

Table of Contents

cs.CL [Back]

[1] MultiStream-LLM: Bridging Modalities for Robust Sign Language Translation

[2] Compiling Prompts, Not Crafting Them: A Reproducible Workflow for AI-Assisted Evidence Synthesis

[3] What Are Research Hypotheses?

[4] Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics

[5] The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs

[6] The Differential Meaning of Models: A Framework for Analyzing the Structural Consequences of Semantic Modeling Decisions

[7] The Temporal Game: A New Perspective on Temporal Relation Extraction

[8] Exploring Reasoning-Infused Text Embedding with Large Language Models for Zero-Shot Dense Retrieval

[9] OpinioRAG: Towards Generating User-Centric Opinion Highlights from Large-scale Online Reviews

[10] Wage Sentiment Indices Derived from Survey Comments via Large Language Models

[11] Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models

[12] GIER: Gap-Driven Self-Refinement for Large Language Models

[13] Open Data Synthesis For Deep Research

[14] GraphKV: Breaking the Static Selection Paradigm with Graph-Based KV Cache Eviction

[15] The Resurgence of GCG Adversarial Attacks on Large Language Models

[16] MedSEBA: Synthesizing Evidence-Based Answers Grounded in Evolving Medical Literature

[17] The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang

[18] GOSU: Retrieval-Augmented Generation with Global-Level Optimized Semantic Unit-Centric Framework

[19] CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning

[20] TECP: Token-Entropy Conformal Prediction for LLMs

[21] Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting

[22] ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics

[23] Entropy-based Coarse and Compressed Semantic Speech Representation Learning

[24] Modeling Motivated Reasoning in Law: Evaluating Strategic Role Conditioning in LLM Summarization

[25] Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs

[26] StealthEval: A Probe-Rewrite-Evaluate Workflow for Reliable Benchmarks

[27] Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling

[28] A Multi-Strategy Approach for AI-Generated Text Detection

[29] Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?

[30] Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech

[31] Router Upcycling: Leveraging Mixture-of-Routers in Mixture-of-Experts Upcycling

[32] Do small language models generate realistic variable-quality fake news headlines?

[33] Text Reinforcement for Multimodal Time Series Forecasting

[34] CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

[35] Learning to Shop Like Humans: A Review-driven Retrieval-Augmented Recommendation Framework with LLMs

[36] Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs

[37] Designing LMS and Instructional Strategies for Integrating Generative-Conversational AI

[38] LLM Encoder vs. Decoder: Robust Detection of Chinese AI-Generated Text with LoRA

[39] Decomposing and Revising What Language Models Generate

[40] LegalChainReasoner: A Legal Chain-guided Framework for Criminal Judicial Opinion Generation

[41] CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA

[42] TMT: A Simple Way to Translate Topic Models Using Dictionaries

[43] Neural Models and Language Model Prompting for the Multidimensional Evaluation of Open-Ended Conversations

[44] Negative Matters: Multi-Granularity Hard-Negative Synthesis and Anchor-Token-Aware Pooling for Enhanced Text Embeddings

[45] Prompting Away Stereotypes? Evaluating Bias in Text-to-Image Models for Occupations

[46] Exploring and Mitigating Fawning Hallucinations in Large Language Models

[47] EviNote-RAG: Enhancing RAG Models via Answer-Supportive Evidence Notes

[48] SeLeRoSa: Sentence-Level Romanian Satire Detection Dataset

[49] Supervised In-Context Fine-Tuning for Generative Sequence Labeling

[50] MedCOD: Enhancing English-to-Spanish Medical Translation of Large Language Models Using Enriched Chain-of-Dictionary Framework

[51] Structure and Destructure: Dual Forces in the Making of Knowledge Engines

[52] RPRO:Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

[53] Performance Analysis of Supervised Machine Learning Algorithms for Text Classification

[54] Ranking of Bangla Word Graph using Graph-based Ranking Algorithms

[55] We Politely Insist: Your LLM Must Learn the Persian Art of Taarof

[56] A Dynamic Fusion Model for Consistent Crisis Response

[57] Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL

[58] Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation

[59] A Paradigm Gap in Urdu

[60] Privacy-Preserving Reasoning with Knowledge-Distilled Parametric Retrieval Augmented Generation

[61] REFRAG: Rethinking RAG based Decoding

[62] Natural Context Drift Undermines the Natural Language Understanding of Large Language Models

[63] Dream-Coder 7B: An Open Diffusion Language Model for Code

[64] Zero-shot Cross-lingual NER via Mitigating Language Difference: An Entity-aligned Translation Perspective

[65] Joint Information Extraction Across Classical and Modern Chinese with Tea-MOELoRA

[66] Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning

[67] Modular Techniques for Synthetic Long-Context Data Generation in Language Model Training and Evaluation

[68] Statutory Construction and Interpretation for Artificial Intelligence

[69] Efficient Large Language Models with Zero-Shot Adjustable Acceleration

[70] SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

[71] Mitigating Catastrophic Forgetting in Continual Learning through Model Growth

[72] DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Taks Based on Data and Model Compression

[73] Rethinking the Chain-of-Thought: The Roles of In-Context Learning and Pre-trained Priors

[74] Annotation and modeling of emotions in a textual corpus: an evaluative approach

[75] Culture is Everywhere: A Call for Intentionally Cultural Evaluation

[76] TableZoomer: A Collaborative Agent Framework for Large-scale Table Question Answering

[77] Can Smaller LLMs do better? Unlocking Cross-Domain Potential through Parameter-Efficient Fine-Tuning for Text Summarization

[78] LongCat-Flash Technical Report