Table of Contents
cs.CL
[1] Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora
Stefanie Urchs,Veronika Thurner,Matthias Aßenmacher,Christian Heumann,Stephanie Thiemichen
Main category: cs.CL
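TL;DR: The paper proposes a new method for mitigating gender discrimination in large-scale text corpora and applies it to a corpus of German newspaper articles, showing a substantial improvement in gender balance.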
Details
Motivation: The paper addresses the structural gender imbalances reflected in large language model outputs, which originate from their training data.
Method: The paper proposes an extended actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora.
Result: Gender balance improves substantially across multiple linguistic dimensions, but some subtle biases remain.
Conclusion: Surface-level gender asymmetries can be mitigated through filtering and rebalancing, but subtler biases in sentiment and framing persist.
Abstract: Large language models are increasingly shaping digital communication, yet their outputs often reflect structural gender imbalances that originate from their training data. This paper presents an extended actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. Building on prior work in discourse-aware fairness analysis, we introduce new actor-level metrics that capture asymmetries in sentiment, syntactic agency, and quotation styles. The pipeline supports both diagnostic corpus analysis and exclusion-based balancing, enabling the construction of fairer corpora. We apply our approach to the taz2024full corpus of German newspaper articles from 1980 to 2024, demonstrating substantial improvements in gender balance across multiple linguistic dimensions. Our results show that while surface-level asymmetries can be mitigated through filtering and rebalancing, subtler forms of bias persist, particularly in sentiment and framing. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.
[2] MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Shilong Li,Xingyuan Bu,Wenjie Wang,Jiaheng Liu,Jun Dong,Haoyang He,Hao Lu,Haozhe Zhang,Chenchen Jing,Zhen Li,Chuanhao Li,Jiayi Tian,Chenchen Zhang,Tianhao Peng,Yancheng He,Jihao Gu,Yuanxing Zhang,Jian Yang,Ge Zhang,Wenhao Huang,Wangchunshu Zhou,Zhaoxiang Zhang,Ruizhe Ding,Shilei Wen
Main category: cs.CL
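TL;DR: This paper introduces MM-BrowseComp, a new benchmark for evaluating the retrieval and reasoning abilities of AI agents in multimodal settings; results show that current models still have considerable room for improvement.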
Details
Motivation: Existing benchmarks such as BrowseComp focus mainly on textual information and overlook the prevalence of multimodal content, so a new benchmark is needed to evaluate agents' reasoning and retrieval in multimodal settings.
Method: The authors build MM-BrowseComp, a benchmark of 224 hand-crafted questions for evaluating agents' multimodal retrieval and reasoning, and provide a verified checklist for each question to support fine-grained analysis of multimodal dependencies and reasoning paths.
Result: A comprehensive evaluation of state-of-the-art models shows that even top models with tools, such as OpenAI o3, reach only 29.02% accuracy.
Conclusion: MM-BrowseComp highlights the shortcomings of current models in multimodal capability and native multimodal reasoning, indicating that improvement is needed in these areas.
Abstract: AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.
[3] Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT
Zeeshan Ahmed,Frank Seide,Niko Moritz,Ju Lin,Ruiming Xie,Simone Merello,Zhe Liu,Christian Fuegen
Main category: cs.CL
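TL;DR: This paper studies real-time, on-device streaming speech translation and proposes a simultaneous translation strategy that combines automatic speech recognition with machine translation, using linguistic cues and efficient search pruning to improve translation quality and latency.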
Details
Motivation: The paper addresses the challenges of integrating Automatic Speech Recognition (ASR) and Machine Translation (MT) for real-time, on-device streaming speech translation, in particular the difficulty of achieving streaming translation in real time.
Method: It proposes a simultaneous translation approach that combines ASR and MT, uses linguistic cues generated by the ASR system to manage context, and applies efficient beam-search pruning techniques such as time-out and forced finalization to maintain the system's real-time factor.
Result: Applied to on-device bilingual conversational speech translation, the approach outperforms baselines in latency and quality and markedly narrows the quality gap with non-streaming translation systems.
Conclusion: The proposed simultaneous translation approach strikes an effective balance between latency and translation quality, paving the way for more accurate and efficient real-time speech translation.
Abstract: This paper tackles several challenges that arise when integrating Automatic Speech Recognition (ASR) and Machine Translation (MT) for real-time, on-device streaming speech translation. Although state-of-the-art ASR systems based on Recurrent Neural Network Transducers (RNN-T) can perform real-time transcription, achieving streaming translation in real-time remains a significant challenge. To address this issue, we propose a simultaneous translation approach that effectively balances translation quality and latency. We also investigate efficient integration of ASR and MT, leveraging linguistic cues generated by the ASR system to manage context and utilizing efficient beam-search pruning techniques such as time-out and forced finalization to maintain the system's real-time factor. We apply our approach to on-device bilingual conversational speech translation and demonstrate that our techniques outperform baselines in terms of latency and quality. Notably, our technique narrows the quality gap with non-streaming translation systems, paving the way for more accurate and efficient real-time speech translation.
[4] Stands to Reason: Investigating the Effect of Reasoning on Idiomaticity Detection
Dylan Phelps,Rodrigo Wilkens,Edward Gow-Smith,Thomas Pickard,Maggie Mi,Aline Villavicencio
Main category: cs.CL
TL;DR: This paper investigates how reasoning abilities and model size in LLMs affect idiomaticity detection, finding that larger models perform better and understand idioms more accurately, while smaller models benefit less from reasoning and require additional prompting strategies.
Details
Motivation: Idiomaticity detection can benefit from reasoning models, as understanding potentially idiomatic expressions requires logical disambiguation and reasoning.
Method: The study evaluates the DeepSeek-R1 distillation models (ranging from 1.5B to 70B parameters) across four idiomaticity detection datasets, analyzing the impact of reasoning capabilities and model size.
Result: Chain-of-thought reasoning improves performance for smaller models but not to the level of base models; larger models show modest improvements and better understanding of idiomatic expressions, with some success in improving smaller model performance through definition prompts.
Conclusion: Reasoning capabilities in LLMs have a varied and modest effect on idiomaticity detection, with larger models performing better and showing a better understanding of idiomatic expressions.
Abstract: The recent trend towards utilisation of reasoning models has improved the performance of Large Language Models (LLMs) across many tasks which involve logical steps. One linguistic task that could benefit from this framing is idiomaticity detection, as a potentially idiomatic expression must first be understood before it can be disambiguated and serves as a basis for reasoning. In this paper, we explore how reasoning capabilities in LLMs affect idiomaticity detection performance and examine the effect of model size. We evaluate, as open source representative models, the suite of DeepSeek-R1 distillation models ranging from 1.5B to 70B parameters across four idiomaticity detection datasets. We find the effect of reasoning to be smaller and more varied than expected. For smaller models, producing chain-of-thought (CoT) reasoning increases performance from Math-tuned intermediate models, but not to the levels of the base models, whereas larger models (14B, 32B, and 70B) show modest improvements.
Our in-depth analyses reveal that larger models demonstrate good understanding of idiomaticity, successfully producing accurate definitions of expressions, while smaller models often fail to output the actual meaning. For this reason, we also experiment with providing definitions in the prompts of smaller models, which we show can improve performance in some cases.
[5] Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts
Duygu Altinok
Main category: cs.CL
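TL;DR: By distilling contextual knowledge from LLaMA models into Whisper, this study proposes an improved ASR system that substantially improves syntactic and semantic accuracy in long audio transcription.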
Details
Motivation: ASR systems often struggle with syntactic and semantic accuracy on long audio, which affects tasks such as named entity recognition, capitalization, and punctuation.
Method: The study proposes two strategies: (1) token-level distillation with optimal transport to align dimensions and sequence lengths, and (2) minimizing a representation loss between Whisper and LLaMA sentence embeddings, blending syntax and semantics.
Result: Evaluation on the Spoken Wikipedia dataset shows significant improvements in word error rate, named entity recognition, capitalization, and punctuation success.
Conclusion: By introducing new NER metrics and exploring semantics-aware ASR, the study highlights the value of integrating linguistic context into transcription, laying a foundation for robust, context-aware ASR on long-form speech.
Abstract: ASR systems often struggle with maintaining syntactic and semantic accuracy in long audio transcripts, impacting tasks like Named Entity Recognition (NER), capitalization, and punctuation. We propose a novel approach that enhances ASR by distilling contextual knowledge from LLaMA models into Whisper. Our method uses two strategies: (1) token-level distillation with optimal transport to align dimensions and sequence lengths, and (2) representation loss minimization between sentence embeddings of Whisper and LLaMA, blending syntax and semantics. Evaluations on the Spoken Wikipedia dataset, a benchmark with long audios and rich entities, demonstrate significant improvements in Word Error Rate (WER), NER, capitalization, and punctuation success. By introducing novel NER metrics and exploring semantics-aware ASR, our work highlights the value of integrating linguistic context into transcription, setting a foundation for robust, context-aware ASR in long-form speech.
[6] Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis
Ayoub Ben Chaliah,Hela Dellagi
Main category: cs.CL
TL;DR: Datarus-R1-14B is a fine-tuned 14B-parameter model designed to act as a virtual data analyst and problem solver. It outperforms similar models in accuracy while using fewer tokens, thanks to its unique training on analytical trajectories and optimized reinforcement learning approach.
Details
Motivation: The motivation behind Datarus is to create a model that avoids the verbosity and format collapse seen in RL-aligned LLMs, by training on full analytical trajectories rather than isolated question-answer pairs, enabling more efficient and effective problem-solving.
Method: Datarus was trained using a trajectory-centric synthetic data generator, a dual-reward framework combining a tag-based structural signal with a Hierarchical Reward Model (HRM), and a memory-optimized implementation of Group Relative Policy Optimization (GRPO). It also employed a cosine curriculum to balance structural fidelity and semantic depth.
Result: Datarus-R1-14B achieved up to 30% higher accuracy on AIME 2024/2025 and LiveCodeBench compared to similar-sized models, and even matched larger models like QwQ-32B. It also emitted 18-49% fewer tokens per solution, showing greater efficiency.
Conclusion: Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen 2.5-14B-Instruct, acts as a virtual data analyst and graduate-level problem solver. It surpasses similarly-sized models and competes with larger reasoning models in accuracy while emitting fewer tokens per solution.
Abstract: We present Datarus-R1-14B, a 14B-parameter open-weights language model fine-tuned from Qwen 2.5-14B-Instruct to act as a virtual data analyst and graduate-level problem solver. Datarus is trained not on isolated question-answer pairs but on full analytical trajectories including reasoning steps, code execution, error traces, self-corrections, and final conclusions, all captured in a ReAct-style notebook format spanning finance, medicine, numerical analysis, and other quantitative domains.
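The cosine curriculum mentioned in the Method summary can be pictured as a schedule that moves weight from the structural reward to the semantic reward over training. The sketch below is only an illustration under stated assumptions: the half-cosine form, the names `curriculum_weights` and `blended_reward`, and the linear blend of the two reward signals are all hypothetical and are not taken from the paper.

```python
import math


def curriculum_weights(step, total_steps):
    """Return (structural_weight, semantic_weight) for a training step.

    Assumed schedule: a half cosine that starts at 1.0 for the structural
    signal and decays to 0.0, with the semantic weight as its complement.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    w_struct = 0.5 * (1.0 + math.cos(math.pi * progress))
    return w_struct, 1.0 - w_struct


def blended_reward(tag_reward, hrm_reward, step, total_steps):
    """Blend a tag-based structural reward with an HRM-style semantic reward.

    Both inputs are plain floats here; how they are produced is out of scope.
    """
    w_struct, w_sem = curriculum_weights(step, total_steps)
    return w_struct * tag_reward + w_sem * hrm_reward
```

Under this assumed schedule, early steps reward well-formed tags almost exclusively, and late steps are driven almost entirely by the semantic score.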
Our training pipeline combines (i) a trajectory-centric synthetic data generator that yielded 144,000 tagged notebook episodes, (ii) a dual-reward framework blending a lightweight tag-based structural signal with a Hierarchical Reward Model (HRM) that scores both single-step soundness and end-to-end coherence, and (iii) a memory-optimized implementation of Group Relative Policy Optimization (GRPO) featuring KV-cache reuse, sequential generation, and reference-model sharding. A cosine curriculum smoothly shifts emphasis from structural fidelity to semantic depth, reducing the format collapse and verbosity that often plague RL-aligned LLMs. A central design choice in Datarus is its dual reasoning interface. In agentic mode the model produces ReAct-tagged steps that invoke Python tools to execute real code; in reflection mode it outputs compact Chain-of-Thought (CoT) traces delimited by
[7] ALIGN: Word Association Learning for Cross-Cultural Generalization in Large Language Models
Chunhua Liu,Kabir Manandhar Shrestha,Sukai Huang
Main category: cs.CL
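TL;DR: Using parameter-efficient fine-tuning on native speakers' word-association data, this study shifts the cultural bias of large language models, improving their cultural adaptation without costly retraining.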
Details
Motivation: The behavior of large language models (LLMs) in cross-cultural communication is still shaped by the distributional bias of the languages and viewpoints in their pre-training corpora, while modeling and aligning culture is hampered by limited cultural knowledge and a lack of effective learning approaches. A new method is therefore needed to improve models' cultural adaptation.
Method: The paper adapts Llama-3.1-8B and Qwen-2.5-7B with supervised fine-tuning (SFT) and PPO-based preference optimization, using the English and Mandarin word-association datasets from the Small-World-of-Words project.
Result: SFT raises association precision by 16-20% in English and 43-165% in Mandarin, lifts median concreteness by 0.20, and attains human-level valence and arousal. On World-Values-Survey questions the fine-tuned models shift toward the target culture: Qwen's Chinese-aligned responses double, while Llama's US bias drops by one-third.
Conclusion: The paper proposes a cost-efficient, cognitively grounded method that uses native speakers' free word-association norms to adjust the cultural bias of LLMs and achieve more effective cultural alignment. The results indicate that this approach can substantially improve cultural adaptation without costly retraining.
Abstract: As large language models (LLMs) increasingly mediate cross-cultural communication, their behavior still reflects the distributional bias of the languages and viewpoints that are over-represented in their pre-training corpora. Yet, it remains a challenge to model and align culture due to limited cultural knowledge and a lack of exploration into effective learning approaches. We introduce a cost-efficient, cognitively grounded remedy: parameter-efficient fine-tuning on native speakers' free word-association norms, which encode implicit cultural schemas. Leveraging English-US and Mandarin associations from the Small-World-of-Words project, we adapt Llama-3.1-8B and Qwen-2.5-7B via supervised fine-tuning (SFT) and PPO-based preference optimization. SFT boosts held-out association Precision at 5 by 16-20% in English and 43-165% in Mandarin, lifts median concreteness by +0.20, and attains human-level valence and arousal. These lexical gains transfer: on World-Values-Survey questions, fine-tuned models shift answer distributions toward the target culture, and on a 50-item high-tension subset, Qwen's Chinese-aligned responses double while Llama's US bias drops by one-third. Our 7-8B models rival or beat vanilla 70B baselines, showing that a few million culture-grounded associations can instill value alignment without costly retraining.
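The Precision at 5 figure quoted above can be read as the share of a model's top five associations for a cue word that human respondents also produced. The sketch below uses that standard definition as an assumption; the paper's exact matching rules (normalization, tie handling) are not given in the abstract, and the function name and example words are hypothetical.

```python
def precision_at_k(predicted, human_associations, k=5):
    """Fraction of the top-k predicted associations found in the human set.

    predicted: list of words ranked by the model, best first.
    human_associations: set of words that human respondents gave for the cue.
    Divides by k, so a model that returns fewer than k words is penalized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for word in predicted[:k] if word in human_associations)
    return hits / k


# Hypothetical cue "dog": two of the five model outputs match human responses.
model_output = ["cat", "bone", "tree", "bark", "car"]
human_set = {"cat", "bark", "pet", "puppy"}
score = precision_at_k(model_output, human_set)  # 2 hits out of 5
```

Averaging this score over all held-out cue words gives a corpus-level figure that can be compared before and after fine-tuning.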
Our work highlights both the promise and the need for future research grounded in human cognition in improving cultural alignment in AI models.
[8] ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs
Hongxin Ding,Baixiang Huang,Yue Fang,Weibin Liao,Xinke Jiang,Zheng Li,Junfeng Zhao,Yasha Wang
Main category: cs.CL
TL;DR: ProMed introduces a reinforcement learning framework for medical LLMs, enabling proactive questioning to improve diagnostic accuracy in interactive clinical settings.
Details
Motivation: Medical Large Language Models (LLMs) predominantly operate under a reactive paradigm, which risks incorrect diagnoses in interactive clinical settings. This limitation motivates the need for a proactive approach where models can gather more information through clinically valuable questions.
Method: The proposed method, ProMed, utilizes a Shapley Information Gain (SIG) reward integrated into a two-stage training pipeline: (1) SIG-Guided Model Initialization using Monte Carlo Tree Search (MCTS) and (2) SIG-Augmented Policy Optimization with a SIG-guided Reward Distribution Mechanism.
Result: Experiments on newly curated partial-information medical benchmarks show that ProMed outperforms state-of-the-art methods by an average of 6.29% and achieves a 54.45% improvement over the reactive paradigm, while also generalizing well to out-of-domain cases.
Conclusion: ProMed is a reinforcement learning framework that shifts medical LLMs from a reactive to a proactive paradigm by enabling them to ask clinically valuable questions before making decisions, significantly outperforming existing methods.
Abstract: Interactive medical questioning is essential in real-world clinical consultations, where physicians must actively gather information from patients. While medical Large Language Models (LLMs) have shown impressive capabilities in static medical question answering, they predominantly operate under a reactive paradigm: generating answers directly without seeking additional information, which risks incorrect diagnoses in such interactive settings. To address this limitation, we propose ProMed, a reinforcement learning (RL) framework that transitions medical LLMs toward a proactive paradigm, equipping them with the ability to ask clinically valuable questions before decision-making.
At the core of ProMed is the Shapley Information Gain (SIG) reward, which quantifies the clinical utility of each question by combining the amount of newly acquired information with its contextual importance, estimated via Shapley values. We integrate SIG into a two-stage training pipeline: (1) SIG-Guided Model Initialization, which uses Monte Carlo Tree Search (MCTS) to construct high-reward interaction trajectories to supervise the model, and (2) SIG-Augmented Policy Optimization, which integrates SIG and enhances RL with a novel SIG-guided Reward Distribution Mechanism that assigns higher rewards to informative questions for targeted optimization. Extensive experiments on two newly curated partial-information medical benchmarks demonstrate that ProMed significantly outperforms state-of-the-art methods by an average of 6.29% and delivers a 54.45% gain over the reactive paradigm, while also generalizing robustly to out-of-domain cases.
[9] Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation
Hassan Barmandah
Main category: cs.CL
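TL;DR: This study LoRA-tunes ALLaM-7B-Instruct-preview; the Dialect-Token training variant substantially raises the share of Saudi dialect output and reduces Modern Standard Arabic leakage, and the tuned models outperform generic instruction models in dialect control and text fidelity.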
Details
Motivation: Current large language models (LLMs) for Arabic are dominated by Modern Standard Arabic (MSA) and offer limited support for Saudi dialects such as Najdi and Hijazi, which hinders their ability to capture authentic dialectal variation.
Method: The paper LoRA-tunes ALLaM-7B-Instruct-preview on a privately curated Saudi Dialect Instruction dataset. Two variants are studied: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation combines an external dialect classifier, text fidelity metrics (chrF++ and BERTScore), and diversity measures.
Result: The Dialect-Token model achieves the best dialect control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%. Text fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models in dialect control and fidelity and avoid the metadata-tag echoing that these models frequently exhibit.
Conclusion: With Dialect-Token training, ALLaM-7B-Instruct-preview performs best at Saudi dialect generation, substantially raising the Saudi dialect rate and reducing MSA leakage. Both LoRA variants also outperform generic instruction models in dialect control and text fidelity while avoiding the metadata-tag echoing common in those models.
Abstract: Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding metadata-tag echoing that these baselines frequently exhibit.
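The difference between the two training variants comes down to how each instruction is formatted before fine-tuning. The sketch below illustrates that idea only: the tag syntax `<DIALECT=...>`, the function name `format_example`, and the prompt/completion field names are assumptions for illustration, since the abstract does not give the actual tag format or data schema.

```python
def format_example(instruction, response, dialect=None):
    """Build one training record for instruction tuning.

    With a dialect (e.g. "Najdi" or "Hijazi") the record follows the
    Dialect-Token idea and prepends an explicit tag to the instruction.
    With dialect=None it follows the No-Token idea and leaves the
    instruction untouched. The tag syntax is a placeholder.
    """
    tag = f"<DIALECT={dialect}> " if dialect else ""
    return {"prompt": f"{tag}{instruction}", "completion": response}


# Same pair formatted both ways (placeholder text, not from the dataset).
with_tag = format_example("Describe your morning.", "...", dialect="Hijazi")
without_tag = format_example("Describe your morning.", "...")
```

At inference time the same tag can be supplied to request a specific dialect, which is what makes the tagged variant controllable.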
We do not release the dataset or any model weights/adapters; instead, we release training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) to support independent verification.
[10] MATA (māta): Mindful Assessment of the Telugu Abilities of Large Language Models
Chalamalasetti Kranti,Sowmya Vajjala
Main category: cs.CL
TL;DR: This paper introduces MATA, a new dataset for evaluating Telugu language LLMs, providing insights into model performance and reliability in low-resource settings.
Details
Motivation: To assess the ability of Large Language Models in Telugu and understand their limitations, especially in low-resource language settings.
Method: The authors created the MATA dataset and evaluated 11 open-weight and closed-source LLMs, analyzing performance and comparing LLM-as-a-judge evaluation with human evaluation.
Result: The authors created 729 curated questions, analyzed the performance of 11 LLMs, and drew insights on model reliability and dependence on superficial heuristics.
Conclusion: MATA serves as a foundation for future research in Telugu NLP and highlights the importance of fine-grained evaluation to understand model limitations.
Abstract: In this paper, we introduce MATA, a novel evaluation dataset to assess the ability of Large Language Models (LLMs) in the Telugu language, comprising 729 carefully curated multiple-choice and open-ended questions that span diverse linguistic dimensions. We evaluate 11 open-weight and closed-source LLMs on our dataset and present a fine-grained analysis of their performance. Further, we empirically show how LLMs rely on superficial heuristics such as answer position and distractor patterns for multiple-choice questions. Finally, we also compare LLM-as-a-judge evaluation with human evaluation for open-ended questions and draw some conclusions on its reliability in a low-resource language. We argue that such fine-grained evaluation is essential for understanding model limitations and can inform the development of more linguistically capable LLMs, while also serving as a foundation for future research in Telugu NLP.
[11] Compressed Models are NOT Trust-equivalent to Their Large Counterparts
Rohit Raj Rai,Chirag Kothari,Siddhesh Shelke,Amit Awekar
Main category: cs.CL
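TL;DR: The study finds that compressed models are not trust-equivalent to their original large counterparts, even when their performance is similar.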
Details
Motivation: Existing work has studied the effect of compression on accuracy and related performance measures, but performance parity does not guarantee trust-equivalence.
Method: The authors propose a two-dimensional framework for evaluating trust-equivalence, comprising interpretability alignment and calibration similarity.
Result: Experiments reveal low interpretability alignment and a significant mismatch in calibration similarity, even when model accuracies are nearly identical.
Conclusion: Compressed models cannot simply stand in for large models where trust is concerned; deploying them requires assessment that goes beyond performance parity.
Abstract: Large Deep Learning models are often compressed before being deployed in a resource-constrained environment. Can we trust the prediction of compressed models just as we trust the prediction of the original large model? Existing work has keenly studied the effect of compression on accuracy and related performance measures. However, performance parity does not guarantee trust-equivalence. We propose a two-dimensional framework for trust-equivalence evaluation. First, interpretability alignment measures whether the models base their predictions on the same input features. We use LIME and SHAP tests to measure the interpretability alignment. Second, calibration similarity measures whether the models exhibit comparable reliability in their predicted probabilities. It is assessed via ECE, MCE, Brier Score, and reliability diagrams. We conducted experiments using BERT-base as the large model and its multiple compressed variants. We focused on two text classification tasks: natural language inference and paraphrase identification. Our results reveal low interpretability alignment and significant mismatch in calibration similarity. It happens even when the accuracies are nearly identical between models. These findings show that compressed models are not trust-equivalent to their large counterparts. Deploying compressed models as a drop-in replacement for large models requires careful assessment, going beyond performance parity.
[12] A Comparative Study of Decoding Strategies in Medical Text Generation
Oriana Presacan,Alireza Nik,Vajira Thambawita,Bogdan Ionescu,Michael Riegler
Main category: cs.CL
TL;DR: This study demonstrates that decoding strategies significantly impact output quality in medical LLM applications, sometimes more than model choice, with deterministic methods like beam search yielding the best results.
Details
Motivation: Decoding strategies significantly affect the output quality of Large Language Models (LLMs), but their impact remains underexplored in critical domains like healthcare.
Method: The study evaluates 11 decoding strategies using medically specialized and general-purpose LLMs across five medical tasks: translation, summarization, question answering, dialogue, and image captioning.
Result: Deterministic decoding strategies, such as beam search, generally outperformed stochastic ones. Larger models performed better overall but were slower and no more robust to decoding methods. Medical LLMs showed no overall performance advantage over general-purpose models and were more sensitive to decoding choices.
Conclusion: The selection of decoding methods is crucial in medical applications, as its impact can sometimes surpass that of model choice. Deterministic strategies like beam search are preferable for better output quality.
Abstract: Large Language Models (LLMs) rely on various decoding strategies to generate text, and these choices can significantly affect output quality. In healthcare, where accuracy is critical, the impact of decoding strategies remains underexplored. We investigate this effect in five open-ended medical tasks, including translation, summarization, question answering, dialogue, and image captioning, evaluating 11 decoding strategies with medically specialized and general-purpose LLMs of different sizes. Our results show that deterministic strategies generally outperform stochastic ones: beam search achieves the highest scores, while η and top-k sampling perform worst. Slower decoding methods tend to yield better quality. Larger models achieve higher scores overall but have longer inference times and are no more robust to decoding. Surprisingly, while medical LLMs outperform general ones in two of the five tasks, statistical analysis shows no overall performance advantage and reveals greater sensitivity to decoding choice.
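The contrast between deterministic and stochastic decoding can be made concrete at the level of a single next-token choice. The sketch below is a toy illustration, not the study's setup: it works on a hand-written list of logits, uses greedy selection as the simplest deterministic strategy (beam search itself is not implemented here), and the function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Deterministic choice: always return the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample(logits, k, rng):
    """Stochastic choice: sample among the k highest-scoring indices,
    weighted by their renormalized softmax probabilities."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]


toy_logits = [0.1, 2.0, 1.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)              # seeded so repeated runs are reproducible
deterministic = greedy_pick(toy_logits)
stochastic = top_k_sample(toy_logits, k=2, rng=rng)
```

Repeated calls to `greedy_pick` always return the same token, while `top_k_sample` can return either of the two best tokens, which is the source of run-to-run variation in sampled outputs.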
We further compare multiple evaluation metrics and find that correlations vary by task, with MAUVE showing weak agreement with BERTScore and ROUGE, as well as greater sensitivity to the decoding strategy. These results highlight the need for careful selection of decoding methods in medical applications, as their influence can sometimes exceed that of model choice.
[13] Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM
Dariia Puhach,Amir H. Payberah,Éva Székely
Main category: cs.CL
TL;DR: The study explores gender bias in Speech-LLMs using speaker assignment as a bias cue, analyzing Bark's responses to gender-stereotyped prompts.
Details
Motivation: To investigate if gender bias in text-based Large Language Models extends to Speech-LLMs, using speaker assignment as an explicit bias cue.
Method: Constructed two datasets (Professions and Gender-Colored Words) to analyze Bark's speaker assignments for textual prompts.
Result: Bark's speaker selection does not show systematic bias, but it does demonstrate gender awareness and some gender inclinations.
Conclusion: Bark does not exhibit systematic gender bias but demonstrates gender awareness and some inclinations.
Abstract: Similar to text-based Large Language Models (LLMs), Speech-LLMs exhibit emergent abilities and context awareness. However, whether these similarities extend to gender bias remains an open question. This study proposes a methodology leveraging speaker assignment as an analytic tool for bias investigation. Unlike text-based models, which encode gendered associations implicitly, Speech-LLMs must produce a gendered voice, making speaker selection an explicit bias cue. We evaluate Bark, a Text-to-Speech (TTS) model, analyzing its default speaker assignments for textual prompts. If Bark's speaker selection systematically aligns with gendered associations, it may reveal patterns in its training data or model design. To test this, we construct two datasets: (i) Professions, containing gender-stereotyped occupations, and (ii) Gender-Colored Words, featuring gendered connotations. While Bark does not exhibit systematic bias, it demonstrates gender awareness and has some gender inclinations.
[14] AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings
Haoxuan Li,Wei Song,Aofan Liu,Peiwu Qin
Main category: cs.CL
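TL;DR: This study proposes AdaDocVQA, a unified adaptive framework that addresses the challenges Document VQA faces when processing long documents in low-resource environments.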
Details
Motivation: Document VQA faces significant challenges when processing long documents in low-resource environments, owing to context limitations and insufficient training data.
Method: The paper proposes a unified adaptive framework comprising a hybrid text retrieval architecture, an intelligent data augmentation pipeline, and adaptive ensemble inference.
Result: On JDocQA, accuracy reaches 83.04% on Yes/No questions, 52.66% on factual questions, and 44.12% on numerical questions; accuracy on the LAVA dataset reaches 59%.
Conclusion: With its unified adaptive framework, AdaDocVQA sets new state-of-the-art results for Japanese document VQA and provides a scalable foundation for other low-resource languages and specialized domains.
Abstract: Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answer pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration generation and early stopping mechanisms. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements with 83.04% accuracy on Yes/No questions, 52.66% on factual questions, and 44.12% on numerical questions in JDocQA, and 59% accuracy on LAVA dataset. Ablation studies confirm meaningful contributions from each component, and our framework establishes new state-of-the-art results for Japanese document VQA while providing a scalable foundation for other low-resource languages and specialized domains. Our code is available at: https://github.com/Haoxuanli-Thu/AdaDocVQA.
[15] CRISP: Persistent Concept Unlearning via Sparse Autoencoders
Tomer Ashuach,Dana Arad,Aaron Mueller,Martin Tutek,Yonatan Belinkov
Main category: cs.CL
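TL;DR: CRISP is a parameter-efficient method for persistent concept unlearning using sparse autoencoders; it precisely suppresses harmful knowledge while preserving model utility.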
Details
Motivation: As large language models (LLMs) are increasingly deployed in real-world applications, unwanted knowledge must be removed selectively while preserving model utility. Existing sparse autoencoder (SAE) methods mostly intervene at inference time and make no persistent change to model parameters, which motivates CRISP.
Method: CRISP uses sparse autoencoders for persistent concept unlearning: it automatically identifies salient SAE features across multiple layers and suppresses their activations.
Result: Experiments show that CRISP outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis shows that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
Conclusion: CRISP is an effective parameter-efficient method for persistent concept unlearning that precisely suppresses harmful knowledge while preserving model utility.
Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
[16] ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?
Vy Tuong Dang,An Vo,Quang Tau,Duc Dm,Daeyoung Kim
Main category: cs.CL
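TL;DR: This study evaluates vision language models on Vietnamese multimodal educational exams and finds their performance limited: only a few models exceed the human average, and they remain far below the best human performance.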
Details
Motivation: To explore how well vision language models handle multimodal educational content in low-resource languages such as Vietnamese, and in particular whether models trained predominantly on English data can handle cross-lingual multimodal reasoning. Method: The study proposes the ViExam benchmark, containing 2,548 multimodal questions, to evaluate VLMs on Vietnamese educational exams, and tests how cross-lingual prompting and human-in-the-loop collaboration affect model performance. Result: State-of-the-art VLMs achieve a mean accuracy of 57.74% across 7 academic domains, and open-source models 27.70%. Only o3 (74.07%) exceeds the human average (66.54%), yet it falls far short of the human best (99.60%). Cross-lingual prompting with English instructions fails to improve performance, while human-in-the-loop collaboration improves it by 5 percentage points. Conclusion: Although state-of-the-art vision language models (VLMs) perform well on English multimodal tasks, their performance on real multimodal educational content in low-resource languages such as Vietnamese remains limited. Only very few models (such as o3) exceed the human average, and a large gap to the best human performance remains. Abstract: Vision language models (VLMs) demonstrate remarkable capabilities on English multimodal tasks, but their performance on low-resource languages with genuinely multimodal educational content remains largely unexplored. In this work, we test how VLMs perform on Vietnamese educational assessments, investigating whether VLMs trained predominantly on English data can handle real-world cross-lingual multimodal reasoning. Our work presents the first comprehensive evaluation of VLM capabilities on multimodal Vietnamese exams through proposing ViExam, a benchmark containing 2,548 multimodal questions. We find that state-of-the-art VLMs achieve only 57.74% while open-source models achieve 27.70% mean accuracy across 7 academic domains, including Mathematics, Physics, Chemistry, Biology, Geography, Driving Test, and IQ Test. Most VLMs underperform average human test-takers (66.54%), with only the thinking VLM o3 (74.07%) exceeding human average performance, yet still falling substantially short of human best performance (99.60%). Cross-lingual prompting with English instructions while maintaining Vietnamese content fails to improve performance, decreasing accuracy by 1 percentage point for SOTA VLMs. Human-in-the-loop collaboration can partially improve VLM performance by 5 percentage points. Code and data are available at: https://vi-exam.github.io.
[17] Generics and Default Reasoning in Large Language Models
James Ravi Kirkpatrick,Rachel Katharine Sterken
Main category: cs.CL
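TL;DR: This paper evaluates 28 large language models on 20 defeasible reasoning patterns involving generics. Several frontier models handle default reasoning problems well, but performance varies widely across models and prompting styles and drops markedly under chain-of-thought prompting.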
Details
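Motivation: Generics (e.g., 'Birds fly', 'Ravens are black') are central to non-monotonic logic because of their complex exception-permitting behaviour and their close ties to default reasoning, cognition, and concept acquisition, and they are of special interest to linguists, philosophers, logicians, and cognitive scientists. Studying how LLMs handle these reasoning patterns therefore helps clarify their capabilities and limits in default reasoning. Method: The paper systematically evaluates 28 large language models on 20 defeasible reasoning patterns involving generics, analysing how different models and prompting styles (zero-shot, few-shot, and chain-of-thought) affect reasoning ability. Result: Some frontier models handle default reasoning problems well, but overall performance varies considerably. Chain-of-thought (CoT) prompting leads to serious degradation for some models (mean accuracy drop of 11.14%, SD 15.74%), while few-shot prompting yields only modest gains for some models. Many models struggle to distinguish defeasible from deductive inference or misinterpret generics as universal statements. Conclusion: Current large language models show both promise and clear limits in default reasoning, particularly in handling exceptions and understanding generics. Prompting methods such as CoT do not always improve reasoning and may even harm it. Abstract: This paper evaluates the capabilities of 28 large language models (LLMs) to reason with 20 defeasible reasoning patterns involving generic generalizations (e.g., 'Birds fly', 'Ravens are black') central to non-monotonic logic. Generics are of special interest to linguists, philosophers, logicians, and cognitive scientists because of their complex exception-permitting behaviour and their centrality to default reasoning, cognition, and concept acquisition. We find that while several frontier models handle many default reasoning problems well, performance varies widely across models and prompting styles. Few-shot prompting modestly improves performance for some models, but chain-of-thought (CoT) prompting often leads to serious performance degradation (mean accuracy drop -11.14%, SD 15.74% in models performing above 75% accuracy in zero-shot condition, temperature 0). Most models either struggle to distinguish between defeasible and deductive inference or misinterpret generics as universal statements. These findings underscore both the promise and limits of current LLMs for default reasoning.
[18] Prediction is not Explanation: Revisiting the Explanatory Capacity of Mapping Embeddings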
Hanna Herasimchyk,Alhassan Abdelhalim,Sören Laue,Michaela Regneri
Main category: cs.CL
TL;DR: This paper challenges the assumption that prediction accuracy implies genuine semantic knowledge in word embeddings, suggesting that such methods mainly reflect geometric similarity rather than true semantic properties.
Details
Motivation: The motivation is to understand what knowledge is implicitly encoded in deep learning models in order to improve the interpretability of AI systems. Method: The paper examines common methods to explain the knowledge encoded in word embeddings by mapping embeddings onto collections of human-interpretable semantic features, known as feature norms. Result: The paper shows that prediction accuracy alone does not reliably indicate genuine feature-based interpretability, as these methods can successfully predict even random information. Conclusion: The paper concludes that mapping embeddings onto semantic features primarily reflects geometric similarity within vector spaces rather than indicating the genuine emergence of semantic properties. Abstract: Understanding what knowledge is implicitly encoded in deep learning models is essential for improving the interpretability of AI systems. This paper examines common methods to explain the knowledge encoded in word embeddings, which are core elements of large language models (LLMs). These methods typically involve mapping embeddings onto collections of human-interpretable semantic features, known as feature norms. Prior work assumes that accurately predicting these semantic features from the word embeddings implies that the embeddings contain the corresponding knowledge. We challenge this assumption by demonstrating that prediction accuracy alone does not reliably indicate genuine feature-based interpretability. We show that these methods can successfully predict even random information, concluding that the results are predominantly determined by an algorithmic upper bound rather than meaningful semantic representation in the word embeddings. Consequently, comparisons between datasets based solely on prediction performance do not reliably indicate which dataset is better captured by the word embeddings.
Our analysis illustrates that such mappings primarily reflect geometric similarity within vector spaces rather than indicating the genuine emergence of semantic properties.
[19] EEG-MedRAG: Enhancing EEG-based Clinical Decision-Making via Hierarchical Hypergraph Retrieval-Augmented Generation
Yi Wang,Haoran Luo,Lu Meng
Main category: cs.CL
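TL;DR: The paper introduces EEG-MedRAG, a hypergraph framework for integrating and retrieving large-scale EEG data, and proposes the first cross-disease, cross-role EEG clinical QA benchmark.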
Details
Motivation: With the widespread use of electroencephalography (EEG) in neuroscience and clinical practice, efficiently retrieving and semantically interpreting large-scale, multi-source, heterogeneous EEG data has become a pressing challenge. Method: The paper proposes EEG-MedRAG, a three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph, enabling joint semantic-temporal retrieval and causal-chain diagnostic generation. Result: EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval. Conclusion: EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval, showing its potential for real-world clinical decision support. Abstract: With the widespread application of electroencephalography (EEG) in neuroscience and clinical practice, efficiently retrieving and semantically interpreting large-scale, multi-source, heterogeneous EEG data has become a pressing challenge. We propose EEG-MedRAG, a three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph, enabling joint semantic-temporal retrieval and causal-chain diagnostic generation. Concurrently, we introduce the first cross-disease, cross-role EEG clinical QA benchmark, spanning seven disorders and five authentic clinical perspectives. This benchmark allows systematic evaluation of disease-agnostic generalization and role-aware contextual understanding. Experiments show that EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval, highlighting its strong potential for real-world clinical decision support. Our data and code are publicly available at https://github.com/yi9206413-boop/EEG-MedRAG.
[20] Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA
Kaiwei Zhang,Qi Jia,Zijian Chen,Wei Sun,Xiangyang Zhu,Chunyi Li,Dandan Zhu,Guangtao Zhai
Main category: cs.CL
TL;DR: This paper introduces a framework to evaluate and mitigate sycophancy in large language models, particularly in scientific QA. The proposed method, Pressure-Tune, improves resistance to sycophancy without compromising model accuracy or responsiveness.
Details
Motivation: The motivation stems from the increasing use of large language models in domains requiring factual rigor, where sycophancy can lead to serious risks in high-stakes settings like scientific question answering. This behavior, reinforced by preference-based alignment techniques, remains underexamined in factual QA contexts. Method: A unified evaluation framework was introduced to quantify the impact of sycophantic context on model behavior in scientific QA. Adversarial prompting setups and targeted metrics were used, alongside the development of Pressure-Tune, a lightweight post-training method involving synthetic adversarial dialogues and chain-of-thought rationales. Result: Systematic evaluations revealed pervasive sycophantic tendencies across open-source and proprietary models. Experiments showed that Pressure-Tune significantly enhances sycophancy resistance while maintaining accuracy and responsiveness to valid feedback. Conclusion: The study concludes that sycophantic tendencies in large language models are widespread and primarily influenced by alignment strategies rather than model size. The proposed method, Pressure-Tune, effectively enhances sycophancy resistance without sacrificing accuracy or responsiveness. Abstract: Large language models (LLMs), while increasingly used in domains requiring factual rigor, often display a troubling behavior: sycophancy, the tendency to align with user beliefs regardless of correctness. This tendency is reinforced by preference-based alignment techniques that optimize for user satisfaction but can undermine truthfulness. While relatively benign in casual dialogue, sycophancy poses serious risks in high-stakes settings such as scientific question answering (QA), where model outputs may shape collaborative reasoning, decision-making, and knowledge formation. Despite its importance, this phenomenon remains underexamined in factual QA contexts. 
We address this gap by introducing a unified evaluation framework to quantify the impact of sycophantic context on model behavior in scientific QA, measuring how much user-imposed social pressure distorts model outputs. The framework incorporates adversarial prompting setups and targeted metrics, such as misleading resistance and sycophancy resistance, that capture a model's ability to maintain factual consistency under misleading cues. Systematic evaluations across open-source and proprietary models reveal pervasive sycophantic tendencies, driven more by alignment strategy than by model size. To mitigate this issue, we propose Pressure-Tune, a lightweight post-training method that fine-tunes models on synthetic adversarial dialogues paired with chain-of-thought rationales. These rationales reject user misinformation while reinforcing factual commitments. Experiments on challenging scientific QA benchmarks show that Pressure-Tune significantly enhances sycophancy resistance without compromising accuracy or responsiveness to valid feedback, offering a practical pathway toward more truthful and principled model behavior.
[21] MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment
Shengchao Liu,Xiaoming Liu,Chengzhengxu Li,Zhaohan Zhang,Guoxin Ma,Yu Lan,Shuai Xiao
Main category: cs.CL
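TL;DR: This paper proposes MGT-Prism, a machine-generated text detection method that works from the frequency-domain perspective, using a low-frequency domain filtering module and a dynamic spectrum alignment strategy to improve detector generalization across domains.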
Details
Motivation: Existing machine-generated text detectors perform well when trained and tested in the same domain but generalize poorly to unseen domains, because of domain shift between data from different sources. Method: The paper proposes MGT-Prism, a detection method from the frequency-domain perspective, with a low-frequency domain filtering module and a dynamic spectrum alignment strategy that filter out document-level features sensitive to domain shift and extract task-specific, domain-invariant features. Result: MGT-Prism outperforms state-of-the-art baselines by an average of 0.90% in accuracy and 0.92% in F1 score on 11 test datasets across three domain-generalization scenarios. Conclusion: MGT-Prism is an effective machine-generated text detection method that improves detector generalization across domains by working from the frequency-domain perspective. Abstract: Large Language Models have shown growing ability to generate fluent and coherent texts that are highly similar to the writing style of humans. Current detectors for Machine-Generated Text (MGT) perform well when they are trained and tested in the same domain but generalize poorly to unseen domains, due to domain shift between data from different sources. In this work, we propose MGT-Prism, an MGT detection method from the perspective of the frequency domain for better domain generalization. Our key insight stems from analyzing text representations in the frequency domain, where we observe consistent spectral patterns across diverse domains, while significant discrepancies in magnitude emerge between MGT and human-written texts (HWTs). The observation initiates the design of a low frequency domain filtering module for filtering out the document-level features that are sensitive to domain shift, and a dynamic spectrum alignment strategy to extract the task-specific and domain-invariant features for improving the detector's performance in domain generalization. Extensive experiments demonstrate that MGT-Prism outperforms state-of-the-art baselines by an average of 0.90% in accuracy and 0.92% in F1 score on 11 test datasets across three domain-generalization scenarios.
[22] Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study
Hanna Woloszyn,Benjamin Gagl
Main category: cs.CL
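TL;DR: This study analyses whether large language models can imitate child language and finds that LLM-generated texts differ markedly from child language in length, vocabulary, and semantic features; few-shot prompting helps slightly, but the models still largely fail to replicate child language patterns.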
Details
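Motivation: The role of large language models in education is growing, yet it remains unclear whether the language they generate approximates child language. The authors therefore analyse whether LLMs can replicate features of child language, in order to assess their suitability for child education and psycholinguistic research. Method: The authors compare two LLM-generated corpora with a corpus of German children's descriptions of picture stories. The models were prompted in two ways: zero-shot and few-shot. The analysis covers psycholinguistic text properties, including word frequency, lexical richness, sentence and word length, part-of-speech tags, and semantic similarity represented by word embeddings. Result: LLM-generated texts are longer than the children's texts but less lexically rich, rely more on high-frequency words, and under-use nouns. Semantic vector space analysis shows low similarity between the two, indicating marked differences at the semantic level. Few-shot prompting slightly increased the similarity between LLM and children's texts but still failed to replicate the lexical and semantic patterns of child language. Conclusion: The study finds that large language models (LLMs) are limited in imitating child language: although multimodal prompting (text + image) partly approximates features of child language, the generated texts still differ markedly from child language in lexical and semantic patterns. The results offer insights for the use of LLMs in psycholinguistic research and education, and raise important questions about whether LLM-generated language is appropriate for child-directed educational tools. Abstract: The role of large language models (LLMs) in education is increasing, yet little attention has been paid to whether LLM-generated text resembles child language. This study evaluates how LLMs replicate child-like language by comparing LLM-generated texts to a collection of German children's descriptions of picture stories. We generated two LLM-based corpora using the same picture stories and two prompt types: zero-shot and few-shot prompts specifying a general age from the children corpus. We conducted a comparative analysis across psycholinguistic text properties, including word frequency, lexical richness, sentence and word length, part-of-speech tags, and semantic similarity with word embeddings. The results show that LLM-generated texts are longer but less lexically rich, rely more on high-frequency words, and under-represent nouns. Semantic vector space analysis revealed low similarity, highlighting differences between the two corpora on the level of corpus semantics. Few-shot prompting increased similarities between children and LLM text to a minor extent, but still failed to replicate lexical and semantic patterns.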
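A few of the surface properties named in the analysis (lexical richness, sentence length, word length) can be illustrated with a minimal sketch. The abstract does not say which measures the authors used, so the type-token ratio, the regex tokenizer, and the `text_stats` helper below are assumptions chosen for illustration, not the study's pipeline.

```python
import re


def text_stats(text):
    """Compute simple surface statistics for one text (illustrative only)."""
    # Split on sentence-final punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens; \w matches Unicode letters such as umlauts.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens or not sentences:
        return {"tokens": 0, "type_token_ratio": 0.0,
                "mean_word_length": 0.0, "mean_sentence_length": 0.0}
    return {
        "tokens": len(tokens),
        # Type-token ratio: distinct words divided by total words.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences),
    }


stats = text_stats("Der Hund rennt. Der Hund bellt.")
print(stats)
```

The type-token ratio depends on text length, so comparing corpora of different sizes would need a length-normalised measure or equal-sized samples.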
The findings contribute to our understanding of how LLMs approximate child language through multimodal prompting (text + image) and give insights into their use in psycholinguistic research and education while raising important questions about the appropriateness of LLM-generated language in child-directed educational tools.
[23] TracSum: A New Benchmark for Aspect-Based Summarization with Sentence-Level Traceability in Medical Domain
Bohao Chu,Meijie Li,Sameh Frihat,Chengyu Gu,Georg Lodde,Elisabeth Livingstone,Norbert Fuhr
Main category: cs.CL
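TL;DR: This study proposes TracSum, a new benchmark for medical summarization that uses sentence-level citation tracing to improve summary accuracy and completeness.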
Details
Motivation: To address the factual accuracy problems of summaries that LLMs generate in the medical domain. Method: The paper introduces the TracSum benchmark and the Track-Then-Sum summarization pipeline, and designs a fine-grained evaluation framework. Result: Experiments and human evaluation validate TracSum; performing sentence-level tracking before summarization improves accuracy, and incorporating the full context further improves completeness. Conclusion: TracSum is an effective benchmark for traceable, aspect-based summarization and helps improve generation accuracy and completeness. Abstract: While document summarization with LLMs has enhanced access to textual information, concerns about the factual accuracy of these summaries persist, especially in the medical domain. Tracing evidence from which summaries are derived enables users to assess their accuracy, thereby alleviating this concern. In this paper, we introduce TracSum, a novel benchmark for traceable, aspect-based summarization, in which generated summaries are paired with sentence-level citations, enabling users to trace back to the original context. First, we annotate 500 medical abstracts for seven key medical aspects, yielding 3.5K summary-citation pairs. We then propose a fine-grained evaluation framework for this new task, designed to assess the completeness and consistency of generated content using four metrics. Finally, we introduce a summarization pipeline, Track-Then-Sum, which serves as a baseline method for comparison. In experiments, we evaluate both this baseline and a set of LLMs on TracSum, and conduct a human evaluation to assess the evaluation results. The findings demonstrate that TracSum can serve as an effective benchmark for traceable, aspect-based summarization tasks. We also observe that explicitly performing sentence-level tracking prior to summarization enhances generation accuracy, while incorporating the full context further improves completeness.
[24] Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding
Maciej Skorski,Alina Landowska
Main category: cs.CL
TL;DR: Large language models outperform the average human in detecting moral dimensions with higher accuracy and fewer false negatives, as revealed by a Bayesian evaluation framework.
Details
Motivation: To understand how large language models compare to humans in perceiving moral dimensions, moving beyond deterministic ground truth approaches. Method: A GPU-optimized Bayesian framework was used to evaluate top language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) on 250K+ annotations from ~700 annotators across 100K+ texts. The method captured both aleatoric and epistemic uncertainty. Result: Language models ranked in the top 25% of human annotators, showed better-than-average balanced accuracy, and produced significantly fewer false negatives. Conclusion: AI models demonstrate superior moral detection capabilities compared to the average human, with fewer false negatives and better balanced accuracy. Abstract: How do large language models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluate top language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from ~700 annotators on 100K+ texts spanning social media, news, and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25% of human annotators, achieving much better-than-average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
[25] Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs
Juncheng Xie,Hung-yi Lee
Main category: cs.CL
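TL;DR: This paper proposes a prompt-based method that lets large language models control the length of generated text exactly, without fine-tuning or iterative sampling.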
Details
Motivation: Large language models (LLMs) often overshoot or undershoot explicit length instructions because they cannot reliably keep an internal token count. Method: The method appends countdown markers and explicit counting rules to the prompt, so that the model counts as it writes and thereby controls the generated length exactly. Result: The method was tested in four settings; on MT-Bench-LI, strict length compliance with GPT-4.1 rises from below 30% to above 95%, while answer quality is preserved. Conclusion: The paper shows that precise length control can be achieved through prompt engineering alone, offering a lightweight alternative to training- or decoding-based methods. Abstract: Controlling the length of text produced by large language models (LLMs) remains challenging: models frequently overshoot or undershoot explicit length instructions because they cannot reliably keep an internal token count. We present a prompt-based, one-shot strategy that compels an off-the-shelf LLM to generate exactly a desired number of tokens - words (English) or characters (Chinese) - without any fine-tuning or iterative sampling. The prompt appends countdown markers and explicit counting rules so that the model "writes while counting." We evaluate on four settings: open-ended generation (1-1000 tokens), XSUM summarization, MT-Bench-LI instruction following, and the LIFEBENCH equal-length track. On MT-Bench-LI, strict length compliance with GPT-4.1 leaps from below 30% under naive prompts to above 95% with our countdown prompt, surpassing the popular draft-then-revise baseline, while judged answer quality is preserved. These results show that precise length control can be achieved through prompt engineering alone, offering a lightweight alternative to training- or decoding-based methods.
[26] The illusion of a perfect metric: Why evaluating AI's words is harder than it looks
Maria Paz Oliva,Adriana Correia,Ivan Vankov,Viktor Botev
Main category: cs.CL
TL;DR: This paper analyzes the limitations of current NLG evaluation metrics and highlights the need for task-specific metric selection and improved validation practices.
Details
Motivation: The motivation stems from the lack of a definitive evaluation metric for NLG despite the practical importance of such metrics in AI adoption and the increasing reliance on automatic evaluation methods. Method: The paper conducts a comprehensive examination of existing evaluation metrics, their strengths and limitations, validation methods, and correlations with human judgment. Result: Key challenges identified include metrics capturing only specific aspects of text quality, varying effectiveness across tasks and datasets, inconsistent validation practices, and unreliable correlations with human judgment, even in recent LLM-based evaluators and RAG evaluations. Conclusion: The paper concludes that there is no perfect evaluation metric for NLG and emphasizes the need to select metrics based on task-specific requirements. It advocates for complementary evaluations and improved validation methodologies for new metrics. Abstract: Evaluating Natural Language Generation (NLG) is crucial for the practical adoption of AI, but has been a longstanding research challenge. While human evaluation is considered the de-facto standard, it is expensive and lacks scalability. Practical applications have driven the development of various automatic evaluation metrics (AEM), designed to compare the model output with human-written references, generating a score which approximates human judgment. Over time, AEMs have evolved from simple lexical comparisons, to semantic similarity models and, more recently, to LLM-based evaluators. However, it seems that no single metric has emerged as a definitive solution, resulting in studies using different ones without fully considering the implications. This paper aims to show this by conducting a thorough examination of the methodologies of existing metrics, their documented strengths and limitations, validation methods, and correlations with human judgment. 
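To make the "simple lexical comparisons" end of this spectrum concrete, here is a minimal sketch of a unigram-overlap F1 score between a model output and a human-written reference. It is not one of the metrics examined in the paper; the whitespace tokenizer and the `unigram_f1` name are assumptions chosen to keep the example short.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over shared lowercase word tokens (illustrative lexical metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("the cat sat", "the cat sat"))      # identical wording
print(unigram_f1("a feline rested", "the cat sat"))  # paraphrase, no shared tokens
```

The second call scores zero although the sentences mean roughly the same thing, which is the kind of blind spot that semantic-similarity and LLM-based evaluators were introduced to cover.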
We identify several key challenges: metrics often capture only specific aspects of text quality, their effectiveness varies by task and dataset, validation practices remain unstructured, and correlations with human judgment are inconsistent. Importantly, we find that these challenges persist in the most recent type of metric, LLM-as-a-Judge, as well as in the evaluation of Retrieval Augmented Generation (RAG), an increasingly relevant task in academia and industry. Our findings challenge the quest for the 'perfect metric'. We propose selecting metrics based on task-specific needs and leveraging complementary evaluations and advocate that new metrics should focus on enhanced validation methodologies.
[27] Extracting Structured Requirements from Unstructured Building Technical Specifications for Building Information Modeling
Insaf Nahri,Romain Pinquié,Philippe Véron,Nicolas Bus,Mathieu Thorel
Main category: cs.CL
TL;DR: This study integrates BIM with NLP to automate requirement extraction from French BTS documents, using NER and RE techniques with transformer-based and Random Forest models.
Details
Motivation: The study aims to automate the extraction of requirements from unstructured French Building Technical Specification documents in the construction industry. Method: Named Entity Recognition and Relation Extraction techniques were employed using CamemBERT, Fr_core_news_lg, and Random Forest models. A custom feature vector was used for RE, and a hand-crafted annotated dataset was used for evaluation. Result: CamemBERT and Fr_core_news_lg achieved F1-scores over 90% in NER, and Random Forest attained an F1 score above 80% in RE. Conclusion: CamemBERT and Fr_core_news_lg performed best in NER, while Random Forest was most effective in RE. The results aim to be represented as a knowledge graph in future work. Abstract: This study explores the integration of Building Information Modeling (BIM) with Natural Language Processing (NLP) to automate the extraction of requirements from unstructured French Building Technical Specification (BTS) documents within the construction industry. Employing Named Entity Recognition (NER) and Relation Extraction (RE) techniques, the study leverages the transformer-based model CamemBERT and applies transfer learning with the French language model Fr_core_news_lg, both pre-trained on a large French corpus in the general domain. To benchmark these models, additional approaches ranging from rule-based to deep learning-based methods are developed. For RE, four different supervised models, including Random Forest, are implemented using a custom feature vector. A hand-crafted annotated dataset is used to compare the effectiveness of NER approaches and RE models. Results indicate that CamemBERT and Fr_core_news_lg exhibited superior performance in NER, achieving F1-scores over 90%, while Random Forest proved most effective in RE, with an F1 score above 80%.
The outcomes are intended to be represented as a knowledge graph in future work to further enhance automatic verification systems.
[28] MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
Jiacheng Ruan,Dan Jiang,Xian Gao,Ting Liu,Yuzhuo Fu,Yangyang Kang
Main category: cs.CL
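TL;DR: This paper proposes MME-SCI, a new science benchmark for evaluating multimodal large language models that addresses the shortcomings of existing benchmarks in multilingual coverage, modality coverage, and fine-grained knowledge annotation.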
Details
Motivation: Existing science benchmarks fall short in multilingual scenarios, modality coverage, and annotation of scientific knowledge points. Method: The authors collected 1,019 high-quality question-answer pairs covering four subjects and five languages, and ran experiments on 16 open-source and 4 closed-source models. Result: MME-SCI is broadly challenging for existing MLLMs; for example, under the Image-only evaluation mode, o4-mini achieves markedly low accuracy across subjects. Conclusion: MME-SCI is a challenging multimodal science benchmark that comprehensively evaluates the reasoning abilities of existing MLLMs along with their multilingual and multimodal coverage. Abstract: Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs' comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is widely challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracy of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed existing models' performance in depth and identified their weaknesses in specific domains.
The Data and Evaluation Code are available at https://github.com/JCruan519/MME-SCI.
[29] ReviewGraph: A Knowledge Graph Embedding Based Framework for Review Rating Prediction with Sentiment Features
A. J. W. de Vink,Natalia Amat-Lefort,Lifeng Han
Main category: cs.CL
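TL;DR: This paper proposes ReviewGraph, a graph-based method for review rating prediction that performs comparably to large language models at lower computational cost, with better interpretability and potential for integration into RAG systems.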
Details
Motivation: The hospitality industry needs to understand the factors that drive customer ratings in order to improve guest satisfaction and business performance. Existing NLP methods and large language models are effective but computationally expensive and lack interpretability. Method: ReviewGraph converts reviews into knowledge graphs by extracting (subject, predicate, object) triples and associating sentiment scores, generates graph embeddings with Node2Vec, and predicts ratings with machine learning classifiers. Result: On the HotelRec dataset, ReviewGraph performs comparably to state-of-the-art models and outperforms traditional baselines on metrics such as Cohen's Kappa, while offering better interpretability and integration potential. Conclusion: ReviewGraph is a graph-based framework for review rating prediction that converts textual reviews into knowledge graphs and combines graph embeddings with sentiment features, achieving predictive performance comparable to large language models at lower computational cost. Abstract: In the hospitality industry, understanding the factors that drive customer review ratings is critical for improving guest satisfaction and business performance. This work proposes ReviewGraph for Review Rating Prediction (RRP), a novel framework that transforms textual customer reviews into knowledge graphs by extracting (subject, predicate, object) triples and associating sentiment scores. Using graph embeddings (Node2Vec) and sentiment features, the framework predicts review rating scores through machine learning classifiers. We compare ReviewGraph performance with traditional NLP baselines (such as Bag of Words, TF-IDF, and Word2Vec) and large language models (LLMs), evaluating them in the HotelRec dataset. In comparison to the state-of-the-art literature, our proposed model performs similarly to their best performing model but with lower computational cost (without ensemble). While ReviewGraph achieves comparable predictive performance to LLMs and outperforms baselines on agreement-based metrics such as Cohen's Kappa, it offers additional advantages in interpretability, visual exploration, and potential integration into Retrieval-Augmented Generation (RAG) systems. This work highlights the potential of graph-based representations for enhancing review analytics and lays the groundwork for future research integrating advanced graph neural networks and fine-tuned LLM-based extraction methods. We will open-source the ReviewGraph output and platform on our GitHub page https://github.com/aaronlifenghan/ReviewGraph
[30] Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization
Shaohua Duan,Xinze Li,Zhenghao Liu,Xiaoyuan Yi,Yukun Yan,Shuo Wang,Yu Gu,Ge Yu,Maosong Sun
Main category: cs.CL
TL;DR: LongMab-PO is a framework that enhances long-context capabilities of Large Language Models by leveraging a Multi-Armed Bandit strategy to generate high-quality and diverse responses, leading to improved performance on long-context reasoning tasks.
Details
Motivation: Long-context modeling is crucial for many real-world tasks, but existing approaches are limited by low diversity and factual inconsistencies in generated data. Method: LongMab-PO uses a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from long context for generating high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. Result: Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs. Conclusion: LongMab-PO improves the diversity and quality of preference data pairs, achieving state-of-the-art performance on long-context reasoning benchmarks. Abstract: Long-context modeling is critical for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning tasks. Recent studies have explored fine-tuning Large Language Models (LLMs) with synthetic data to enhance their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab-PO, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from the given long context for sampling high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. Specifically, we treat context chunks as arms of MAB, select chunks based on their expected reward scores to input into LLMs to generate responses, and iteratively update these scores based on reward feedback. This exploration and exploitation process enables the model to focus on the most relevant context segments, thereby generating and collecting high-quality and diverse responses. 
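The chunk-selection loop described above (treat chunks as arms, pick by expected reward, update from feedback) can be sketched as a standard bandit routine. The abstract does not name the bandit algorithm or the reward model, so the UCB1 selection rule, the `toy_reward` function, and every name below are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def ucb_select(counts, means, t, c=1.0):
    """Pick an arm: unplayed arms first, then the highest UCB1 score."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + c * math.sqrt(2 * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(num_chunks, reward_fn, rounds=200, seed=0):
    """Repeatedly select a chunk, observe a reward, and update its estimate."""
    rng = random.Random(seed)
    counts = [0] * num_chunks   # how often each chunk was selected
    means = [0.0] * num_chunks  # running mean reward per chunk
    for t in range(1, rounds + 1):
        arm = ucb_select(counts, means, t)
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


def toy_reward(arm, rng):
    # Hypothetical stand-in for scoring a response generated from this chunk;
    # chunk 2 is made the most informative one on purpose.
    base = [0.2, 0.4, 0.8, 0.3][arm]
    return base + rng.uniform(-0.1, 0.1)


counts, means = run_bandit(4, toy_reward)
print(counts, [round(m, 2) for m in means])
```

In a real pipeline the reward would come from scoring an LLM response conditioned on the selected chunk; the exploration bonus keeps rarely tried chunks in play, which is one way to get the diversity the abstract emphasises.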
Finally, we collect these generated responses from the rollout process and apply the DPO method to further optimize the LLM. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs, achieving state-of-the-art performance on long-context reasoning benchmarks. All code and data will be released at https://github.com/NEUIR/LongMab-PO.
[31] Ask Good Questions for Large Language Models
Qi Wu,Zhongqi Lu
Main category: cs.CL
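TL;DR: This paper proposes the Ask-Good-Question framework, which combines an improved Concept-Enhanced Item Response Theory model with large language models to improve information retrieval efficiency and user experience during question answering in dialog systems.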
Details
Motivation: Current dialog systems often fail to provide topic guidance because they cannot identify user confusion about related concepts. Method: An improved Concept-Enhanced Item Response Theory (CEIRT) model is introduced and combined with large language models to generate guiding questions. Result: Compared with other baseline methods, the proposed approach significantly improves information retrieval efficiency and user experience. Conclusion: By combining the improved CEIRT model with large language models, the Ask-Good-Question framework significantly improves users' information retrieval experience during question answering. Abstract: Recent advances in large language models (LLMs) have significantly improved the performance of dialog systems, yet current approaches often fail to provide accurate guidance of topic due to their inability to discern user confusion in related concepts. To address this, we introduce the Ask-Good-Question (AGQ) framework, which features an improved Concept-Enhanced Item Response Theory (CEIRT) model to better identify users' knowledge levels. Our contributions include applying the CEIRT model along with LLMs to directly generate guiding questions based on the inspiring text, greatly improving information retrieval efficiency during the question & answer process. Through comparisons with other baseline methods, our approach outperforms them by significantly enhancing users' information retrieval experiences.
[32] Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
Xiao Liang,Zhongzhi Li,Yeyun Gong,Yelong Shen,Ying Nian Wu,Zhijiang Guo,Weizhu Chen
Main category: cs.CL
TL;DR: The paper proposes an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training to maintain policy entropy and improve Pass@k performance.
Details
Motivation: Vanilla RLVR training improves Pass@1 performance at the expense of policy entropy, reducing generation diversity and limiting Pass@k performance. Method: An online Self-play with Variational problem Synthesis (SvS) strategy was proposed for RLVR training. Result: The SvS strategy substantially improves Pass@k compared with standard RLVR, achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on AIME24 and AIME25 benchmarks. Conclusion: The proposed SvS strategy effectively maintains policy entropy and improves Pass@k performance in RLVR training. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. 
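For readers unfamiliar with the metric, Pass@k is commonly estimated from n sampled solutions per problem, c of which are correct, as 1 - C(n-c, k) / C(n, k). The sketch below implements that widely used unbiased estimator; whether the authors compute Pass@32 exactly this way is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The estimator makes the entropy argument concrete: a policy that concentrates on one solution path raises c on problems it already solves, but leaves c at 0 on the rest, so Pass@k for large k stalls.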
Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
[33] Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation
Dongyoon Hahm,Taywon Min,Woogyeol Jin,Kimin Lee
Main category: cs.CL
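TL;DR: This paper proposes PING, a method that prepends natural language prefixes to agent responses so that large language models are safer when performing agentic tasks and avoid executing harmful ones; experiments validate its effectiveness.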
Details
Motivation: Large language models (LLMs) have evolved into agentic systems that can plan and interact with external tools, but safety is often overlooked during fine-tuning. This paper addresses the problem that fine-tuned LLMs can unintentionally become unsafe, and aims to improve their ability to refuse harmful tasks. Method: The paper introduces an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting prefixes that optimize both task performance and refusal behavior. In addition, linear probes on internal hidden states confirm that prefix tokens are important for behavior modification. Result: Experiments show that PING significantly improves the safety of fine-tuned LLM agents while preserving their performance on benign tasks. PING outperforms existing prompting approaches across benchmarks covering web navigation and code generation tasks. Conclusion: The paper proposes PING, which prepends automatically generated natural language prefixes to agent responses to improve the safety of fine-tuned LLM agents while preserving their effectiveness. Abstract: Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains.
WARNING: This paper contains contents that are unethical or offensive in nature.
[34] The Promise of Large Language Models in Digital Health: Evidence from Sentiment Analysis in Online Health Communities
Xiancheng Li,Georgios D. Karampatakis,Helen E. Wood,Chris J. Griffiths,Borislava Mihaylova,Neil S. Coulson,Alessio Pasinato,Pietro Panzarasa,Marco Viviani,Anna De Simoni
Main category: cs.CL
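TL;DR: This study proposes a new approach to healthcare sentiment analysis that combines large language models with expert knowledge, addressing the limitations of traditional methods with respect to data, privacy, and domain expertise.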
Details
Motivation: Digital health analytics faces challenges including scarce expert domain knowledge, data shortage, and privacy constraints, especially in online health communities (OHCs), where sentiment analysis requires expertise to handle mixed sentiment, clinical terminology, and implicit emotional expressions. Method: A structured codebook was developed to systematically encode expert interpretation guidelines, enabling LLMs to apply domain-specific knowledge through targeted prompting rather than relying on extensive training data. Result: LLMs performed strongly on the sentiment analysis task. Compared with BioBERT variants and lexicon-based methods, the LLMs (six GPT models, DeepSeek, and LLaMA 3.1) achieved performance comparable to expert level on 400 expert-annotated OHC posts, with no statistically significant difference. Conclusion: By integrating expert knowledge in context, this study shows how large language models (LLMs) can offer a scalable solution for digital health analytics, addressing the key challenge of expert knowledge shortage in healthcare. Abstract: Digital health analytics face critical challenges nowadays. The sophisticated analysis of patient-generated health content, which contains complex emotional and medical contexts, requires scarce domain expertise, while traditional ML approaches are constrained by data shortage and privacy limitations in healthcare settings. Online Health Communities (OHCs) exemplify these challenges with mixed-sentiment posts, clinical terminology, and implicit emotional expressions that demand specialised knowledge for accurate Sentiment Analysis (SA). To address these challenges, this study explores how Large Language Models (LLMs) can integrate expert knowledge through in-context learning for SA, providing a scalable solution for sophisticated health data analysis. Specifically, we develop a structured codebook that systematically encodes expert interpretation guidelines, enabling LLMs to apply domain-specific knowledge through targeted prompting rather than extensive training. Six GPT models validated alongside DeepSeek and LLaMA 3.1 are compared with pre-trained language models (BioBERT variants) and lexicon-based methods, using 400 expert-annotated posts from two OHCs. LLMs achieve superior performance while demonstrating expert-level agreement. This high agreement, with no statistically significant difference from inter-expert agreement levels, suggests knowledge integration beyond surface-level pattern recognition. The consistent performance across diverse LLM models, supported by in-context learning, offers a promising solution for digital health analytics.
This approach addresses the critical challenge of expert knowledge shortage in digital health research, enabling real-time, expert-quality analysis for patient monitoring, intervention assessment, and evidence-based health strategies.
cs.CV [Back]
[35] YOLO11-CR: a Lightweight Convolution-and-Attention Framework for Accurate Fatigue Driving Detection
Zhebin Jin,Ligang Dong
Main category: cs.CV
TL;DR: This paper proposes YOLO11-CR, a lightweight and efficient vision-based model for real-time driver fatigue detection, which outperforms existing methods by incorporating novel modules that enhance feature expressiveness and spatial localization accuracy.
Details
Motivation: Driver fatigue detection is crucial for reducing road traffic accidents. Vision-based methods are non-intrusive and scalable but face challenges like poor detection of small or occluded objects and limited multi-scale feature modeling. Method: YOLO11-CR introduces two modules: Convolution-and-Attention Fusion Module (CAFM) and Rectangular Calibration Module (RCM), which enhance feature expressiveness and spatial localization accuracy. Result: On the DSM dataset, YOLO11-CR achieved precision of 87.17%, recall of 83.86%, mAP@50 of 88.09%, and mAP@50-95 of 55.93%, outperforming baseline models. Ablation studies confirmed the effectiveness of CAFM and RCM. Conclusion: YOLO11-CR provides a practical and high-performing solution for in-vehicle fatigue monitoring and has strong potential for real-world deployment and future enhancements. Abstract: Driver fatigue detection is of paramount importance for intelligent transportation systems due to its critical role in mitigating road traffic accidents. While physiological and vehicle dynamics-based methods offer accuracy, they are often intrusive, hardware-dependent, and lack robustness in real-world environments. Vision-based techniques provide a non-intrusive and scalable alternative, but still face challenges such as poor detection of small or occluded objects and limited multi-scale feature modeling. To address these issues, this paper proposes YOLO11-CR, a lightweight and efficient object detection model tailored for real-time fatigue detection. YOLO11-CR introduces two key modules: the Convolution-and-Attention Fusion Module (CAFM), which integrates local CNN features with global Transformer-based context to enhance feature expressiveness; and the Rectangular Calibration Module (RCM), which captures horizontal and vertical contextual information to improve spatial localization, particularly for profile faces and small objects like mobile phones. 
Experiments on the DSM dataset demonstrate that YOLO11-CR achieves a precision of 87.17%, recall of 83.86%, mAP@50 of 88.09%, and mAP@50-95 of 55.93%, outperforming baseline models significantly. Ablation studies further validate the effectiveness of the CAFM and RCM modules in improving both sensitivity and localization accuracy. These results demonstrate that YOLO11-CR offers a practical and high-performing solution for in-vehicle fatigue monitoring, with strong potential for real-world deployment and future enhancements involving temporal modeling, multi-modal data integration, and embedded optimization.
[36] MIRAGE: Towards AI-Generated Image Detection in the Wild
Cheng Xia,Manxi Lin,Jiexiang Tan,Xiaoxiong Du,Yang Qiu,Junjun Zheng,Xiangheng Kong,Yuning Jiang,Bo Zheng
Main category: cs.CV
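TL;DR: This paper introduces Mirage, a benchmark for evaluating AI-generated image detection in real-world settings, and proposes a model called Mirage-R1 to improve detection performance.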
Details
Motivation: Existing AI-generated image detectors are effective in laboratory settings but generalize poorly to real-world scenarios. Method: A challenging benchmark called Mirage is built from AI-generated images collected from the Internet and a synthesized dataset, and a vision-language model called Mirage-R1 is proposed for detection. Result: Mirage-R1 leads existing detectors by 5% on the Mirage benchmark and by 10% on the public benchmark. Conclusion: The proposed Mirage-R1 model effectively balances inference speed and performance and improves AI-generated image detection. Abstract: The spreading of AI-generated images (AIGI), driven by advances in generative AI, poses a significant threat to information security and public trust. Existing AIGI detectors, while effective against images in clean laboratory settings, fail to generalize to in-the-wild scenarios. These real-world images are noisy, varying from "obviously fake" images to realistic ones derived from multiple generative models and further edited for quality control. We address in-the-wild AIGI detection in this paper. We introduce Mirage, a challenging benchmark designed to emulate the complexity of in-the-wild AIGI. Mirage is constructed from two sources: (1) a large corpus of Internet-sourced AIGI verified by human experts, and (2) a synthesized dataset created through the collaboration between multiple expert generators, closely simulating the realistic AIGI in the wild. Building on this benchmark, we propose Mirage-R1, a vision-language model with heuristic-to-analytic reasoning, a reflective reasoning mechanism for AIGI detection. Mirage-R1 is trained in two stages: a supervised-fine-tuning cold start, followed by a reinforcement learning stage. By further adopting an inference-time adaptive thinking strategy, Mirage-R1 is able to provide either a quick judgment or a more robust and accurate conclusion, effectively balancing inference speed and performance. Extensive experiments show that our model leads state-of-the-art detectors by 5% and 10% on Mirage and the public benchmark, respectively. The benchmark and code will be made publicly available.
[37] DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model
Qian Chen,Xianyin Zhang,Lifan Guo,Feng Chen,Chi Zhang
Main category: cs.CV
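TL;DR: This paper proposes DianJin-OCR-R1, a reasoning-enhanced OCR framework that reduces the hallucinations of vision-language models on OCR tasks by incorporating the results of expert models, and validates its superior performance on multiple datasets.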
Details
Motivation: Generative LVLMs are prone to hallucinations on OCR tasks (that is, generating content that does not exist in the input image) and tend to be less effective than expert models trained for specific domains. A method is therefore needed that reduces hallucinations and improves OCR performance. Method: The paper proposes the DianJin-OCR-R1 framework, which addresses hallucination in generative LVLMs on OCR tasks by training reasoning-and-tool interleaved vision-language models (VLMs). Specifically, the model first recognizes the content of the input image using its own OCR capabilities, then calls other expert models to obtain reference results, and finally re-examines the image and rethinks its reasoning to produce the final recognition result. Result: DianJin-OCR-R1 performs well across multiple OCR tasks; on the ReST and OmniDocBench datasets in particular, it outperforms both its non-reasoning counterparts and expert OCR models, demonstrating the effectiveness of the reasoning-and-tool interleaved approach. Conclusion: DianJin-OCR-R1 effectively reduces the hallucinations of LVLMs on OCR tasks, and experiments on ReST and OmniDocBench show that DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, demonstrating the effectiveness of the method. Abstract: Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, similarly to large language models (LLMs), are prone to hallucinations, that is, generating words that do not exist in input images. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks compared to expert models that are trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework designed to address these limitations through training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content in the input image by its own OCR capabilities, then calls other tools (i.e., other expert models) to obtain their results as references, and finally looks at the image again and rethinks the reasoning process to provide the final recognized content. Since architectures of expert models are tailored for specific OCR tasks, which makes them less prone to hallucinations, their results can help VLMs mitigate hallucinations. Additionally, expert models are typically smaller in scale and easy to iterate, enabling performance improvements for VLMs at a lower cost.
We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, which proves the effectiveness of our method.
[38] Exploration of Deep Learning Based Recognition for Urdu Text
Sumaiya Fazal,Sheeraz Ahmed
Main category: cs.CV
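TL;DR: This paper presents an approach to Urdu optical character recognition based on a convolutional neural network (CNN); by generating a dedicated dataset and applying a hierarchical neural network, it achieves high component classification accuracy.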
Details
Motivation: Urdu has a complex geometrical and morphological structure, and because of its context sensitivity, traditional segmentation-based recognition has a high error rate, so an efficient feature learning technique is needed to improve recognition accuracy. Method: The paper uses a convolutional neural network (CNN) for automatic feature learning, generates an Urdu text dataset through a permutation process, and applies a connected component technique to discard unnecessary images and obtain ligatures. A two-level hierarchical neural network is implemented to handle character permutations and component classification. Result: The proposed model achieved 0.99% for component classification. Conclusion: The paper proposes a CNN-based Urdu optical character recognition system that achieved 0.99% for component classification. Abstract: Urdu is a cursive script language and has similarities with Arabic and many other South Asian languages. Urdu is difficult to classify due to its complex geometrical and morphological structure. Character classification can be processed further if the segmentation technique is efficient, but due to context sensitivity in Urdu, segmentation-based recognition often results in a high error rate. Our proposed approach for an Urdu optical character recognition system is a component-based classification relying on an automatic feature learning technique called a convolutional neural network. The CNN is trained and tested on an Urdu text dataset, which is generated through a permutation process of three characters and then filtered by applying a connected component technique to discard unnecessary images and obtain ligatures only. A hierarchical neural network is implemented with two levels to deal with three degrees of character permutations and component classification. Our model successfully achieved 0.99% for component classification.
[39] CLoE: Curriculum Learning on Endoscopic Images for Robust MES Classification
Zeynep Ozdemir,Hacer Yalim Keles,Omer Ozgur Tanriover
Main category: cs.CV
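TL;DR: This paper proposes CLoE, a new curriculum learning framework that improves ordinal classification under label uncertainty by using image quality as a proxy for annotation confidence, combined with ResizeMix augmentation for robustness.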
Details
Motivation: MES classification remains challenging because of label noise from inter-observer variability and the ordinal nature of the score, both of which standard models often ignore. Method: CLoE, a curriculum learning framework, uses image quality as a proxy for annotation confidence and combines it with ResizeMix augmentation to improve robustness. Result: Experiments show that CLoE outperforms strong supervised and self-supervised baselines on both the LIMUC and HyperKvasir datasets. Conclusion: By accounting for label reliability and ordinal structure, CLoE improves ordinal classification under label uncertainty. Abstract: Estimating disease severity from endoscopic images is essential in assessing ulcerative colitis, where the Mayo Endoscopic Subscore (MES) is widely used to grade inflammation. However, MES classification remains challenging due to label noise from inter-observer variability and the ordinal nature of the score, which standard models often ignore. We propose CLoE, a curriculum learning framework that accounts for both label reliability and ordinal structure. Image quality, estimated via a lightweight model trained on Boston Bowel Preparation Scale (BBPS) labels, is used as a proxy for annotation confidence to order samples from easy (clean) to hard (noisy). This curriculum is further combined with ResizeMix augmentation to improve robustness. Experiments on the LIMUC and HyperKvasir datasets, using both CNNs and Transformers, show that CLoE consistently improves performance over strong supervised and self-supervised baselines. For instance, ConvNeXt-Tiny reaches 82.5% accuracy and a QWK of 0.894 on LIMUC with low computational cost. These results highlight the potential of difficulty-aware training strategies for improving ordinal classification under label uncertainty. Code will be released at https://github.com/zeynepozdemir/CLoE.
[40] GaitCrafter: Diffusion Model for Biometric Preserving Gait Synthesis
Sirshapan Mitra,Yogesh S. Rawat
Main category: cs.CV
TL;DR: GaitCrafter is a diffusion-based gait synthesis method that generates high-quality, controllable, privacy-preserving gait data and improves gait recognition.
Details
Motivation: Gait recognition is limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples while preserving privacy, so an effective synthesis method is needed. Method: A video diffusion model trained from scratch on gait silhouette data generates identity-preserving, temporally consistent gait sequences and supports controllable generation conditioned on various covariates. Result: Synthetic gait samples improve the performance of gait recognition models, especially under challenging conditions. The paper also introduces a mechanism that generates novel identities by interpolating identity embeddings; these identities have unique gait patterns and preserve privacy. Conclusion: GaitCrafter is a diffusion-based, controllable, privacy-aware method for generating gait data that provides high-quality synthetic data for gait recognition. Abstract: Gait recognition is a valuable biometric task that enables the identification of individuals from a distance based on their walking patterns. However, it remains limited by the lack of large-scale labeled datasets and the difficulty of collecting diverse gait samples for each individual while preserving privacy. To address these challenges, we propose GaitCrafter, a diffusion-based framework for synthesizing realistic gait sequences in the silhouette domain. Unlike prior works that rely on simulated environments or alternative generative models, GaitCrafter trains a video diffusion model from scratch, exclusively on gait silhouette data. Our approach enables the generation of temporally consistent and identity-preserving gait sequences. Moreover, the generation process is controllable, allowing conditioning on various covariates such as clothing, carried objects, and view angle. We show that incorporating synthetic samples generated by GaitCrafter into the gait recognition pipeline leads to improved performance, especially under challenging conditions. Additionally, we introduce a mechanism to generate novel identities (synthetic individuals not present in the original dataset) by interpolating identity embeddings. These novel identities exhibit unique, consistent gait patterns and are useful for training models while maintaining privacy of real subjects. Overall, our work takes an important step toward leveraging diffusion models for high-quality, controllable, and privacy-aware gait data generation.
[41] Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
Minhao Xiong,Zichen Wen,Zhuangcheng Gu,Xuyang Liu,Rui Zhang,Hengrui Kang,Jiabing Yang,Junyuan Zhang,Weijia Li,Conghui He,Yafei Wang,Linfeng Zhang
Main category: cs.CV
TL;DR: Prune2Drive is a framework that efficiently reduces computational overhead in Vision-Language Models used for autonomous driving by pruning visual tokens without significant performance loss.
Details
Motivation: The computational overhead in processing high-resolution, multi-view images hinders the deployment of Vision-Language Models in autonomous driving. Method: Prune2Drive uses a diversity-aware token selection mechanism and a view-adaptive pruning controller to reduce the number of visual tokens processed. Result: Experiments showed significant speedups and memory savings while maintaining or improving task performance on DriveLM and DriveLMM-o1 benchmarks. Conclusion: Prune2Drive offers a practical solution to reduce computational overhead in Vision-Language Models for autonomous driving without significant performance loss. Abstract: Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images, a standard setup in AD systems with six or more synchronized cameras. This overhead stems from the large number of visual tokens generated during encoding, increasing inference latency and memory consumption due to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism inspired by farthest point sampling, which prioritizes semantic and spatial coverage across views rather than relying solely on attention scores, and (ii) a view-adaptive pruning controller that learns optimal pruning ratios for each camera view based on their importance to downstream driving tasks. Unlike prior methods, Prune2Drive does not require model retraining or access to attention maps, making it compatible with modern efficient attention implementations. 
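The diversity-aware selection above is described as inspired by farthest point sampling (FPS). A minimal pure-Python sketch of plain FPS over token feature vectors follows; it illustrates the underlying idea only and is not the Prune2Drive selector. Euclidean distance, the `start` index, and the function name are assumptions.

```python
import math


def farthest_point_sampling(features, keep, start=0):
    """Greedily pick `keep` indices so that each new pick is the point
    farthest (in Euclidean distance) from everything already selected."""
    n = len(features)
    if not 0 < keep <= n:
        raise ValueError("keep must be in 1..len(features)")
    selected = [start]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = [math.dist(features[i], features[start]) for i in range(n)]
    min_dist[start] = -1.0  # mark as taken so it is never picked again
    while len(selected) < keep:
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(features[i], features[nxt]))
        min_dist[nxt] = -1.0
    return selected
```

Because each pick maximizes the distance to the tokens already kept, near-duplicate tokens are dropped first, which is the coverage property the abstract contrasts with ranking by attention score alone.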
Extensive experiments on two large-scale multi-view driving benchmarks, DriveLM and DriveLMM-o1, show that Prune2Drive achieves significant speedups and memory savings while maintaining or improving task performance. When retaining only 10% of the visual tokens, our method achieves a 6.40$\times$ speedup in the prefilling phase and consumes 13.4% of the original FLOPs, with only a 3% performance drop on the DriveLM benchmark.
[42] DAASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples
Abdullah Al Nomaan Nafi,Habibur Rahaman,Zafaryab Haider,Tanzim Mahfuz,Fnu Suya,Swarup Bhunia,Prabuddha Chakraborty
Main category: cs.CV
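TL;DR: DAASH is a meta-attack framework for generating perceptually aligned adversarial examples; its multi-stage strategy and meta-loss function yield higher attack success rates and better visual quality.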
Details
Motivation: Existing Lp-norm-constrained adversarial examples often do not align with human perception, so a framework that can generate perceptually aligned adversarial examples is needed. Method: DAASH works in multiple stages, aggregating candidate adversarial examples from multiple base attacks with learned adaptive weights, and a novel meta-loss function guides the generation process. Result: Evaluated on adversarially trained models on CIFAR-10, CIFAR-100, and ImageNet, DAASH achieves higher attack success rates and better visual quality than existing perceptual attacks. Conclusion: DAASH is an effective meta-attack framework that generates perceptually aligned adversarial examples and generalizes well to unseen defenses. Abstract: Numerous techniques have been proposed for generating adversarial examples in white-box settings under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only recently have a few methods begun specifically exploring perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DAASH, a fully differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DAASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DAASH on adversarially trained models across CIFAR-10, CIFAR-100, and ImageNet. Despite relying solely on Lp-constrained based methods, DAASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD, achieving higher attack success rates (e.g., 20.63% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of $\approx$ 11, 0.015, and 5.7, respectively).
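One stage of the weighted aggregation described above might look like the following sketch. It is an illustration under assumptions rather than DAASH itself: the abstract does not say how the learned weights are normalized, so softmax normalization, the flat-list image representation, and the name `aggregate_candidates` are all hypothetical, and the meta-loss that would train the logits is omitted.

```python
import math


def aggregate_candidates(candidates, logits):
    """Blend candidate adversarial examples with softmax weights.

    candidates: equally sized flat lists of pixel values, one per base attack.
    logits: one (hypothetically learned) score per base attack.
    """
    if not candidates or len(candidates) != len(logits):
        raise ValueError("need one logit per candidate")
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    # Convex combination of the candidates, pixel by pixel.
    blended = [
        sum(w * cand[j] for w, cand in zip(weights, candidates))
        for j in range(len(candidates[0]))
    ]
    return blended, weights
```

In a multi-stage setup, the blended output of one stage would serve as the starting point for the base attacks in the next stage.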
Furthermore, DAASH generalizes well to unseen defenses, making it a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense.
[43] Automated Assessment of Aesthetic Outcomes in Facial Plastic Surgery
Pegah Varghaei,Kiran Abraham-Aggarwal,Manoj T. Abraham,Arun Ross
Main category: cs.CV
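TL;DR: This paper presents a computer-vision framework for quantifying the aesthetic outcomes of facial plastic surgery. It combines automated landmark detection, geometric symmetry computation, deep-learning-based age estimation, and nasal morphology analysis, and assembles the largest dataset of pre- and post-operative images to date; results show significant improvements in nasal measurements and facial symmetry.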
Details
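Motivation: Aesthetic outcomes of facial plastic surgery are usually judged subjectively. This paper aims to provide quantifiable assessment through a scalable, interpretable computer-vision framework, thereby improving surgical planning and patient counseling. Method: The paper first assembles the largest dataset to date of pre- and post-operative facial images, analyzes it using automated landmark detection, geometric facial symmetry computation, deep-learning-based age estimation, and nasal morphology analysis, and also examines patient identity consistency and inter-practitioner variability in improvement rates. Result: 96.2% of rhinoplasty patients improved in at least one nasal measurement, with significant improvements in the alar width to face width ratio (77.0%), nose length to face height ratio (41.5%), and alar width to intercanthal ratio (39.3%). In the broader analysis of facial symmetry and perceived age, 71.3% of patients showed significant enhancement. In addition, post-operative identity matching reached True Match Rates of 99.5% and 99.6% for the two cohorts. Conclusion: The proposed computer-vision framework offers a scalable, interpretable way to quantify the aesthetic outcomes of facial plastic surgery; through automated landmark detection, geometric facial symmetry computation, deep-learning-based age estimation, and nasal morphology analysis, it opens new possibilities for data-driven surgical planning, patient counseling, and objective outcome evaluation. Abstract: We introduce a scalable, interpretable computer-vision framework for quantifying aesthetic outcomes of facial plastic surgery using frontal photographs. Our pipeline leverages automated landmark detection, geometric facial symmetry computation, deep-learning-based age estimation, and nasal morphology analysis. To perform this study, we first assemble the largest curated dataset of paired pre- and post-operative facial images to date, encompassing 7,160 photographs from 1,259 patients. This dataset includes a dedicated rhinoplasty-only subset consisting of 732 images from 366 patients, 96.2% of whom showed improvement in at least one of the three nasal measurements with statistically significant group-level change. Among these patients, the greatest statistically significant improvements (p < 0.001) occurred in the alar width to face width ratio (77.0%), nose length to face height ratio (41.5%), and alar width to intercanthal ratio (39.3%). Among the broader frontal-view cohort, comprising 989 rigorously filtered subjects, 71.3% exhibited significant enhancements in global facial symmetry or perceived age (p < 0.01). Importantly, our analysis shows that patient identity remains consistent post-operatively, with True Match Rates of 99.5% and 99.6% at a False Match Rate of 0.01% for the rhinoplasty-specific and general patient cohorts, respectively. Additionally, we analyze inter-practitioner variability in improvement rates.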
By providing reproducible, quantitative benchmarks and a novel dataset, our pipeline facilitates data-driven surgical planning, patient counseling, and objective outcome evaluation across practices.
[44] Applications of Small Language Models in Medical Imaging Classification with a Focus on Prompt Strategies
Yiting Wang,Ziwei Wang,Jiachen Zhong,Di Zhu,Weiyi Li
Main category: cs.CV
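TL;DR: This study examines how small language models (SLMs) perform on a medical imaging classification task. Comparing different models and prompt designs, it finds that well-designed prompts can substantially improve SLM performance, making SLMs competitive in healthcare applications.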
Details
Motivation: Large language models (LLMs) excel at natural language processing and multi-modal understanding, but their high computational cost, limited accessibility, and data privacy concerns restrict their use in resource-constrained healthcare environments. Investigating the potential of small language models (SLMs) for medical tasks is therefore important. Method: Using the NIH Chest X-ray dataset, multiple SLMs are evaluated on chest X-ray position classification under three prompt strategies: baseline instruction, incremental summary prompts, and correction-based reflective prompts. Result: Certain SLMs achieve accuracy comparable to that of large models when given carefully designed prompts, and prompt engineering can substantially improve SLM performance in healthcare applications. Conclusion: By optimizing prompt design, SLMs can achieve efficient and accurate performance in healthcare applications without requiring deep AI expertise from end users. Abstract: Large language models (LLMs) have shown remarkable capabilities in natural language processing and multi-modal understanding. However, their high computational cost, limited accessibility, and data privacy concerns hinder their adoption in resource-constrained healthcare environments. This study investigates the performance of small language models (SLMs) in a medical imaging classification task, comparing different models and prompt designs to identify the optimal combination for accuracy and usability. Using the NIH Chest X-ray dataset, we evaluate multiple SLMs on the task of classifying chest X-ray positions (anteroposterior [AP] vs. posteroanterior [PA]) under three prompt strategies: baseline instruction, incremental summary prompts, and correction-based reflective prompts. Our results show that certain SLMs achieve competitive accuracy with well-crafted prompts, suggesting that prompt engineering can substantially enhance SLM performance in healthcare applications without requiring deep AI expertise from end users.
[45] AIM 2025 Rip Current Segmentation (RipSeg) Challenge Report
Andrei Dumitriu,Florin Miron,Florin Tatui,Radu Tudor Ionescu,Radu Timofte,Aakash Ralhan,Florin-Alexandru Vasluianu,Shenyang Qian,Mitchell Harley,Imran Razzak,Yang Song,Pu Luo,Yumei Li,Cong Xu,Jinming Chai,Kexin Zhang,Licheng Jiao,Lingling Li,Siqi Yu,Chao Zhang,Kehuan Song,Fang Liu,Puhua Chen,Xu Liu,Jin Hu,Jinyang Xu,Biao Liu
Main category: cs.CV
TL;DR: 本文介绍了AIM 2025 RipSeg挑战赛,旨在推动自动离岸流分割技术的发展,通过对RipVIS数据集的研究,使用深度学习等方法取得了进展。
Details
Motivation: Rip currents are a major threat to beach safety, which makes accurate visual detection especially important. This work aims to advance automatic rip current segmentation. Method: Based on the RipVIS dataset, the challenge targets single-class instance segmentation and ranks teams by a composite score combining F1, F2, AP50, and AP[50:95]. Result: A total of 75 participants registered, yielding 5 valid test submissions. The top methods leveraged deep learning architectures, domain adaptation techniques, pretrained models, and domain generalization strategies to improve performance under diverse conditions. Conclusion: The report discusses the results of the AIM 2025 RipSeg Challenge, highlights progress in automatic rip current segmentation, and points out directions for future research. Abstract: This report presents an overview of the AIM 2025 RipSeg Challenge, a competition designed to advance techniques for automatic rip current segmentation in still images. Rip currents are dangerous, fast-moving flows that pose a major risk to beach safety worldwide, making accurate visual detection an important and underexplored research task. The challenge builds on RipVIS, the largest available rip current dataset, and focuses on single-class instance segmentation, where precise delineation is critical to fully capture the extent of rip currents. The dataset spans diverse locations, rip current types, and camera orientations, providing a realistic and challenging benchmark. In total, $75$ participants registered for this first edition, resulting in $5$ valid test submissions. Teams were evaluated on a composite score combining $F_1$, $F_2$, $AP_{50}$, and $AP_{[50:95]}$, ensuring robust and application-relevant rankings. The top-performing methods leveraged deep learning architectures, domain adaptation techniques, pretrained models, and domain generalization strategies to improve performance under diverse conditions. This report outlines the dataset details, competition framework, evaluation metrics, and final results, providing insights into the current state of rip current segmentation. We conclude with a discussion of key challenges, lessons learned from the submissions, and future directions for expanding RipSeg.
[46] Mitigating Easy Option Bias in Multiple-Choice Question Answering
Hao Zhang,Chen Li,Basura Fernando
Main category: cs.CV
TL;DR: This study identifies an Easy-Options Bias in VQA benchmarks and introduces GroundAttack to generate more realistic evaluations by creating hard negative options, providing a more accurate assessment of VLMs' QA abilities.
Details
Motivation: The motivation was to address the bias in VQA benchmarks that allows VLMs to select correct answers without needing the question, thus providing an unrealistic evaluation of VLMs' QA ability. Method: The researchers identified the EOB issue through grounding experiments and introduced GroundAttack to generate hard negative options. They applied it to the NExT-QA and MMStar datasets for evaluation. Result: Using GroundAttack, the researchers found that current VLMs approach random accuracies in (V+O) settings and drop to non-saturated accuracies in (V+Q+O) settings on EOB-free annotations, indicating a more realistic evaluation of VLMs' QA ability. Conclusion: The study concludes that current VLMs exhibit an Easy-Options Bias (EOB) in multiple-choice VQA benchmarks, which can be mitigated using the GroundAttack toolkit to generate more realistic evaluations. Abstract: In this early study, we observe an Easy-Options Bias (EOB) issue in some multiple-choice Visual Question Answering (VQA) benchmarks such as MMStar, RealWorldQA, SEED-Bench, NExT-QA, STAR benchmark and Video-MME. This bias allows vision-language models (VLMs) to select the correct answer using only the vision (V) and options (O) as inputs, without the need for the question (Q). Through grounding experiments, we attribute the bias to an imbalance in visual relevance: the correct answer typically aligns more closely with the visual contents than the negative options in feature space, creating a shortcut for VLMs to infer the answer via simple vision-option similarity matching. To fix this, we introduce GroundAttack, a toolkit that automatically generates hard negative options as visually plausible as the correct answer. We apply it to the NExT-QA and MMStar datasets, creating new EOB-free annotations.
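The vision-option shortcut described above can be illustrated with a toy sketch. It assumes hypothetical, hand-made feature vectors in place of a real vision-language encoder, and it picks the option whose features are most similar to the image features without ever reading the question. It illustrates the bias only; it is not the GroundAttack toolkit, and all names below are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def answer_without_question(image_feat, option_feats):
    """Return the index of the option most similar to the image (V+O only)."""
    scores = [cosine(image_feat, o) for o in option_feats]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical features; in both option sets the correct answer is index 0.
image = [0.9, 0.1, 0.0]
easy_options = [[0.8, 0.2, 0.1], [0.0, 0.1, 0.9], [0.1, 0.9, 0.0]]
hard_options = [[0.8, 0.2, 0.1], [0.9, 0.1, 0.0], [0.85, 0.15, 0.05]]

easy_pick = answer_without_question(image, easy_options)  # shortcut finds index 0
hard_pick = answer_without_question(image, hard_options)  # shortcut is misled
```

With the easy negatives, similarity alone recovers the correct option; once the negatives are as visually plausible as the answer, the same rule no longer singles it out, which is the effect the EOB-free annotations aim for.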
On these EOB-free annotations, current VLMs approach random accuracy under (V+O) settings and drop to non-saturated accuracy under (V+Q+O) settings, providing a more realistic evaluation of VLMs' QA ability. Code and new annotations will be released soon.
[47] Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference
Yunxiang Yang,Ningning Xu,Jidong J. Yang
Main category: cs.CV
TL;DR: The paper proposes VISTA, a compact Vision-Language Model for traffic scene understanding and risk inference. It uses knowledge distillation and structured prompting with larger VLMs to generate high-quality training data, enabling the smaller model to achieve strong performance while being efficient enough for real-time edge deployment.
Details
Motivation: Traditional approaches struggle with scalability and generalization in complex real-world traffic environments. The authors aim to develop a more efficient and robust solution for traffic scene understanding and risk inference. Method: The paper introduces a structured prompting and knowledge distillation framework using two large Vision-Language Models (GPT-4o and o3-mini) to generate high-quality annotations and risk assessments. A smaller student model (VISTA) is then fine-tuned on these outputs. Result: The compact VISTA model (3B parameters) achieves strong performance across captioning metrics (BLEU-4, METEOR, ROUGE-L, CIDEr) compared to its larger teacher models, while enabling real-time processing on edge devices. Conclusion: VISTA demonstrates that lightweight VLMs can achieve strong performance in traffic scene understanding and risk inference through effective knowledge distillation and structured multi-agent supervision. Abstract: Comprehensive highway scene understanding and robust traffic risk inference are vital for advancing Intelligent Transportation Systems (ITS) and autonomous driving. Traditional approaches often struggle with scalability and generalization, particularly under the complex and dynamic conditions of real-world environments. To address these challenges, we introduce a novel structured prompting and knowledge distillation framework that enables automatic generation of high-quality traffic scene annotations and contextual risk assessments. Our framework orchestrates two large Vision-Language Models (VLMs): GPT-4o and o3-mini, using a structured Chain-of-Thought (CoT) strategy to produce rich, multi-perspective outputs. These outputs serve as knowledge-enriched pseudo-annotations for supervised fine-tuning of a much smaller student VLM. 
The resulting compact 3B-scale model, named VISTA (Vision for Intelligent Scene and Traffic Analysis), is capable of understanding low-resolution traffic videos and generating semantically faithful, risk-aware captions. Despite its significantly reduced parameter count, VISTA achieves strong performance across established captioning metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr) when benchmarked against its teacher models. This demonstrates that effective knowledge distillation and structured multi-agent supervision can empower lightweight VLMs to capture complex reasoning capabilities. The compact architecture of VISTA facilitates efficient deployment on edge devices, enabling real-time risk monitoring without requiring extensive infrastructure upgrades.
[48] EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis
Shuai Tan,Bin Ji
Main category: cs.CV
TL;DR: EDTalk++ is a novel framework for controllable talking head generation that enables individual manipulation of facial features using four distinct latent spaces and an Audio-to-Motion module for audio-driven synthesis, demonstrating effective results.
Details
Motivation: The motivation is to achieve disentangled control over multiple facial motions and accommodate diverse input modalities, which enhances the application and entertainment of talking head generation by ensuring independent operation of facial features and their compatibility with different modal inputs. Method: The paper employs four lightweight modules to decompose facial dynamics into four distinct latent spaces representing mouth, pose, eye, and expression. They enforce orthogonality among bases and devise an efficient training strategy. Additionally, they propose an Audio-to-Motion module for audio-driven synthesis. Result: The experiments demonstrate the effectiveness of the proposed EDTalk++ framework in enabling individual manipulation of facial features like mouth shape, head pose, eye movement, and emotional expression with video or audio inputs. Conclusion: This paper concludes that EDTalk++ provides a novel full disentanglement framework for controllable talking head generation, allowing individual manipulation of facial features and accommodating diverse input modalities. Abstract: Achieving disentangled control over multiple facial motions and accommodating diverse input modalities greatly enhances the application and entertainment of the talking head generation. This necessitates a deep exploration of the decoupling space for facial features, ensuring that they a) operate independently without mutual interference and b) can be preserved to share with different modal inputs, both aspects often neglected in existing methods. To address this gap, this paper proposes EDTalk++, a novel full disentanglement framework for controllable talking head generation. Our framework enables individual manipulation of mouth shape, head pose, eye movement, and emotional expression, conditioned on video or audio inputs. 
Specifically, we employ four lightweight modules to decompose the facial dynamics into four distinct latent spaces representing mouth, pose, eye, and expression, respectively. Each space is characterized by a set of learnable bases whose linear combinations define specific motions. To ensure independence and accelerate training, we enforce orthogonality among bases and devise an efficient training strategy to allocate motion responsibilities to each space without relying on external knowledge. The learned bases are then stored in corresponding banks, enabling shared visual priors with audio input. Furthermore, considering the properties of each space, we propose an Audio-to-Motion module for audio-driven talking head synthesis. Experiments are conducted to demonstrate the effectiveness of EDTalk++.
[49] Revisiting MLLM Token Technology through the Lens of Classical Visual Coding
Jinming Liu,Junyan Lin,Yuntao Wei,Kele Shao,Keda Tao,Jianguo Huang,Xudong Yang,Zhibo Chen,Huan Wang,Xin Jin
Main category: cs.CV
TL;DR: This paper compares MLLM token technology and visual coding, highlighting mutual insights for improving efficiency and outlining future research directions.
Details
Motivation: The motivation stems from the shared core objective of maximizing information fidelity while minimizing computational cost between classical visual coding and MLLM token technology. Method: The paper uses established principles of visual coding to reexamine MLLM token technology, developing a unified formulation for a comparative analysis and synthesizing bidirectional insights. Result: The paper provides a structured technology comparison, identifies bidirectional insights for enhancing efficiency and robustness, and outlines future research directions and challenges. Conclusion: The study establishes a comprehensive comparison between MLLM token technology and visual coding, offering insights into mutual enhancements and future research directions. Abstract: Classical visual coding and Multimodal Large Language Model (MLLM) token technology share a core objective: maximizing information fidelity while minimizing computational cost. Therefore, this paper reexamines MLLM token technology, including tokenization, token compression, and token reasoning, through the established principles of the long-developed field of visual coding. From this perspective, we (1) establish a unified formulation bridging token technology and visual coding, enabling a systematic, module-by-module comparative analysis; (2) synthesize bidirectional insights, exploring how visual coding principles can enhance MLLM token techniques' efficiency and robustness, and conversely, how token technology paradigms can inform the design of next-generation semantic visual codecs; (3) outline promising future research directions and critical unsolved challenges. In summary, this study presents the first comprehensive and structured technology comparison of MLLM token and visual coding, paving the way for more efficient multimodal models and more powerful visual codecs simultaneously.
[50] Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs
Ivan Reyes-Amezcua,Francisco Lopez-Tiro,Clement Larose,Andres Mendez-Vazquez,Gilberto Ochoa-Ruiz,Christian Daul
Main category: cs.CV
TL;DR: This study shows that Vision Transformers (ViTs) outperform CNNs on kidney stone image classification, particularly under complex visual conditions.
Details
Motivation: Convolutional neural networks (CNNs) have a limited ability to capture long-range dependencies, which can hurt performance under variable imaging conditions. Method: The study compares Vision Transformers (ViTs) with CNN-based models on two ex vivo datasets comprising CCD camera and flexible ureteroscope images. Result: On the most visually complex subset (section patches from endoscopic images), the ViT model reached 95.2% accuracy and a 95.1% F1-score, versus 64.5% and 59.3% for the ResNet50 baseline. Conclusion: ViT-based architectures outperform conventional CNN models for kidney stone image classification and offer a scalable alternative. Abstract: Kidney stone classification from endoscopic images is critical for personalized treatment and recurrence prevention. While convolutional neural networks (CNNs) have shown promise in this task, their limited ability to capture long-range dependencies can hinder performance under variable imaging conditions. This study presents a comparative analysis between Vision Transformers (ViTs) and CNN-based models, evaluating their performance on two ex vivo datasets comprising CCD camera and flexible ureteroscope images. The ViT-base model pretrained on ImageNet-21k consistently outperformed a ResNet50 baseline across multiple imaging conditions. For instance, in the most visually complex subset (Section patches from endoscopic images), the ViT model achieved 95.2% accuracy and 95.1% F1-score, compared to 64.5% and 59.3% with ResNet50. In the mixed-view subset from CCD-camera images, ViT reached 87.1% accuracy versus 78.4% with CNN. These improvements extend across precision and recall as well. The results demonstrate that ViT-based architectures provide superior classification performance and offer a scalable alternative to conventional CNNs for kidney stone image analysis.
[51] STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models
Tinh-Anh Nguyen-Nhu,Triet Dao Hoang Minh,Dat To-Thanh,Phuc Le-Gia,Tuan Vo-Lan,Tien-Huy Nguyen
Main category: cs.CV
TL;DR: This paper proposes STER-VLM, a computationally efficient vision-language framework for traffic analysis, achieving strong results on multiple datasets and the AI City Challenge 2025 Track 2.
Details
Motivation: Current vision-language models (VLMs) for traffic analysis require significant computational resources and struggle with fine-grained spatio-temporal understanding. This work aims to develop a more efficient and effective framework for real-world applications. Method: STER-VLM uses caption decomposition, temporal frame selection with best-view filtering, reference-driven understanding, and curated visual/textual prompt techniques to enhance VLM performance. Result: Experiments on WTS and BDD datasets show significant improvements in semantic richness and traffic scene interpretation, with a test score of 55.655 in the AI City Challenge 2025 Track 2. Conclusion: STER-VLM provides an effective, resource-efficient solution for traffic analysis, as evidenced by its performance in the AI City Challenge 2025 Track 2 and results on datasets like WTS and BDD. Abstract: Vision-language models (VLMs) have emerged as powerful tools for enabling automated traffic analysis; however, current approaches often demand substantial computational resources and struggle with fine-grained spatio-temporal understanding. This paper introduces STER-VLM, a computationally efficient framework that enhances VLM performance through (1) caption decomposition to tackle spatial and temporal information separately, (2) temporal frame selection with best-view filtering for sufficient temporal information, and (3) reference-driven understanding for capturing fine-grained motion and dynamic context and (4) curated visual/textual prompt techniques. Experimental results on the WTS \cite{kong2024wts} and BDD \cite{BDD} datasets demonstrate substantial gains in semantic richness and traffic scene interpretation. 
Our framework is validated by a test score of 55.655 in the AI City Challenge 2025 Track 2, showing its effectiveness in advancing resource-efficient and accurate traffic analysis for real-world applications.
[52] MINR: Efficient Implicit Neural Representations for Multi-Image Encoding
Wenyong Zhou,Taiqiang Wu,Zhengwu Liu,Yuxin Cheng,Chen Zhang,Ngai Wong
Main category: cs.CV
TL;DR: The paper introduces MINR, a method to efficiently encode multiple images by sharing neural network layers, significantly reducing parameters while maintaining performance.
Details
Motivation: INRs are inefficient for multi-image encoding due to separate MLPs for each image. The authors aim to improve efficiency by exploiting similarities in layer weight distributions. Method: The authors propose MINR, which shares intermediate layers across multiple images while keeping input and output layers image-specific, along with a novel projection layer for capturing unique features. Result: MINR achieves up to 60% parameter reduction with comparable performance on image reconstruction and super-resolution, and scales well to 100 images with an average PSNR of 34 dB. Conclusion: MINR effectively reduces parameters by up to 60% while maintaining performance, showing robustness across various backbones. Abstract: Implicit Neural Representations (INRs) aim to parameterize discrete signals through implicit continuous functions. However, formulating each image with a separate neural network (typically a Multi-Layer Perceptron (MLP)) leads to computational and storage inefficiencies when encoding multiple images. To address this issue, we propose MINR, sharing specific layers to encode multiple images efficiently. We first compare the layer-wise weight distributions for several trained INRs and find that corresponding intermediate layers follow highly similar distribution patterns. Motivated by this, we share these intermediate layers across multiple images while preserving the input and output layers as input-specific. In addition, we design an extra novel projection layer for each image to capture its unique features. Experimental results on image reconstruction and super-resolution tasks demonstrate that MINR can save up to 60% of parameters while maintaining comparable performance. Particularly, MINR scales effectively to handle 100 images, maintaining an average peak signal-to-noise ratio (PSNR) of 34 dB.
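The parameter saving from sharing intermediate layers can be made concrete with a small counting sketch. The layer sizes, the number of images, and the number of shared layers below are hypothetical, and the per-image projection layer mentioned above is left out for simplicity, so the resulting ratio only illustrates the bookkeeping and is not the paper's reported figure.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def separate_inrs(num_images, in_dim, hidden, depth, out_dim):
    """Total parameters when every image gets its own MLP.

    `depth` is the number of hidden-to-hidden (intermediate) layers.
    """
    per_image = (linear_params(in_dim, hidden)
                 + depth * linear_params(hidden, hidden)
                 + linear_params(hidden, out_dim))
    return num_images * per_image

def shared_inrs(num_images, in_dim, hidden, depth, out_dim, shared_depth):
    """Total parameters when `shared_depth` intermediate layers are shared.

    Input and output layers (and any unshared intermediate layers) stay
    image-specific; the shared layers are counted once.
    """
    per_image = (linear_params(in_dim, hidden)
                 + (depth - shared_depth) * linear_params(hidden, hidden)
                 + linear_params(hidden, out_dim))
    shared = shared_depth * linear_params(hidden, hidden)
    return num_images * per_image + shared

# Hypothetical setting: 10 images, (x, y) -> RGB, width 256, 3 intermediate layers.
baseline = separate_inrs(10, 2, 256, 3, 3)
shared = shared_inrs(10, 2, 256, 3, 3, shared_depth=2)
saving = 1.0 - shared / baseline
```

The saving grows with the number of images and with the fraction of layers that are shared, since the shared layers are paid for once rather than once per image.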
Further analysis of various backbones demonstrates the robustness of the proposed MINR.
[53] Distribution-Aware Hadamard Quantization for Hardware-Efficient Implicit Neural Representations
Wenyong Zhou,Jiachen Ren,Taiqiang Wu,Yuxin Cheng,Zhengwu Liu,Ngai Wong
Main category: cs.CV
TL;DR: This paper proposes DHQ, a novel quantization scheme for INRs that targets both weights and activations, offering significant hardware efficiency improvements over previous methods.
Details
Motivation: INRs depend on full-precision number representation, resulting in significant hardware overhead. Previous INR quantization approaches have offered limited hardware savings due to the lack of activation quantization. Method: Proposed DHQ, a distribution-aware Hadamard quantization scheme, to standardize diverse distributions of weights and activations in INRs before applying a standard quantizer. Result: Experiments showed that DHQ outperforms previous quantization methods, reducing latency by 32.7%, energy consumption by 40.1%, and resource utilization by up to 98.3% compared to full-precision counterparts. Conclusion: DHQ, a distribution-aware Hadamard quantization scheme for both weights and activations in INRs, offers significant hardware efficiency improvements, including reduced latency, energy consumption, and resource utilization. Abstract: Implicit Neural Representations (INRs) encode discrete signals using Multi-Layer Perceptrons (MLPs) with complex activation functions. While INRs achieve superior performance, they depend on full-precision number representation for accurate computation, resulting in significant hardware overhead. Previous INR quantization approaches have primarily focused on weight quantization, offering only limited hardware savings due to the lack of activation quantization. To fully exploit the hardware benefits of quantization, we propose DHQ, a novel distribution-aware Hadamard quantization scheme that targets both weights and activations in INRs. Our analysis shows that the weights in the first and last layers have distributions distinct from those in the intermediate layers, while the activations in the last layer differ significantly from those in the preceding layers. 
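The idea of standardizing a distribution with a Hadamard transform before quantizing can be sketched in plain Python. This is a minimal illustration under assumptions, not the DHQ implementation: the symmetric uniform quantizer, the bit width, and the example weight vector are all hypothetical stand-ins.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def quantize(x, bits=8):
    """Symmetric uniform quantizer (an assumed stand-in for a standard quantizer)."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in x) or 1.0
    step = peak / levels
    return [round(v / step) * step for v in x]

# Hypothetical weights in which a single outlier dominates the range.
weights = [4.0, 0.1, -0.2, 0.05, 0.0, -0.1, 0.3, 0.02]
transformed = fwht(weights)                      # outlier is spread over all coefficients
restored = fwht(quantize(transformed, bits=4))   # orthonormal WHT is its own inverse
```

Because the orthonormal transform is its own inverse, quantization can happen in the transformed domain, where values are more evenly spread, and the result can be mapped back with the same function.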
Instead of customizing quantizers individually, we utilize the Hadamard transformation to standardize these diverse distributions into a unified bell-shaped form, supported by both empirical evidence and theoretical analysis, before applying a standard quantizer. To demonstrate the practical advantages of our approach, we present an FPGA implementation of DHQ that highlights its hardware efficiency. Experiments on diverse image reconstruction tasks show that DHQ outperforms previous quantization methods, reducing latency by 32.7%, energy consumption by 40.1%, and resource utilization by up to 98.3% compared to full-precision counterparts.
[54] AIM 2025 challenge on Inverse Tone Mapping Report: Methods and Results
Chao Wang,Francesco Banterle,Bin Ren,Radu Timofte,Xin Lu,Yufeng Peng,Chengjie Ge,Zhijing Sun,Ziang Zhou,Zihao Li,Zishun Liao,Qiyu Kang,Xueyang Fu,Zheng-Jun Zha,Zhijing Sun,Xingbo Wang,Kean Liu,Senyan Xu,Yang Qiu,Yifan Ding,Gabriel Eilertsen,Jonas Unger,Zihao Wang,Ke Wu,Jinshan Pan,Zhen Liu,Zhongyang Li,Shuaicheng Liu,S. M Nadim Uddin
Main category: cs.CV
TL;DR: The AIM 2025 Challenge on Inverse Tone Mapping analyzed top-performing algorithms for HDR image reconstruction, achieving a lowest PU21-PSNR of 29.22 dB, and established benchmarks for future research.
Details
Motivation: The challenge aimed to advance the development of effective ITM algorithms for HDR image reconstruction from single LDR inputs, focusing on perceptual fidelity and numerical consistency. Method: A comprehensive review and analysis of the methodologies and performance of the top five teams from the AIM 2025 Challenge on Inverse Tone Mapping was conducted. Result: A total of 67 participants submitted 319 valid results, with the lowest PU21-PSNR among the top entries reaching 29.22 dB. Conclusion: The analysis of the AIM 2025 Challenge on Inverse Tone Mapping highlights innovative strategies for enhancing HDR reconstruction quality and establishes strong benchmarks for future research. Abstract: This paper presents a comprehensive review of the AIM 2025 Challenge on Inverse Tone Mapping (ITM). The challenge aimed to push forward the development of effective ITM algorithms for HDR image reconstruction from single LDR inputs, focusing on perceptual fidelity and numerical consistency. A total of 67 participants submitted 319 valid results, from which the best five teams were selected for detailed analysis. This report consolidates their methodologies and performance, with the lowest PU21-PSNR among the top entries reaching 29.22 dB. The analysis highlights innovative strategies for enhancing HDR reconstruction quality and establishes strong benchmarks to guide future research in inverse tone mapping.
[55] Enhancing Robustness of Implicit Neural Representations Against Weight Perturbations
Wenyong Zhou,Yuxin Cheng,Zhengwu Liu,Taiqiang Wu,Chen Zhang,Ngai Wong
Main category: cs.CV
TL;DR: This paper addresses the vulnerability of Implicit Neural Representations (INRs) to weight perturbations, proposing a novel robust loss function that significantly enhances robustness and achieves up to a 7.5 dB improvement in PSNR under noisy conditions.
Details
Motivation: The motivation is to address the critical challenge of the vulnerability of INRs to weight perturbations, which leads to significant performance degradation in signal reconstruction and hinders their real-world deployment. Method: The authors formulate the robustness problem in INRs by minimizing the difference between loss with and without weight perturbations and derive a novel robust loss function to regulate the gradient of the reconstruction loss with respect to weights. Result: Extensive experiments show that the proposed method achieves up to a 7.5 dB improvement in peak signal-to-noise ratio (PSNR) values compared to original INRs under noisy conditions. Conclusion: The work introduces a novel robust loss function that enhances the robustness of Implicit Neural Representations (INRs) by regulating the gradient of the reconstruction loss with respect to weights, significantly improving performance under noisy conditions. Abstract: Implicit Neural Representations (INRs) encode discrete signals in a continuous manner using neural networks, demonstrating significant value across various multimedia applications. However, the vulnerability of INRs presents a critical challenge for their real-world deployments, as the network weights might be subjected to unavoidable perturbations. In this work, we investigate the robustness of INRs for the first time and find that even minor perturbations can lead to substantial performance degradation in the quality of signal reconstruction. To mitigate this issue, we formulate the robustness problem in INRs by minimizing the difference between loss with and without weight perturbations. Furthermore, we derive a novel robust loss function to regulate the gradient of the reconstruction loss with respect to weights, thereby enhancing the robustness. 
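One way to read the formulation above is through a first-order Taylor expansion. The following is a sketch under assumptions, not the paper's derivation: $\mathcal{L}$ denotes the reconstruction loss, $w$ the weights, $\delta$ a small weight perturbation, and $\lambda$ a hypothetical trade-off weight; the exact form of the robust loss is not given in the abstract.

```latex
\mathcal{L}(w+\delta) - \mathcal{L}(w) \approx \nabla_w \mathcal{L}(w)^{\top} \delta
\quad\Longrightarrow\quad
\bigl|\mathcal{L}(w+\delta) - \mathcal{L}(w)\bigr| \lesssim \|\nabla_w \mathcal{L}(w)\|_2 \, \|\delta\|_2 ,
\qquad
\mathcal{L}_{\text{robust}}(w) = \mathcal{L}(w) + \lambda \, \|\nabla_w \mathcal{L}(w)\|_2^2 .
```

Under this reading, keeping the gradient norm small limits how much a small weight perturbation can change the reconstruction loss.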
Extensive experiments on reconstruction tasks across multiple modalities demonstrate that our method achieves up to a 7.5 dB improvement in peak signal-to-noise ratio (PSNR) values compared to original INRs under noisy conditions.
[56] FAMNet: Integrating 2D and 3D Features for Micro-expression Recognition via Multi-task Learning and Hierarchical Attention
Liangyu Fu,Xuecheng Wu,Danlei Huang,Xinyi Yin
Main category: cs.CV
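TL;DR: This paper proposes FAMNet, a new micro-expression recognition method that combines 2D and 3D convolutional neural networks and uses multi-task learning and attention mechanisms to extract spatiotemporal features of micro-expressions, significantly improving recognition performance on multiple datasets.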
Details
Motivation: Micro-expression recognition has important applications in many fields, but the short duration and low intensity of micro-expressions make it difficult. Existing deep learning methods mainly use static images, dynamic image sequences, or a combination of the two, yet effectively extracting fine-grained spatiotemporal features of micro-expressions remains a challenge. Method: The paper proposes FAMNet, a new micro-expression recognition method based on multi-task learning and hierarchical attention. It fuses 2D and 3D convolutional neural networks (AMNet2D and AMNet3D) and uses hard parameter sharing to jointly train on micro-expression recognition (MER) and facial action unit detection (FAUD). Result: Extensive experiments were run on the SAMM, CASME II, MMEW, and CAS(ME)$^3$ datasets. FAMNet achieves 83.75% (UAR) and 84.03% (UF1) on the SAMM, CASME II and MMEW datasets, and 51% (UAR) and 43.42% (UF1) on CAS(ME)$^3$. Conclusion: FAMNet significantly improves micro-expression recognition through multi-task learning and hierarchical attention, reaching high UAR and UF1 scores on several datasets. Abstract: Micro-expression recognition (MER) has essential application value in many fields, but the short duration and low intensity of micro-expressions (MEs) bring considerable challenges to MER. The current MER methods in deep learning mainly include three data loading methods: static images, dynamic image sequences, and a combination of the two streams. How to effectively extract MEs' fine-grained and spatiotemporal features has been difficult to solve. This paper proposes a new MER method based on multi-task learning and hierarchical attention, which fully extracts MEs' omni-directional features by merging 2D and 3D CNNs. The fusion model consists of a 2D CNN AMNet2D and a 3D CNN AMNet3D, with similar structures consisting of a shared backbone network ResNet18 and attention modules. During training, the model adopts different data loading methods to adapt to the two specific networks respectively, jointly trains on the tasks of MER and facial action unit detection (FAUD), and adopts hard parameter sharing for information association, which further improves the effect of the MER task; the final fused model is called FAMNet. Extensive experimental results show that our proposed FAMNet significantly improves task performance. On the SAMM, CASME II and MMEW datasets, FAMNet achieves 83.75% (UAR) and 84.03% (UF1). Furthermore, on the challenging CAS(ME)$^3$ dataset, FAMNet achieves 51% (UAR) and 43.42% (UF1).
[57] RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
Tianyi Niu,Jaemin Cho,Elias Stengel-Eskin,Mohit Bansal
Main category: cs.CV
TL;DR: This study evaluates MLLMs' ability to identify rotated images using a new benchmark, RotBench. Despite their advanced capabilities, MLLMs struggle with spatial reasoning, especially distinguishing between 90° and 270° rotations, revealing a gap compared to human perception.
Details
Motivation: The task of identifying image orientation requires robust visual reasoning, which is crucial for MLLMs. Understanding the limitations of current models in this relatively simple task can highlight gaps in spatial reasoning capabilities. Method: The study introduces RotBench, a 350-image benchmark with lifestyle, portrait, and landscape images, to evaluate MLLMs' ability to identify rotations (0°, 90°, 180°, 270°). The study tests state-of-the-art models like GPT-5, o3, and Gemini-2.5-Pro, with additional experiments involving auxiliary information, chain-of-thought prompting, simultaneous orientation display, voting mechanisms, and fine-tuning. Result: State-of-the-art MLLMs do not reliably identify image rotations, especially between 90° and 270°. Models perform better with right-side-up (0°) images, and some identify upside-down (180°) images reliably. Auxiliary information and prompting techniques offer small and inconsistent improvements. Fine-tuning improves 180° identification but not 90° or 270°. Performance improves slightly when models see simultaneous orientations or use a voting setup. Conclusion: MLLMs have a significant gap in spatial reasoning capabilities compared to human perception in identifying image rotations, with most models struggling to reliably distinguish between 90° and 270° rotations. Abstract: We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0°, 90°, 180°, and 270°. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images.
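The evaluation setup above, four rotations per image scored separately, can be sketched as follows. This is a toy illustration and not the RotBench code: images are stood in for by small 2D lists, the counter-clockwise rotation direction is an assumption, and the prediction records are made up.

```python
from collections import Counter

ANGLES = (0, 90, 180, 270)

def rotate_ccw(grid, angle):
    """Rotate a 2D list counter-clockwise by a multiple of 90 degrees."""
    for _ in range((angle // 90) % 4):
        # Transpose, then reverse the row order: one 90-degree CCW turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid

def per_angle_accuracy(records):
    """Accuracy per true angle from (true_angle, predicted_angle) pairs."""
    hits, totals = Counter(), Counter()
    for true, pred in records:
        totals[true] += 1
        hits[true] += int(true == pred)
    return {a: hits[a] / totals[a] for a in ANGLES if totals[a]}

# Hypothetical predictions in which 90 and 270 degrees get confused.
records = [(0, 0), (90, 270), (90, 90), (180, 180), (270, 90)]
accuracy = per_angle_accuracy(records)
```

Reporting accuracy per angle rather than one pooled number is what makes a 90° versus 270° confusion visible, since strong results on 0° images would otherwise mask it.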
Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0°) images, while certain models are able to identify upside-down (180°) images. None can reliably distinguish between 90° and 270°. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90° and 270° rotations, despite substantially improving the identification of 180° images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
[58] CORENet: Cross-Modal 4D Radar Denoising Network with LiDAR Supervision for Autonomous Driving
Fuyang Liu,Jilin Mei,Fangyuan Mao,Chen Min,Yan Xing,Yu Hu
Main category: cs.CV
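TL;DR: The paper proposes CORENet, a cross-modal denoising framework that uses LiDAR supervision to make perception from 4D radar data more effective and robust.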
Details
Motivation: The sparse and noisy nature of 4D radar point clouds poses substantial challenges for effective perception; CORENet is proposed to address this. Method: CORENet is designed as a plug-and-play architecture that integrates seamlessly into voxel-based detection frameworks without modifying existing pipelines. Result: Extensive evaluation confirms that CORENet improves detection robustness and outperforms existing mainstream approaches. Conclusion: CORENet is a novel cross-modal denoising framework that uses LiDAR supervision to identify noise patterns and extract discriminative features from raw 4D radar data. Abstract: 4D radar-based object detection has garnered great attention for its robustness in adverse weather conditions and capacity to deliver rich spatial information across diverse driving scenarios. Nevertheless, the sparse and noisy nature of 4D radar point clouds poses substantial challenges for effective perception. To address the limitation, we present CORENet, a novel cross-modal denoising framework that leverages LiDAR supervision to identify noise patterns and extract discriminative features from raw 4D radar data. Designed as a plug-and-play architecture, our solution enables seamless integration into voxel-based detection frameworks without modifying existing pipelines. Notably, the proposed method only utilizes LiDAR data for cross-modal supervision during training while maintaining full radar-only operation during inference. Extensive evaluation on the challenging Dual-Radar dataset, which is characterized by elevated noise level, demonstrates the effectiveness of our framework in enhancing detection robustness. Comprehensive experiments validate that CORENet achieves superior performance compared to existing mainstream approaches.
[59] Multi-view Clustering via Bi-level Decoupling and Consistency Learning
Shihao Dong,Yuhui Zheng,Huiying Xu,Xinzhong Zhu
Main category: cs.CV
TL;DR: This paper proposes a Bi-level Decoupling and Consistency Learning (BDCL) framework for multi-view clustering, effectively enhancing clustering performance by leveraging consistency and complementarity among views through a novel cluster-oriented representation learning approach.
Details
Motivation: Multi-view clustering can benefit from exploring both consistency and complementarity among views, but existing methods often neglect cluster-oriented representation learning, which is crucial for enhancing clustering performance. Method: The BDCL framework incorporates three modules: multi-view instance learning via reconstruction autoencoder and contrastive learning, bi-level decoupling of features and clusters, and consistency learning through clustering assignment consistency among positive pairs. Result: Experimental results on five benchmark datasets demonstrate that the BDCL framework outperforms state-of-the-art methods in multi-view clustering. Conclusion: The proposed BDCL framework effectively enhances multi-view clustering by improving inter-cluster discriminability and intra-cluster compactness, showing superior performance over state-of-the-art methods. Abstract: Multi-view clustering has been shown to be an effective method for analyzing underlying patterns in multi-view data. The performance of clustering can be improved by learning the consistency and complementarity between multi-view features; however, cluster-oriented representation learning is often overlooked. In this paper, we propose a novel Bi-level Decoupling and Consistency Learning framework (BDCL) to further explore the effective representation for multi-view data to enhance inter-cluster discriminability and intra-cluster compactness of features in multi-view clustering. Our framework comprises three modules: 1) The multi-view instance learning module aligns the consistent information while preserving the private features between views through reconstruction autoencoder and contrastive learning. 2) The bi-level decoupling of features and clusters enhances the discriminability of feature space and cluster space. 3) The consistency learning module treats the different views of the sample and their neighbors as positive pairs, learns the consistency of their clustering assignments, and further compresses the intra-cluster space. Experimental results on five benchmark datasets demonstrate the superiority of the proposed method compared with the SOTA methods. Our code is published on https://github.com/LouisDong95/BDCL.
[60] AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes
Tianyi Xu,Fan Zhang,Boxin Shi,Tianfan Xue,Yujin Wang
Main category: cs.CV
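TL;DR: This paper proposes AdaptiveAE, a new HDR imaging method that uses reinforcement learning to adaptively select the best combination of shutter speed and ISO under a user-defined exposure time budget, improving HDR image quality in dynamic scenes.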
Details
Motivation: Existing HDR methods typically overlook the complex interaction between shutter speed and ISO and do not account for motion blur in dynamic scenes, which makes high-quality HDR images hard to achieve. Method: AdaptiveAE uses a reinforcement learning-based approach and is trained with an image synthesis pipeline that includes motion blur and noise simulation. Result: Experiments show that AdaptiveAE performs strongly across multiple datasets and achieves state-of-the-art performance. Conclusion: By optimizing shutter speed and ISO combinations, AdaptiveAE maximizes HDR image quality in dynamic environments and outperforms traditional methods. Abstract: Mainstream high dynamic range imaging techniques typically rely on fusing multiple images captured with different exposure setups (shutter speed and ISO). A good balance between shutter speed and ISO is crucial for achieving high-quality HDR, as high ISO values introduce significant noise, while long shutter speeds can lead to noticeable motion blur. However, existing methods often overlook the complex interaction between shutter speed and ISO and fail to account for motion blur effects in dynamic scenes. In this work, we propose AdaptiveAE, a reinforcement learning-based method that optimizes the selection of shutter speed and ISO combinations to maximize HDR reconstruction quality in dynamic environments. AdaptiveAE integrates an image synthesis pipeline that incorporates motion blur and noise simulation into our training procedure, leveraging semantic information and exposure histograms. It can adaptively select optimal ISO and shutter speed sequences based on a user-defined exposure time budget, and find a better exposure schedule than traditional solutions. Experimental results across multiple datasets demonstrate that it achieves state-of-the-art performance.
[61] Bridging the Gap: Doubles Badminton Analysis with Singles-Trained Models
Seungheon Baek,Jinhyuk Yun
Main category: cs.CV
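TL;DR: This paper proposes a method for applying singles-trained models to doubles analysis, addressing the challenges of data availability and multi-person tracking and laying a foundation for doubles badminton analysis.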
Details
Motivation: Although doubles matches are more common than singles in international tournaments, prior research has focused mainly on singles because of challenges in data availability and multi-person tracking. Method: Keypoints are extracted from the ShuttleSet singles match dataset with ViT-Pose and embedded through an ST-GCN-based contrastive learning framework. A custom multi-object tracking algorithm is introduced to resolve ID switching caused by fast, overlapping player movements. A Transformer-based classifier then determines shot occurrences from the learned embeddings. Result: The work demonstrates the feasibility of extending pose-based shot recognition to doubles badminton, broadening analytics capabilities. Conclusion: This work lays a foundation for doubles-specific datasets to improve understanding of this predominant yet understudied format of the fast racket sport. Abstract: Badminton is known as one of the fastest racket sports in the world. Despite doubles matches being more prevalent in international tournaments than singles, previous research has mainly focused on singles due to the challenges in data availability and multi-person tracking. To address this gap, we designed an approach that transfers singles-trained models to doubles analysis. We extracted keypoints from the ShuttleSet single matches dataset using ViT-Pose and embedded them through a contrastive learning framework based on ST-GCN. To improve tracking stability, we incorporated a custom multi-object tracking algorithm that resolves ID switching issues from fast and overlapping player movements. A Transformer-based classifier then determines shot occurrences based on the learned embeddings. Our findings demonstrate the feasibility of extending pose-based shot recognition to doubles badminton, broadening analytics capabilities. This work establishes a foundation for doubles-specific datasets to enhance understanding of this predominant yet understudied format of the fast racket sport.
[62] 2D Gaussians Meet Visual Tokenizer
Yiang Shi,Xiaoyang Guo,Wei Yin,Mingkai Jia,Qian Zhang,Xiaolin Hu,Wenyu Liu,Xinggang Wan
Main category: cs.CV
TL;DR: VGQ is a new image tokenizer that improves image reconstruction by incorporating geometric structure modeling using 2D Gaussians, achieving superior performance on benchmark tests.
Details
Motivation: Existing quantization-based tokenizers like VQ-GAN focus on appearance features such as texture and color while neglecting geometric structures due to their patch-based design. This work aims to incorporate more visual information, particularly structural details, into the tokenizer. Method: VGQ integrates 2D Gaussians into traditional visual codebook quantization frameworks to enhance structural modeling, encoding image latents as 2D Gaussian distributions that capture position, rotation, and scale. Result: VGQ achieves a state-of-the-art reconstruction rFID score of 0.556 and a PSNR of 24.93 on the ImageNet 256x256 benchmark, significantly outperforming existing methods. Conclusion: The proposed Visual Gaussian Quantization (VGQ) approach outperforms existing methods in image reconstruction by explicitly modeling geometric and spatial structures through 2D Gaussian distributions. Abstract: The image tokenizer is a critical component in AR image generation, as it determines how rich and structured visual content is encoded into compact representations. Existing quantization-based tokenizers such as VQ-GAN primarily focus on appearance features like texture and color, often neglecting geometric structures due to their patch-based design. In this work, we explored how to incorporate more visual information into the tokenizer and proposed a new framework named Visual Gaussian Quantization (VGQ), a novel tokenizer paradigm that explicitly enhances structural modeling by integrating 2D Gaussians into traditional visual codebook quantization frameworks. Our approach addresses the inherent limitations of naive quantization methods such as VQ-GAN, which struggle to model structured visual information due to their patch-based design and emphasis on texture and color. 
In contrast, VGQ encodes image latents as 2D Gaussian distributions, effectively capturing geometric and spatial structures by directly modeling structure-related parameters such as position, rotation and scale. We further demonstrate that increasing the density of 2D Gaussians within the tokens leads to significant gains in reconstruction fidelity, providing a flexible trade-off between token efficiency and visual richness. On the ImageNet 256x256 benchmark, VGQ achieves strong reconstruction quality with an rFID score of 1.00. Furthermore, by increasing the density of 2D Gaussians within the tokens, VGQ gains a significant boost in reconstruction capability and achieves a state-of-the-art reconstruction rFID score of 0.556 and a PSNR of 24.93, substantially outperforming existing methods. Codes will be released soon.
[63] Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency
Yanbiao Ma,Wei Dai,Bowei Liu,Jiayi Chen,Wenke Huang,Guancheng Wan,Zhiwu Lu,Junchi Yan
Main category: cs.CV
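TL;DR: This paper proposes a geometric knowledge-guided distribution calibration framework that uses foundation models for feature extraction to address data heterogeneity and sample imbalance, and performs well in federated learning and long-tailed recognition.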
Details
Motivation: A standing challenge in deep learning is the gap between the observed training samples and the underlying true distribution, caused by factors such as sampling bias and noise. Method: Foundation models (e.g., CLIP, DINOv2) are used for feature extraction, and a geometric knowledge-guided distribution calibration framework is proposed and applied in two settings: federated learning and long-tailed recognition. Result: In federated learning, the method acquires the global geometric shape under privacy constraints and generates new samples to bridge the gap between local and global observations; in long-tailed learning, it transfers geometric knowledge to recover the true distribution of sample-scarce classes. Comprehensive experiments show that the framework improves performance. Conclusion: The geometric knowledge-guided distribution calibration framework effectively overcomes information deficits caused by data heterogeneity and sample imbalance, improving performance across multiple benchmarks. Abstract: Despite the fast progress of deep learning, one standing challenge is the gap between the observed training samples and the underlying true distribution. This gap has multiple causes, e.g., sampling bias and noise. In the era of foundation models, we show that when leveraging the off-the-shelf (vision) foundation models (e.g., CLIP, DINOv2) for feature extraction, the geometric shapes of the resulting feature distributions exhibit remarkable transferability across domains and datasets. To verify its practical usefulness, we embody our geometric knowledge-guided distribution calibration framework in two popular and challenging settings: federated learning and long-tailed recognition. In the federated setting, we devise a technique of acquiring the global geometric shape under privacy constraints, then leverage this knowledge to generate new samples for clients, with the aim of bridging the gap between local and global observations. In long-tailed learning, it utilizes the geometric knowledge transferred from sample-rich categories to recover the true distribution for sample-scarce tail classes. Comprehensive experiments show that our proposed geometric knowledge-guided distribution calibration effectively overcomes information deficits caused by data heterogeneity and sample imbalance, with boosted performance across benchmarks.
[64] Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models
Vamsi Krishna Mulukutla,Sai Supriya Pavarala,Srinivasa Raju Rudraraju,Sridevi Bonthu
Main category: cs.CV
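TL;DR: The study compares traditional models with vision-language models on facial emotion recognition and finds that traditional models perform better, highlighting the need to adapt vision-language models to noisy conditions.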
Details
Motivation: Facial emotion recognition is crucial for human-computer interaction and mental health diagnostics, but how current vision-language models perform on low-quality images remains unclear. Method: The study empirically compares traditional deep learning models with open-source vision-language models and proposes a novel pipeline that incorporates GFPGAN-based image restoration. Result: EfficientNet-B0 (86.44%) and ResNet-50 (85.72%) significantly outperform CLIP (64.07%) and Phi-3.5 Vision (51.66%), indicating the limitations of vision-language models in low-quality visual tasks. Conclusion: Traditional models significantly outperform existing open-source vision-language models (VLMs) on emotion recognition from low-quality images, underscoring the need to improve VLMs for noisy environments. Abstract: Facial Emotion Recognition (FER) is crucial for applications such as human-computer interaction and mental health diagnostics. This study presents the first empirical comparison of open-source Vision-Language Models (VLMs), including Phi-3.5 Vision and CLIP, against traditional deep learning models VGG19, ResNet-50, and EfficientNet-B0 on the challenging FER-2013 dataset, which contains 35,887 low-resolution grayscale images across seven emotion classes. To address the mismatch between VLM training assumptions and the noisy nature of FER data, we introduce a novel pipeline that integrates GFPGAN-based image restoration with FER evaluation. Results show that traditional models, particularly EfficientNet-B0 (86.44%) and ResNet-50 (85.72%), significantly outperform VLMs like CLIP (64.07%) and Phi-3.5 Vision (51.66%), highlighting the limitations of VLMs in low-quality visual tasks. In addition to performance evaluation using precision, recall, F1-score, and accuracy, we provide a detailed computational cost analysis covering preprocessing, training, inference, and evaluation phases, offering practical insights for deployment. This work underscores the need for adapting VLMs to noisy environments and provides a reproducible benchmark for future research in emotion recognition.
[65] EAvatar: Expression-Aware Head Avatar Reconstruction with Generative Geometry Priors
Shikun Zhang,Cunjian Chen,Yiqun Wang,Qiuhong Ke,Yong Li
Main category: cs.CV
TL;DR: EAvatar improves head avatar reconstruction by combining expression-aware deformation and high-quality 3D priors for better accuracy and detail.
Details
Motivation: Existing 3DGS-based methods struggle with fine facial expressions and texture continuity in deformable regions. Method: EAvatar uses sparse expression control and leverages high-quality 3D priors from generative models for deformation and texture improvements. Result: The proposed method achieves more accurate and visually coherent reconstructions compared to existing approaches. Conclusion: EAvatar provides improved accuracy and visual coherence in head avatar reconstruction with better expression controllability. Abstract: High-fidelity head avatar reconstruction plays a crucial role in AR/VR, gaming, and multimedia content creation. Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated effectiveness in modeling complex geometry with real-time rendering capability and are now widely used in high-fidelity head avatar reconstruction tasks. However, existing 3DGS-based methods still face significant challenges in capturing fine-grained facial expressions and preserving local texture continuity, especially in highly deformable regions. To mitigate these limitations, we propose a novel 3DGS-based framework termed EAvatar for head reconstruction that is both expression-aware and deformation-aware. Our method introduces a sparse expression control mechanism, where a small number of key Gaussians are used to influence the deformation of their neighboring Gaussians, enabling accurate modeling of local deformations and fine-scale texture transitions. Furthermore, we leverage high-quality 3D priors from pretrained generative models to provide a more reliable facial geometry, offering structural guidance that improves convergence stability and shape accuracy during training. Experimental results demonstrate that our method produces more accurate and visually coherent head reconstructions with improved expression controllability and detail fidelity.
[66] FLAIR: Frequency- and Locality-Aware Implicit Neural Representations
Sukhun Ko,Dahyeon Kye,Kyle Min,Chanho Eom,Jihyong Oh
Main category: cs.CV
TL;DR: This paper proposes FLAIR, a method to improve Implicit Neural Representations (INRs) by introducing frequency selectivity, spatial localization, and sparse representations, which leads to better performance in various vision tasks.
Details
Motivation: Existing Implicit Neural Representations (INRs) lack frequency selectivity, spatial localization, and sparse representations, leading to spectral bias and an inability to capture fine high-frequency details. Method: The proposed method, FLAIR, uses RC-GAUSS activation for explicit frequency selection and spatial localization, and Wavelet-Energy-Guided Encoding (WEGE) to guide frequency information using discrete wavelet transform (DWT). Result: The proposed FLAIR method consistently outperforms existing INRs in 2D image representation and restoration, as well as 3D reconstruction. Conclusion: FLAIR successfully addresses the limitations of existing INRs by incorporating frequency selectivity, spatial localization, and sparse representations, leading to improved performance in various tasks. Abstract: Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity, spatial localization, and sparse representations, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is RC-GAUSS, a novel activation designed for explicit frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform (DWT) to compute energy scores and explicitly guide frequency information to the network. 
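To make the idea of DWT-based energy scores concrete, here is a minimal sketch, not the authors' WEGE implementation. It assumes a single-level orthonormal Haar transform and defines each score as a subband's share of the total signal energy; the function names `haar_dwt2` and `subband_energy_scores` and the subband labels are hypothetical choices made for illustration.

```python
import numpy as np


def haar_dwt2(image):
    """One level of an orthonormal 2D Haar DWT.

    Returns four subbands (LL, LH, HL, HH). Here LH is low-pass along
    columns of pixels within a row and high-pass across rows; naming
    conventions vary between libraries.
    """
    x = np.asarray(image, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even height and width")
    # Filter horizontally: combine each pair of adjacent columns.
    lo = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2.0)
    hi = (x[:, 0::2] - x[:, 1::2]) / np.sqrt(2.0)
    # Filter vertically: combine each pair of adjacent rows.
    ll = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2.0)
    lh = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2.0)
    hl = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2.0)
    hh = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2.0)
    return ll, lh, hl, hh


def subband_energy_scores(image):
    """Fraction of total energy in each Haar subband (assumed score definition)."""
    bands = dict(zip(("LL", "LH", "HL", "HH"), haar_dwt2(image)))
    energy = {name: float(np.sum(band ** 2)) for name, band in bands.items()}
    total = sum(energy.values())
    if total == 0.0:
        return {name: 0.0 for name in energy}
    return {name: value / total for name, value in energy.items()}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(subband_energy_scores(rng.normal(size=(8, 8))))
```

Because the Haar transform used here is orthonormal, the subband energies sum to the energy of the input, so the scores form a distribution over frequency bands. How such scores are fed to the network is specific to the paper and is not reproduced here.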
Our method consistently outperforms existing INRs in 2D image representation and restoration, as well as 3D reconstruction.
[67] GazeProphet: Software-Only Gaze Prediction for VR Foveated Rendering
Farhaan Ebadulla,Chiraag Mudlapur,Gaurav BV
Main category: cs.CV
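TL;DR: This paper proposes GazeProphet, a gaze prediction method that needs no eye-tracking hardware; it combines a Spherical Vision Transformer with an LSTM temporal encoder for foveated rendering in VR, enabling broader applicability and performance gains.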
Details
Motivation: Current foveated rendering approaches require expensive hardware-based eye tracking systems, whose cost, calibration complexity, and hardware compatibility constraints limit widespread adoption. Method: The approach combines a Spherical Vision Transformer for processing 360-degree VR scenes with an LSTM-based temporal encoder, and uses a multi-modal fusion network to integrate spatial scene features with temporal gaze dynamics to predict future gaze locations with associated confidence estimates. Result: Experimental evaluation shows that GazeProphet achieves a median angular error of 3.83 degrees, a 24% improvement over traditional saliency-based baselines, while providing reliable confidence calibration and consistent performance across spatial regions and scene types. Conclusion: GazeProphet is a software-only approach to predicting gaze locations in VR without dedicated eye tracking hardware; it enables practical deployment and makes the performance gains more accessible to different VR platforms and applications. Abstract: Foveated rendering significantly reduces computational demands in virtual reality applications by concentrating rendering quality where users focus their gaze. Current approaches require expensive hardware-based eye tracking systems, limiting widespread adoption due to cost, calibration complexity, and hardware compatibility constraints. This paper presents GazeProphet, a software-only approach for predicting gaze locations in VR environments without requiring dedicated eye tracking hardware. The approach combines a Spherical Vision Transformer for processing 360-degree VR scenes with an LSTM-based temporal encoder that captures gaze sequence patterns. A multi-modal fusion network integrates spatial scene features with temporal gaze dynamics to predict future gaze locations with associated confidence estimates. Experimental evaluation on a comprehensive VR dataset demonstrates that GazeProphet achieves a median angular error of 3.83 degrees, outperforming traditional saliency-based baselines by 24% while providing reliable confidence calibration. The approach maintains consistent performance across different spatial regions and scene types, enabling practical deployment in VR systems without additional hardware requirements. Statistical analysis confirms the significance of improvements across all evaluation metrics. These results show that software-only gaze prediction can work for VR foveated rendering, making this performance boost more accessible to different VR platforms and apps.
[68] A Lightweight Dual-Mode Optimization for Generative Face Video Coding
Zihan Zhang,Shanzhi Yin,Bolin Chen,Ru-Ling Liao,Shiqi Wang,Yan Ye
Main category: cs.CV
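TL;DR: This paper proposes a lightweight Generative Face Video Coding (GFVC) framework whose dual-mode optimization (architectural redesign and operational refinement) substantially reduces computational complexity while preserving high-quality reconstruction, making it suitable for deployment on resource-constrained devices.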
Details
Motivation: Existing Generative Face Video Coding (GFVC) is hard to deploy in practice because of large model parameters and high computational costs, so a lightweight solution is needed. Method: The paper adopts dual-mode optimization, comprising architectural redesign and operational refinement. Architecturally, traditional 3 x 3 convolutions are replaced with more efficient layers; operationally, a two-stage adaptive channel pruning strategy combines soft pruning during training with hard pruning after training. Result: Experiments show that, compared with the baseline, the method achieves a 90.4% parameter reduction and an 88.9% computation saving, and outperforms the state-of-the-art video coding standard VVC on perceptual quality metrics. Conclusion: The paper proposes a lightweight GFVC framework whose dual-mode optimization substantially reduces parameters and computational cost while preserving reconstruction quality, enabling efficient GFVC deployment in resource-constrained environments. Abstract: Generative Face Video Coding (GFVC) achieves superior rate-distortion performance by leveraging the strong inference capabilities of deep generative models. However, its practical deployment is hindered by large model parameters and high computational costs. To address this, we propose a lightweight GFVC framework that introduces dual-mode optimization -- combining architectural redesign and operational refinement -- to reduce complexity whilst preserving reconstruction quality. Architecturally, we replace traditional 3 x 3 convolutions with slimmer and more efficient layers, reducing complexity without compromising feature expressiveness. Operationally, we develop a two-stage adaptive channel pruning strategy: (1) soft pruning during training identifies redundant channels via learnable thresholds, and (2) hard pruning permanently eliminates these channels post-training using a derived mask. This dual-phase approach ensures both training stability and inference efficiency. Experimental results demonstrate that the proposed lightweight dual-mode optimization for GFVC can achieve 90.4% parameter reduction and 88.9% computation saving compared to the baseline, whilst achieving superior performance compared to state-of-the-art video coding standard Versatile Video Coding (VVC) in terms of perceptual-level quality metrics. As such, the proposed method is expected to enable efficient GFVC deployment in resource-constrained environments such as mobile edge devices.
[69] Color Spike Data Generation via Bio-inspired Neuron-like Encoding with an Artificial Photoreceptor Layer
Hsieh Ching-Teng,Wang Yuan-Kai
Main category: cs.CV
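TL;DR: This paper proposes a neuron-like encoding method based on the operating principles of biological neurons; combined with an artificial photoreceptor layer, it generates spike data carrying color and luminance information and improves spiking neural network performance while adhering to the principles of neuromorphic computing.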
Details
Motivation: The performance of spiking neural networks lags behind that of convolutional neural networks, mainly because of the limited information capacity of spike-based data. Existing approaches that improve performance with non-spiking inputs (such as static images) deviate from the original intent of neuromorphic computing, namely spike-based information processing. Method: A Neuron-like Encoding method is proposed that generates spike data by emulating the intrinsic operational principles and functions of biological neurons, and an artificial photoreceptor layer is introduced so that the spike data carries both color and luminance information, forming a complete visual spike signal. Result: Experiments with the Integrate-and-Fire neuron model show that this biologically inspired approach effectively increases the information content of spike signals and improves spiking neural network performance while adhering to neuromorphic principles. Conclusion: The approach has strong potential for further development and may help overcome current limitations in neuromorphic computing, promoting broader application of spiking neural networks. Abstract: In recent years, neuromorphic computing and spiking neural networks (SNNs) have advanced rapidly through integration with deep learning. However, the performance of SNNs still lags behind that of convolutional neural networks (CNNs), primarily due to the limited information capacity of spike-based data. Although some studies have attempted to improve SNN performance by training them with non-spiking inputs such as static images, this approach deviates from the original intent of neuromorphic computing, which emphasizes spike-based information processing. To address this issue, we propose a Neuron-like Encoding method that generates spike data based on the intrinsic operational principles and functions of biological neurons. This method is further enhanced by the incorporation of an artificial photoreceptor layer, enabling spike data to carry both color and luminance information, thereby forming a complete visual spike signal. Experimental results using the Integrate-and-Fire neuron model demonstrate that this biologically inspired approach effectively increases the information content of spike signals and improves SNN performance, all while adhering to neuromorphic principles. We believe this concept holds strong potential for future development and may contribute to overcoming current limitations in neuromorphic computing, facilitating broader applications of SNNs.
[70] DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup
Zhen Qu,Xian Tao,Xinyi Gong,ShiChen Qu,Xiaopei Zhang,Xingang Wang,Fei Shen,Zhengtao Zhang,Mukesh Prasad,Guiguang Ding
Main category: cs.CV
TL;DR: DictAS introduces a self-supervised dictionary lookup approach for few-shot anomaly segmentation, enabling cross-category generalization without retraining and improving performance over existing methods.
Details
Motivation: Existing vision-language models like CLIP rely heavily on prior knowledge of seen anomaly samples for cross-category generalization in few-shot anomaly segmentation. The authors aim to overcome this limitation by proposing a framework that works without retraining and requires only a few normal reference images. Method: DictAS consists of three components: Dictionary Construction using normal reference images, Dictionary Lookup via a sparse strategy, and Query Discrimination Regularization with Contrastive Query Constraint and Text Alignment Constraint to enhance anomaly detection. Result: Extensive experiments on seven public industrial and medical datasets show that DictAS consistently outperforms state-of-the-art few-shot anomaly segmentation methods. Conclusion: DictAS is a novel framework for few-shot anomaly segmentation that enables cross-category generalization without retraining, outperforming state-of-the-art methods by leveraging self-supervised learning and dictionary lookup capabilities. Abstract: Recent vision-language models (e.g., CLIP) have demonstrated remarkable class-generalizable ability to unseen classes in few-shot anomaly segmentation (FSAS), leveraging supervised prompt learning or fine-tuning on seen classes. However, their cross-category generalization largely depends on prior knowledge of real seen anomaly samples. In this paper, we propose a novel framework, namely DictAS, which enables a unified model to detect visual anomalies in unseen object categories without any retraining on the target data, only employing a few normal reference images as visual prompts. The insight behind DictAS is to transfer dictionary lookup capabilities to the FSAS task for unseen classes via self-supervised learning, instead of merely memorizing the normal and abnormal feature patterns from the training set. 
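The dictionary-lookup idea can be illustrated with a small sketch. This is not the DictAS implementation: it assumes the dictionary is simply a matrix of features from normal reference images, stands in top-k softmax retrieval for the paper's sparse lookup strategy, and scores a query by how poorly it can be reconstructed from the dictionary. The function names and the parameters `k` and `temperature` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def lookup_anomaly_scores(query, dictionary, k=2, temperature=0.1):
    """Score each query feature by how badly the dictionary retrieves it.

    query:      (Nq, D) features from the image under test
    dictionary: (Nd, D) features from normal reference images
    Returns an (Nq,) array; values near 0 mean the query was retrieved
    well, larger values suggest an anomaly.
    """
    q = l2_normalize(np.asarray(query, dtype=np.float64))
    d = l2_normalize(np.asarray(dictionary, dtype=np.float64))
    sim = q @ d.T  # cosine similarity between queries and dictionary entries
    k = max(1, min(k, d.shape[0]))
    # Sparse lookup (assumed form): keep only the k most similar entries.
    top_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    top_sim = np.take_along_axis(sim, top_idx, axis=1)
    # Softmax over the kept entries, shifted by the row max for stability.
    weights = np.exp((top_sim - top_sim.max(axis=1, keepdims=True)) / temperature)
    weights /= weights.sum(axis=1, keepdims=True)
    # Retrieved feature: weighted combination of the selected entries.
    retrieved = np.einsum("qk,qkd->qd", weights, d[top_idx])
    return 1.0 - np.sum(q * l2_normalize(retrieved), axis=1)


if __name__ == "__main__":
    normal = np.array([[1.0, 0.0], [0.0, 1.0]])
    queries = np.array([[1.0, 0.0], [-1.0, -1.0]])
    print(lookup_anomaly_scores(queries, normal, k=1))
```

In this toy setup, a query that matches a dictionary entry scores close to zero, while a query pointing away from every entry scores above one. The constraints the paper adds to make abnormal features harder to retrieve are not modeled here.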
Specifically, DictAS mainly consists of three components: (1) **Dictionary Construction** - to simulate the index and content of a real dictionary using features from normal reference images. (2) **Dictionary Lookup** - to retrieve queried region features from the dictionary via a sparse lookup strategy. When a query feature cannot be retrieved, it is classified as an anomaly. (3) **Query Discrimination Regularization** - to enhance anomaly discrimination by making abnormal features harder to retrieve from the dictionary. To achieve this, Contrastive Query Constraint and Text Alignment Constraint are further proposed. Extensive experiments on seven public industrial and medical datasets demonstrate that DictAS consistently outperforms state-of-the-art FSAS methods.
[71] Learnable SMPLify: A Neural Solution for Optimization-Free Human Pose Inverse Kinematics
Yuchen Yang,Linfeng Dong,Wei Wang,Zhihang Zhong,Xiao Sun
Main category: cs.CV
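TL;DR: This paper proposes Learnable SMPLify, which replaces traditional iterative optimization with a neural network and substantially improves the speed and generalization of 3D human pose and shape estimation while maintaining accuracy.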
Details
Motivation: SMPLify is robust for 3D human pose and shape estimation but computationally expensive. Inspired by the trend across domains of replacing iterative optimization with data-driven neural networks, the paper proposes Learnable SMPLify. Method: Iterative optimization is replaced with a single-pass regression model; the paper proposes a temporal sampling strategy and a human-centric normalization scheme, and uses residual learning to narrow the solution space. Result: Compared with SMPLify, runtime is nearly 200x faster; the method generalizes well to unseen 3DPW and RICH data and operates in a model-agnostic manner as a plug-in tool on LucidAction. Conclusion: Learnable SMPLify is a practical and simple baseline that substantially improves runtime while maintaining accuracy, with good generalization and model-agnostic plug-in capability. Abstract: In 3D human pose and shape estimation, SMPLify remains a robust baseline that solves inverse kinematics (IK) through iterative optimization. However, its high computational cost limits its practicality. Recent advances across domains have shown that replacing iterative optimization with data-driven neural networks can achieve significant runtime improvements without sacrificing accuracy. Motivated by this trend, we propose Learnable SMPLify, a neural framework that replaces the iterative fitting process in SMPLify with a single-pass regression model. The design of our framework targets two core challenges in neural IK: data construction and generalization. To enable effective training, we propose a temporal sampling strategy that constructs initialization-target pairs from sequential frames. To improve generalization across diverse motions and unseen poses, we propose a human-centric normalization scheme and residual learning to narrow the solution space. Learnable SMPLify supports both sequential inference and plug-in post-processing to refine existing image-based estimators. Extensive experiments demonstrate that our method establishes itself as a practical and simple baseline: it achieves nearly 200x faster runtime compared to SMPLify, generalizes well to unseen 3DPW and RICH, and operates in a model-agnostic manner when used as a plug-in tool on LucidAction. The code is available at https://github.com/Charrrrrlie/Learnable-SMPLify.
[72] The 9th AI City Challenge
Zheng Tang,Shuo Wang,David C. Anastasiu,Ming-Ching Chang,Anuj Sharma,Quan Kong,Norimasa Kobori,Munkhjargal Gochoo,Ganzorig Batnasan,Munkh-Erdene Otgonbold,Fady Alnajjar,Jun-Wei Hsieh,Tomasz Kornuta,Xiaolong Li,Yilin Zhao,Han Zhang,Subhashree Radhakrishnan,Arihant Jain,Ratnesh Kumar,Vidya N. Murali,Yuxing Wang,Sameer Satish Pusegaonkar,Yizhou Wang,Sujit Biswas,Xunlei Wu,Zhedong Zheng,Pranamesh Chakraborty,Rama Chellappa
Main category: cs.CV
TL;DR: The ninth AI City Challenge advanced real-world AI applications with four tracks, increased participation, and new benchmarks, promoting reproducibility and innovation in computer vision.
Details
Motivation: To advance the practical applications of computer vision and AI in critical domains such as transportation, industrial automation, and public safety, while promoting innovation and reproducibility through competitive challenges. Method: The challenge included four tracks focusing on 3D multi-camera tracking, video question answering for traffic safety, fine-grained spatial reasoning in warehouse environments, and efficient road object detection using fisheye cameras. Datasets were generated using NVIDIA Omniverse, and a robust evaluation framework was implemented to ensure fairness and mitigate overfitting. Result: The 2025 edition of the challenge experienced a 17% increase in participation, with 245 teams from 15 countries. Over 30,000 dataset downloads were recorded. Several teams achieved top-tier results, setting new benchmarks in their respective tracks. Conclusion: The ninth AI City Challenge successfully advanced real-world applications of computer vision and AI with increased participation and the introduction of four specialized tracks. It established new benchmarks in multiple tasks, ensured fair evaluation, and promoted reproducibility. Abstract: The ninth AI City Challenge continues to advance real-world applications of computer vision and AI in transportation, industrial automation, and public safety. The 2025 edition featured four tracks and saw a 17% increase in participation, with 245 teams from 15 countries registered on the evaluation server. Public release of challenge datasets led to over 30,000 downloads to date. Track 1 focused on multi-class 3D multi-camera tracking, involving people, humanoids, autonomous mobile robots, and forklifts, using detailed calibration and 3D bounding box annotations. Track 2 tackled video question answering in traffic safety, with multi-camera incident understanding enriched by 3D gaze labels. 
Track 3 addressed fine-grained spatial reasoning in dynamic warehouse environments, requiring AI systems to interpret RGB-D inputs and answer spatial questions that combine perception, geometry, and language. Both Track 1 and Track 3 datasets were generated in NVIDIA Omniverse. Track 4 emphasized efficient road object detection from fisheye cameras, supporting lightweight, real-time deployment on edge devices. The evaluation framework enforced submission limits and used a partially held-out test set to ensure fair benchmarking. Final rankings were revealed after the competition concluded, fostering reproducibility and mitigating overfitting. Several teams achieved top-tier results, setting new benchmarks in multiple tasks.
[73] Generative Model-Based Feature Attention Module for Video Action Analysis
Guiqin Wang,Peng Zhao,Cong Zhao,Jing Huang,Siyan Guo,Shusen Yang
Main category: cs.CV
TL;DR: This paper proposes a novel generative attention-based model for video action analysis that improves feature semantics learning, making it highly effective for demanding IoT applications such as autonomous driving.
Details
Motivation: Existing methodologies for video action analysis overlook feature semantics and focus only on optimizing action proposals, which limits their applicability in high-performance IoT environments like autonomous driving. Method: The method involves a generative attention-based model that learns the relation of feature semantics, leveraging the differences between actions' foreground and background to simultaneously learn frame- and segment-dependencies of temporal action feature semantics. Result: The model was evaluated on two benchmark video tasks, action recognition and action detection, showing superior performance on widely recognized datasets. The method was also validated across a broader range of tasks. Conclusion: The proposed generative attention-based model effectively enhances video action analysis by focusing on feature semantics, making it suitable for high-performance IoT applications like autonomous driving. Abstract: Video action analysis is a foundational technology for intelligent video comprehension, particularly for applications in the Internet of Things (IoT). However, existing methodologies overlook feature semantics in feature extraction and focus on optimizing action proposals; their limited precision makes them unsuitable for widespread adoption in high-performance IoT applications such as autonomous driving, which require robust and scalable intelligent video analytics. To address this issue, we propose a novel generative attention-based model to learn the relation of feature semantics. Specifically, by leveraging the differences between actions' foreground and background, our model simultaneously learns the frame- and segment-dependencies of temporal action feature semantics, thereby making effective use of feature semantics during feature extraction. 
To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark video tasks, action recognition and action detection. For action detection, we substantiate the superiority of our approach through comprehensive validation on widely recognized datasets. Moreover, we further validate the effectiveness of our method on a broader task, video action recognition. Our code is available at https://github.com/Generative-Feature-Model/GAF.[74] Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model
Ruixin Zhang,Jiaqing Fan,Yifan Liao,Qian Qiao,Fanzhang Li
Main category: cs.CV
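TL;DR: This paper proposes a new referring video object segmentation model that achieves state-of-the-art performance on multiple benchmarks by optimizing the segmentation head design and improving feature extraction.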
Details
Motivation: Current RVOS methods overemphasize feature extraction and temporal modeling while neglecting the design of the segmentation head, which substantially limits boundary segmentation quality. The paper aims to improve the segmentation head design to raise overall segmentation capability. Method: The approach (1) uses a text-to-video diffusion model for feature extraction; (2) removes the traditional noise prediction module so that noise randomness does not degrade segmentation accuracy; and (3) introduces a Temporal Context Mask Refinement (TCMR) module to improve segmentation quality. Result: The method achieves state-of-the-art performance on four public RVOS benchmarks, demonstrating its effectiveness and superiority. Conclusion: The paper proposes a new temporal-conditional referring video object segmentation model that significantly improves segmentation performance by improving the segmentation head design, adopting a text-to-video diffusion model, removing the traditional noise prediction module, and designing the TCMR module, achieving state-of-the-art results on four public RVOS benchmarks. Abstract: Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling, while relatively neglecting the design of the segmentation head. In fact, there remains considerable room for improvement in segmentation head design. To address this, we propose a Temporal-Conditional Referring Video Object Segmentation model, which innovatively integrates existing segmentation methods to effectively enhance boundary segmentation capability. Furthermore, our model leverages a text-to-video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module to avoid the randomness of noise from degrading segmentation accuracy, thereby simplifying the model while improving performance. Finally, to overcome the limited feature extraction capability of the VAE, we design a Temporal Context Mask Refinement (TCMR) module, which significantly improves segmentation quality without introducing complex designs. We evaluate our method on four public RVOS benchmarks, where it consistently achieves state-of-the-art performance.[75] Bridging Clear and Adverse Driving Conditions
Yoel Shapiro,Yahia Showgan,Koustav Mullick
Main category: cs.CV
TL;DR: A hybrid diffusion-GAN method was developed to generate synthetic adverse weather images, enhancing the performance of autonomous driving systems in challenging conditions.
Details
Motivation: Autonomous Driving systems perform poorly under adverse weather conditions, and the lack of sufficient real-world data under such conditions necessitates a cost-effective alternative for training robust models. Method: The study employed a hybrid diffusion-GAN approach to synthesize adverse weather images from clear-weather images, using a combination of simulated and real data for training. Result: The method achieved a 1.85 percent overall improvement in semantic segmentation performance and a 4.62 percent improvement specifically under nighttime conditions on the ACDC dataset. Conclusion: The proposed hybrid diffusion-GAN approach effectively enhances the robustness of autonomous driving perception systems under adverse environmental conditions. Abstract: Autonomous Driving (AD) systems exhibit markedly degraded performance under adverse environmental conditions, such as low illumination and precipitation. The underrepresentation of adverse conditions in AD datasets makes it challenging to address this deficiency. To circumvent the prohibitive cost of acquiring and annotating adverse weather data, we propose a novel Domain Adaptation (DA) pipeline that transforms clear-weather images into fog, rain, snow, and nighttime images. Here, we systematically develop and evaluate several novel data-generation pipelines, including simulation-only, GAN-based, and hybrid diffusion-GAN approaches, to synthesize photorealistic adverse images from labelled clear images. We leverage an existing DA GAN, extend it to support auxiliary inputs, and develop a novel training recipe that leverages both simulated and real images. The simulated images facilitate exact supervision by providing perfectly matched image pairs, while the real images help bridge the simulation-to-real (sim2real) gap. We further introduce a method to mitigate hallucinations and artifacts in Stable-Diffusion Image-to-Image (img2img) outputs by blending them adaptively with their progenitor images. 
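The abstract does not specify how the img2img outputs are blended with their progenitor images. As a purely illustrative sketch, one plausible form of adaptive blending is a per-pixel mix whose weight drops toward the source image wherever the generated image deviates strongly from it. The function name `adaptive_blend`, the threshold `tau`, and the linear weighting rule are all assumptions made for this example and are not taken from the paper.

```python
def adaptive_blend(source, generated, tau=0.5):
    """Blend a generated image with its source image, pixel by pixel.

    Hypothetical rule (an assumption, not the paper's method): the weight of
    the generated pixel falls linearly from 1 to 0 as its absolute difference
    from the source pixel grows from 0 to tau. Large deviations, which may be
    hallucinations or artifacts, therefore fall back to the source pixel.

    Images are 2D lists of floats in [0, 1] with identical shapes.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    blended = []
    for src_row, gen_row in zip(source, generated):
        row = []
        for s, g in zip(src_row, gen_row):
            alpha = max(0.0, 1.0 - abs(g - s) / tau)  # weight of generated pixel
            row.append(alpha * g + (1.0 - alpha) * s)
        blended.append(row)
    return blended


if __name__ == "__main__":
    src = [[0.2, 0.2, 0.2]]
    gen = [[0.2, 0.45, 0.9]]
    print(adaptive_blend(src, gen))
```

With this rule, a pixel that matches the source is kept as generated, a moderately changed pixel is mixed, and a pixel that changes by more than `tau` reverts to the source value.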
We finetune downstream models on our synthetic data and evaluate them on the Adverse Conditions Dataset with Correspondences (ACDC). We achieve 1.85 percent overall improvement in semantic segmentation, and 4.62 percent on nighttime, demonstrating the efficacy of our hybrid method for robust AD perception under challenging conditions.[76] Towards Efficient Vision State Space Models via Token Merging
Jinyoung Park,Minseok Son,Changick Kim
Main category: cs.CV
TL;DR: MaMe is an efficient token-merging strategy designed for SSM-based vision models that improves computational efficiency while preserving model performance.
Details
Motivation: An effective token reduction method is needed to improve computational efficiency while preserving the performance of SSM-based vision models. Method: MaMe uses the state transition parameter Δ as an informativeness measure and introduces strategic token arrangements to preserve sequential information flow. Result: MaMe achieves favorable efficiency-performance trade-offs across models and tasks, remains robust even under extreme token reduction, and generalizes well to the video and audio domains. Conclusion: MaMe offers an efficient token-merging method for SSM-based models that is suitable for practical use across multiple domains. Abstract: State Space Models (SSMs) have emerged as powerful architectures in computer vision, yet improving their computational efficiency remains crucial for practical and scalable deployment. While token reduction serves as an effective approach for model efficiency, applying it to SSMs requires careful consideration of their unique sequential modeling capabilities. In this work, we propose MaMe, a token-merging strategy tailored for SSM-based vision models. MaMe addresses two key challenges: quantifying token importance and preserving sequential properties. Our approach leverages the state transition parameter $\mathbf{\Delta}$ as an informativeness measure and introduces strategic token arrangements to preserve sequential information flow. Extensive experiments demonstrate that MaMe achieves superior efficiency-performance trade-offs for both fine-tuned and off-the-shelf models. Particularly, our approach maintains robustness even under aggressive token reduction where existing methods undergo significant performance degradation. Beyond image classification, MaMe shows strong generalization capabilities across video and audio domains, establishing an effective approach for enhancing efficiency in diverse SSM applications.[77] Unleashing Semantic and Geometric Priors for 3D Scene Completion
Shiyuan Chen,Wei Sui,Bohao Zhang,Zeyd Boukhers,John See,Cong Yang
Main category: cs.CV
TL;DR: FoundationSSC is a new framework for 3D semantic scene completion that decouples semantic and geometric processing, leading to significant performance improvements.
Details
Motivation: Existing methods for 3D semantic scene completion rely on a coupled encoder, which forces a trade-off between semantic and geometric priors, limiting overall performance. Method: FoundationSSC uses a dual-decoupling framework at both the source and pathway levels for improved semantic and geometric perception. Additionally, an Axis-Aware Fusion module is introduced to better merge features into a unified representation. Result: FoundationSSC achieves +0.23 mIoU and +2.03 IoU improvements on SemanticKITTI and state-of-the-art performance on SSCBench-KITTI-360 with 21.78 mIoU and 48.61 IoU. Conclusion: FoundationSSC provides significant improvements in both semantic and geometric metrics over existing methods. Abstract: Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving and robotic navigation. However, existing methods rely on a coupled encoder to deliver both semantic and geometric priors, which forces the model to make a trade-off between conflicting demands and limits its overall performance. To tackle these challenges, we propose FoundationSSC, a novel framework that performs dual decoupling at both the source and pathway levels. At the source level, we introduce a foundation encoder that provides rich semantic feature priors for the semantic branch and high-fidelity stereo cost volumes for the geometric branch. At the pathway level, these priors are refined through specialised, decoupled pathways, yielding superior semantic context and depth distributions. Our dual-decoupling design produces disentangled and refined inputs, which are then utilised by a hybrid view transformation to generate complementary 3D features. Additionally, we introduce a novel Axis-Aware Fusion (AAF) module that addresses the often-overlooked challenge of fusing these features by anisotropically merging them into a unified representation. 
Extensive experiments demonstrate the advantages of FoundationSSC, achieving simultaneous improvements in both semantic and geometric metrics, surpassing prior bests by +0.23 mIoU and +2.03 IoU on SemanticKITTI. Additionally, we achieve state-of-the-art performance on SSCBench-KITTI-360, with 21.78 mIoU and 48.61 IoU. The code will be released upon acceptance.[78] PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction
Xiaolu Hou,Bing Ma,Jiaxiang Cheng,Xuhua Ren,Kai Yu,Wenyue Li,Tianxiang Zheng,Qinglin Lu
Main category: cs.CV
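TL;DR: PersonaVlog is an automated multimodal framework for personalized Vlog generation that achieves efficient content creation through multi-agent collaboration and a feedback mechanism.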
Details
Motivation: Existing Vlog generation methods rely on predefined scripts and lack dynamism and personal expression; a more automated approach with high personalization and effective multimodal collaboration is needed. Method: The paper proposes a multi-agent collaboration framework based on multimodal large language models, introduces a feedback and rollback mechanism, and proposes the ThemeVlogEval benchmarking framework. Result: Experiments show that PersonaVlog has significant advantages and potential for automated Vlog generation and outperforms several baselines. Conclusion: PersonaVlog is an automated framework that efficiently generates personalized Vlogs, with significant advantages and potential. Abstract: With the growing demand for short videos and personalized content, automated Video Log (Vlog) generation has become a key direction in multimodal content creation. Existing methods mostly rely on predefined scripts, lacking dynamism and personal expression. Therefore, there is an urgent need for an automated Vlog generation approach that enables effective multimodal collaboration and high personalization. To this end, we propose PersonaVlog, an automated multimodal stylized Vlog generation framework that can produce personalized Vlogs featuring videos, background music, and inner monologue speech based on a given theme and reference image. Specifically, we propose a multi-agent collaboration framework based on Multimodal Large Language Models (MLLMs). This framework efficiently generates high-quality prompts for multimodal content creation based on user input, thereby improving the efficiency and creativity of the process. In addition, we incorporate a feedback and rollback mechanism that leverages MLLMs to evaluate and provide feedback on generated results, thereby enabling iterative self-correction of multimodal content. We also propose ThemeVlogEval, a theme-based automated benchmarking framework that provides standardized metrics and datasets for fair evaluation. Comprehensive experiments demonstrate the significant advantages of our framework over several baselines, highlighting its effectiveness and potential for automated Vlog generation.[79] Two-Factor Authentication Smart Entryway Using Modified LBPH Algorithm
Zakiah Ayop,Wan Mohamad Hariz Bin Wan Mohamad Rosdi,Looi Wei Hua,Syarulnaziah Anawar,Nur Fadzilah Othman
Main category: cs.CV
TL;DR: This paper proposes an IoT-based two-factor authentication system for smart entryway access control that combines facial recognition (using LBPH algorithms) and passcode verification. It includes automation for owner alerts and surveillance activation upon detecting strangers, with remote system control via Telegram on Raspberry Pi. The system achieves moderate accuracy and high user acceptance for future deployment.
Details
Motivation: Face mask detection has become crucial, especially during the COVID-19 pandemic. While many face detection models have been developed using IoT in smart entryways, there is still a lack of IoT-based developments for face mask detection. Method: The system uses the Local Binary Patterns Histograms (LBPH) algorithm for full face recognition and a modified LBPH algorithm for occluded face detection. It integrates facial recognition and passcode verification for two-factor authentication and uses Telegram for remote control via a Raspberry Pi platform. Result: On average, the system achieved an Accuracy of approximately 70%, a Precision of approximately 80%, and a Recall of approximately 83.26% across all tested users. Conclusion: The system can perform face recognition and mask detection, automate remote control operations for user registration, door locking/unlocking, and notify the owner. It received high acceptance in user tests for future use. Abstract: Face mask detection has become increasingly important recently, particularly during the COVID-19 pandemic. Many face detection models have been developed in smart entryways using IoT. However, there is a lack of IoT development on face mask detection. This paper proposes a two-factor authentication system for smart entryway access control using facial recognition and passcode verification, together with an automation process that alerts the owner and activates the surveillance system when a stranger is detected and that allows the system to be controlled remotely via Telegram on a Raspberry Pi platform. The system employs the Local Binary Patterns Histograms (LBPH) algorithm for full face recognition and a modified LBPH algorithm for occluded face detection. On average, the system achieved an Accuracy of approximately 70%, a Precision of approximately 80%, and a Recall of approximately 83.26% across all tested users. 
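For readers unfamiliar with LBPH, the sketch below shows the basic 3x3 local binary pattern operator and the histogram built from it. It illustrates only the standard operator; the paper's modified LBPH for occluded faces is not described in the abstract and is not reproduced here. The function names, the neighbour ordering, and the use of `>=` as the comparison are assumptions of this example. Full LBPH recognizers typically also split the face into a grid of cells, concatenate the per-cell histograms, and compare the result against stored histograms with a distance measure.

```python
# Neighbour offsets (row, col), clockwise from the top-left pixel.
# The ordering is a convention chosen for this sketch.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """Return the 8-bit LBP code of the interior pixel at (r, c).

    Each neighbour contributes one bit: 1 if it is at least as bright as
    the centre pixel, 0 otherwise.
    """
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOURS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """Return the 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist


if __name__ == "__main__":
    patch = [[9, 0, 0],
             [0, 5, 0],
             [0, 0, 0]]
    print(lbp_code(patch, 1, 1))  # only the top-left neighbour sets its bit
```

Because each code depends only on comparisons with the centre pixel, the descriptor is insensitive to uniform brightness shifts, which is one reason LBP-based methods are popular on low-power devices.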
The results indicate that the system is capable of conducting face recognition and mask detection, automating the operation of the remote control to register users, locking or unlocking the door, and notifying the owner. In the user acceptance test, the sample participants rated it highly acceptable for future use.[80] TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
Shunian Chen,Hejin Huang,Yexin Liu,Zihan Ye,Pengcheng Chen,Chenghao Zhu,Michael Guan,Rongsheng Wang,Junying Chen,Guanbin Li,Ser-Nam Lim,Harry Yang,Benyou Wang
Main category: cs.CV
TL;DR: This paper introduces TalkVid, a large and diverse dataset for audio-driven talking head synthesis, and TalkVid-Bench for evaluating generalization across demographics.
Details
Motivation: Existing models fail to generalize across human diversity due to limitations in training data scale, quality, and diversity. Method: Constructed a large-scale, high-quality, and diverse dataset named TalkVid, and introduced TalkVid-Bench for stratified evaluation. Result: Models trained on TalkVid outperform those trained on previous datasets in cross-dataset generalization. Conclusion: TalkVid improves audio-driven talking head synthesis generalization across human diversity, and TalkVid-Bench reveals hidden performance disparities. Abstract: Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. 
Code and data can be found at https://github.com/FreedomIntelligence/TalkVid [81] RCGNet: RGB-based Category-Level 6D Object Pose Estimation with Geometric Guidance
Sheng Yu,Di-Hua Zhai,Yuanqing Xia
Main category: cs.CV
TL;DR: This paper proposes an RGB-based category-level object pose estimation method using a transformer network and geometric feature-guided algorithm, achieving high efficiency and superior accuracy without requiring depth data.
Details
Motivation: Current RGB-D-based methods face challenges in environments lacking depth information, motivating the need for an approach that relies solely on RGB images for accurate and practical pose estimation. Method: We propose a transformer-based neural network to predict and fuse geometric features of the target object. A geometric feature-guided algorithm is introduced to enhance geometric representation, and the RANSAC-PnP algorithm is used to compute the object's pose, addressing challenges in variable object scales. Result: Experimental results show that the proposed method is highly efficient and achieves superior accuracy compared to previous RGB-based methods on benchmark datasets. Conclusion: The proposed RGB-based category-level object pose estimation method achieves high efficiency and superior accuracy compared to previous approaches, offering a new perspective for advancing pose estimation without relying on depth data. Abstract: While most current RGB-D-based category-level object pose estimation methods achieve strong performance, they face significant challenges in scenes lacking depth information. In this paper, we propose a novel category-level object pose estimation approach that relies solely on RGB images. This method enables accurate pose estimation in real-world scenarios without the need for depth data. Specifically, we design a transformer-based neural network for category-level object pose estimation, where the transformer is employed to predict and fuse the geometric features of the target object. To ensure that these predicted geometric features faithfully capture the object's geometry, we introduce a geometric feature-guided algorithm, which enhances the network's ability to effectively represent the object's geometric information. Finally, we utilize the RANSAC-PnP algorithm to compute the object's pose, addressing the challenges associated with variable object scales in pose estimation. 
Experimental results on benchmark datasets demonstrate that our approach is not only highly efficient but also achieves superior accuracy compared to previous RGB-based methods. These promising results offer a new perspective for advancing category-level object pose estimation using RGB images.[82] DiffIER: Optimizing Diffusion Models with Iterative Error Reduction
Ao Chen,Lihe Ding,Tianfan Xue
Main category: cs.CV
TL;DR: This paper proposes DiffIER, a novel optimization framework that minimizes errors during inference to improve the quality and stability of conditional generation in diffusion models.
Details
Motivation: The sensitivity of diffusion model outputs to guidance weight in Classifier-Free Guidance (CFG) highlights a training-inference gap, which undermines conditional generation performance. Method: The study introduces DiffIER, an optimization-based framework that minimizes errors at each inference step to address the training-inference gap in diffusion models. Result: Empirical results show that DiffIER outperforms baseline approaches in tasks like text-to-image generation, image super-resolution, and text-to-speech generation. Conclusion: The proposed DiffIER method effectively reduces the accumulated error during inference, enhancing the quality and reliability of conditional generation tasks. Abstract: Diffusion models have demonstrated remarkable capabilities in generating high-quality samples and enhancing performance across diverse domains through Classifier-Free Guidance (CFG). However, the quality of generated samples is highly sensitive to the selection of the guidance weight. In this work, we identify a critical ``training-inference gap'' and we argue that it is the presence of this gap that undermines the performance of conditional generation and renders outputs highly sensitive to the guidance weight. We quantify this gap by measuring the accumulated error during the inference stage and establish a correlation between the selection of guidance weight and minimizing this gap. Furthermore, to mitigate this gap, we propose DiffIER, an optimization-based method for high-quality generation. We demonstrate that the accumulated error can be effectively reduced by an iterative error minimization at each step during inference. By introducing this novel plug-and-play optimization framework, we enable the optimization of errors at every single inference step and enhance generation quality. Empirical results demonstrate that our proposed method outperforms baseline approaches in conditional generation tasks. 
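As background for the guidance-weight sensitivity discussed above, the following minimal sketch shows the commonly used classifier-free guidance combination: the guided noise estimate is the unconditional estimate plus the guidance weight times the difference between the conditional and unconditional estimates. This is not DiffIER itself. The function name `cfg_combine` and the plain-list representation are illustrative assumptions, and some implementations parameterize the weight differently (for example as 1 + w).

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Combine unconditional and conditional noise estimates with weight w.

    Uses the convention eps = eps_uncond + w * (eps_cond - eps_uncond), so
    w = 0 gives the unconditional estimate, w = 1 the conditional one, and
    w > 1 extrapolates beyond the conditional estimate.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("estimates must have the same length")
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    eps_u = [0.0, 0.0]
    eps_c = [1.0, 2.0]
    for w in (0.0, 1.0, 2.0):
        print(w, cfg_combine(eps_u, eps_c, w))
```

Because the output scales linearly with `w`, any error in the difference between the two estimates is amplified by the same factor, which gives some intuition for why sample quality can be sensitive to the choice of guidance weight.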
Furthermore, the method achieves consistent success in text-to-image generation, image super-resolution, and text-to-speech generation, underscoring its versatility and potential for broad applications in future research.[83] OmniTry: Virtual Try-On Anything without Masks
Yutong Feng,Linlin Zhang,Hengyuan Cao,Yiming Chen,Xiaoduan Feng,Jian Cao,Yuxiong Wu,Bin Wang
Main category: cs.CV
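TL;DR: OmniTry proposes a general virtual try-on framework for all kinds of wearable objects, using a two-stage training approach to enable a more practical mask-free setting, and shows superior performance across multiple object categories.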
Details
Motivation: Most existing virtual try-on research focuses on clothing; OmniTry aims to extend virtual try-on to any wearable object, including jewelry and accessories, to make it more flexible in practical applications. Method: OmniTry uses a two-stage pipeline: the first stage trains on large-scale unpaired images for mask-free localization, and the second stage fine-tunes the model on paired images to preserve the consistency of object appearance. Result: OmniTry is evaluated on 12 common classes of wearable objects, and the results show that it outperforms existing methods in object localization and ID preservation. Conclusion: OmniTry is a unified virtual try-on framework that extends beyond clothing to a wide range of wearable objects and performs well in the mask-free setting. Abstract: Virtual Try-ON (VTON) is a practical and widely applied task, for which most existing works focus on clothes. This paper presents OmniTry, a unified framework that extends VTON beyond garments to encompass any wearable object, e.g., jewelry and accessories, in a mask-free setting for more practical application. When extending to various types of objects, data curation is challenging for obtaining paired images, i.e., the object image and the corresponding try-on result. To tackle this problem, we propose a two-stage pipeline: For the first stage, we leverage large-scale unpaired images, i.e., portraits with any wearable items, to train the model for mask-free localization. Specifically, we repurpose the inpainting model to automatically draw objects in suitable positions given an empty mask. For the second stage, the model is further fine-tuned with paired images to transfer the consistency of object appearance. We observed that the model after the first stage shows quick convergence even with few paired samples. OmniTry is evaluated on a comprehensive benchmark consisting of 12 common classes of wearable objects, with both in-shop and in-the-wild images. Experimental results suggest that OmniTry shows better performance on both object localization and ID-preservation compared with existing methods. The code, model weights, and evaluation benchmark of OmniTry will be made publicly available at https://omnitry.github.io/.[84] DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction
Dengxian Gong,Shunping Ji
Main category: cs.CV
TL;DR: DeH4R improves road network graph extraction from remote sensing images by combining graph-generating efficiency with graph-growing dynamics, offering faster and more accurate results.
Details
Motivation: Existing methods either struggle with topology fidelity (segmentation-based), are computationally expensive (graph-growing), or limit dynamic vertex insertion (graph-generating). DeH4R aims to overcome these limitations by combining the strengths of the graph-generating and graph-growing approaches. Method: DeH4R decouples the task into candidate vertex detection, adjacent vertex prediction, initial graph construction, and graph expansion, enabling dynamic vertex and edge insertions. Result: DeH4R achieves state-of-the-art performance on CityScale and SpaceNet benchmarks, outperforming the prior SOTA graph-growing method RNGDet++ by 4.62 APLS and 10.18 IoU on CityScale, while being approximately 10× faster. Conclusion: DeH4R is a novel hybrid model that combines the efficiency of graph-generating methods and the dynamic nature of graph-growing methods for extracting road network graphs from remote sensing imagery. It achieves state-of-the-art performance while being faster than prior methods. Abstract: The automated extraction of complete and precise road network graphs from remote sensing imagery remains a critical challenge in geospatial computer vision. Segmentation-based approaches, while effective in pixel-level recognition, struggle to maintain topology fidelity after vectorization postprocessing. Graph-growing methods build more topologically faithful graphs but suffer from computationally prohibitive iterative ROI cropping. Graph-generating methods first predict global static candidate road network vertices, and then infer possible edges between vertices. They achieve fast topology-aware inference, but limit the dynamic insertion of vertices. To address these challenges, we propose DeH4R, a novel hybrid model that combines graph-generating efficiency and graph-growing dynamics. This is achieved by decoupling the task into candidate vertex detection, adjacent vertex prediction, initial graph construction, and graph expansion. 
This architectural innovation enables dynamic vertex (edge) insertions while retaining fast inference speed and enhancing both topology fidelity and spatial consistency. Comprehensive evaluations on CityScale and SpaceNet benchmarks demonstrate state-of-the-art (SOTA) performance. DeH4R outperforms the prior SOTA graph-growing method RNGDet++ by 4.62 APLS and 10.18 IoU on CityScale, while being approximately 10 $\times$ faster. The code will be made publicly available at https://github.com/7777777FAN/DeH4R.[85] HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
Keliang Li,Hongze Shen,Hao Shi,Ruibing Hou,Hong Chang,Jie Huang,Chenghao Jia,Wen Wang,Yiling Wu,Dongmei Jiang,Shiguang Shan,Xilin Chen
Main category: cs.CV
TL;DR: HumanPCR is an evaluation suite designed to assess multimodal models' ability to understand human-related visual contexts. It reveals significant challenges in current models' performance on tasks like spatial perception, temporal understanding, and mind modeling, highlighting the need for further research and development in human-centric visual understanding.
Details
Motivation: The motivation stems from the pursuit of artificial general intelligence that achieves human-comparable performance across diverse environments, focusing on human-related visual contexts that are often overlooked by existing benchmarks. Method: HumanPCR was developed as an evaluation suite with three hierarchical levels: Perception, Comprehension, and Reasoning. It includes over 6,000 human-verified multiple-choice questions and a manually curated video reasoning test. The evaluation involved over 30 state-of-the-art models. Result: The results showed that even state-of-the-art models face significant challenges in human-centric visual understanding tasks, particularly those involving detailed space perception, temporal understanding, and mind modeling. Advanced techniques like scaling visual contexts and test-time thinking provided only limited benefits. Conclusion: HumanPCR and the associated analysis highlight significant challenges in achieving human-centric visual understanding in multimodal models, especially in detailed spatial perception, temporal understanding, and modeling human minds. The study emphasizes the need for further research and development in extracting proactive visual evidence and reducing reliance on query-guided retrieval. Abstract: The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs' capacity about human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple choice questions, assessing massive tasks of 9 dimensions, including but not limited to essential skills frequently overlooked by existing benchmarks. 
Human-R offers a challenging manually curated video reasoning test that requires integrating multiple pieces of visual evidence, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT) rationales with key visual evidence to support further research. Extensive evaluations on over 30 state-of-the-art models exhibit significant challenges in human-centric visual understanding, particularly in tasks involving detailed space perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R reveals the struggle of models in extracting essential proactive visual evidence from diverse human scenes and their faulty reliance on query-guided retrieval. Even advanced techniques like scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.[86] Diversity-enhanced Collaborative Mamba for Semi-supervised Medical Image Segmentation
Shumeng Li,Jian Zhang,Lei Qi,Luping Zhou,Yinghuan Shi,Yang Gao
Main category: cs.CV
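TL;DR: This paper proposes a diversity-enhanced collaborative Mamba framework (DCMamba) for semi-supervised medical image segmentation that exploits diversity in data, network, and features and significantly outperforms other methods.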
Details
Motivation: Medical image segmentation requires high-quality annotated data, which is tedious and costly to acquire. Semi-supervised segmentation techniques ease this burden by using unlabeled data to generate pseudo labels. Method: The paper proposes a novel Diversity-enhanced Collaborative Mamba framework (DCMamba) that explores and exploits diversity from the data, network, and feature perspectives. Specifically, it uses patch-level weak-strong mixing augmentation at the data level, a diverse-scan collaboration module at the network level, and an uncertainty-weighted contrastive learning mechanism at the feature level. Result: Experiments show that DCMamba significantly outperforms other semi-supervised medical image segmentation methods; on the Synapse dataset with 20% labeled data, it surpasses the latest SSM-based method by 6.69%. Conclusion: By exploiting diversity in data, network, and features, DCMamba performs strongly on semi-supervised medical image segmentation and reduces reliance on large amounts of annotated data. Abstract: Acquiring high-quality annotated data for medical image segmentation is tedious and costly. Semi-supervised segmentation techniques alleviate this burden by leveraging unlabeled data to generate pseudo labels. Recently, advanced state space models, represented by Mamba, have shown efficient handling of long-range dependencies. This drives us to explore their potential in semi-supervised medical image segmentation. In this paper, we propose a novel Diversity-enhanced Collaborative Mamba framework (namely DCMamba) for semi-supervised medical image segmentation, which explores and utilizes the diversity from data, network, and feature perspectives. Firstly, from the data perspective, we develop patch-level weak-strong mixing augmentation with Mamba's scanning modeling characteristics. Moreover, from the network perspective, we introduce a diverse-scan collaboration module, which could benefit from the prediction discrepancies arising from different scanning directions. Furthermore, from the feature perspective, we adopt an uncertainty-weighted contrastive learning mechanism to enhance the diversity of feature representation. Experiments demonstrate that our DCMamba significantly outperforms other semi-supervised medical image segmentation methods, e.g., outperforming the latest SSM-based method by 6.69% on the Synapse dataset with 20% labeled data.[87] Hierarchical Vision-Language Retrieval of Educational Metaverse Content in Agriculture
Ali Abdari,Alex Falcon,Giuseppe Serra
Main category: cs.CV
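TL;DR: This paper introduces AgriMuseums, a new dataset for agriculture-themed Metaverse retrieval, and a hierarchical vision-language model to improve the retrieval of educational Metaverse content.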
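The abstract for this entry reports retrieval quality as R@1 and MRR. As a reminder of how these two metrics are commonly computed, here is a minimal sketch. It assumes each query has exactly one correct item and that `ranks` holds the 1-based position of that item in the returned list; the function names are illustrative and are not taken from the paper's code.

```python
def recall_at_k(ranks, k=1):
    """Fraction of queries whose correct item appears in the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank of the correct item over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the correct museum for four text queries.
ranks = [1, 2, 1, 4]
print(recall_at_k(ranks, k=1))       # 0.5
print(mean_reciprocal_rank(ranks))   # 0.6875
```

MRR rewards near misses (a correct item at rank 2 still contributes 0.5), which is why it is typically higher than R@1 on the same run.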
Details
Motivation: As large amounts of educational content on agriculture and gardening are uploaded online, organizing this content effectively becomes key to improving learning efficiency. The Metaverse offers an interactive and immersive environment that enhances educational experiences, but effective means of retrieving Metaverse scenarios that match users' interests are still lacking. Method: The paper introduces a new dataset of 457 agriculture-themed virtual museums (AgriMuseums) and a hierarchical vision-language model that represents and retrieves relevant AgriMuseums from natural language queries. Result: Experiments show that the method performs well on the retrieval task, reaching up to about 62% R@1 and 78% MRR (Mean Reciprocal Rank), and improves results on existing benchmarks by up to 6% R@1 and 11% MRR. Conclusion: The paper proposes a hierarchical vision-language model and introduces a new dataset of agriculture-themed virtual museums (AgriMuseums) to address the shortage of datasets for Metaverse scenario retrieval. Experiments show that the method outperforms existing approaches in retrieval performance. Abstract: Every day, a large amount of educational content is uploaded online across different areas, including agriculture and gardening. When these videos or materials are grouped meaningfully, they can make learning easier and more effective. One promising way to organize and enrich such content is through the Metaverse, which allows users to explore educational experiences in an interactive and immersive environment. However, searching for relevant Metaverse scenarios and finding those matching users' interests remains a challenging task. A first step in this direction has been taken recently, but existing datasets are small and not sufficient for training advanced models. In this work, we make two main contributions: first, we introduce a new dataset containing 457 agricultural-themed virtual museums (AgriMuseums), each enriched with textual descriptions; and second, we propose a hierarchical vision-language model to represent and retrieve relevant AgriMuseums using natural language queries. In our experimental setting, the proposed method achieves up to about 62% R@1 and 78% MRR, confirming its effectiveness, and it also leads to improvements on existing benchmarks by up to 6% R@1 and 11% MRR. Moreover, an extensive evaluation validates our design choices. Code and dataset are available at https://github.com/aliabdari/Agricultural_Metaverse_Retrieval .[88] Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance
Yiming Cao,Yanjie Li,Kaisheng Liang,Yuni Lai,Bin Xiao
Main category: cs.CV
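TL;DR: This paper proposes an attack method called IPGA that exploits the intermediate stage of the projector module in vision-language models (VLMs) to achieve more precise, fine-grained adversarial attacks.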
Details
Motivation: Current adversarial attack methods mainly perturb images at the encoder level to maximize global similarity with the target text or reference image. This limits fine-grained control over the attack and overlooks the role of the projector module, which restricts attack effectiveness. Method: The paper proposes a new adversarial attack, the Intermediate Projector Guided Attack (IPGA), which uses the Q-Former module to transform global image embeddings into fine-grained visual features, enabling precise control over the perturbation. It also proposes Residual Query Alignment (RQA) to preserve visual content unrelated to the target. Result: IPGA outperforms existing methods across a range of vision-language tasks, including global image captioning and fine-grained visual question answering, and transfers successfully to several commercial VLMs such as Google Gemini and OpenAI GPT. Conclusion: By operating on the intermediate stage of the projector module, IPGA yields more precise and controllable adversarial perturbations, improving attack effectiveness and cross-model transferability. Abstract: Targeted adversarial attacks are essential for proactively identifying security flaws in Vision-Language Models before real-world deployment. However, current methods perturb images to maximize global similarity with the target text or reference image at the encoder level, collapsing rich visual semantics into a single global vector. This limits attack granularity, hindering fine-grained manipulations such as modifying a car while preserving its background. Furthermore, these methods largely overlook the projector module, a critical semantic bridge between the visual encoder and the language model in VLMs, thereby failing to disrupt the full vision-language alignment pipeline within VLMs and limiting attack effectiveness. To address these issues, we propose the Intermediate Projector Guided Attack (IPGA), the first method to attack using the intermediate stage of the projector module, specifically the widely adopted Q-Former, which transforms global image embeddings into fine-grained visual features. This enables more precise control over adversarial perturbations by operating on semantically meaningful visual tokens rather than a single global representation. Specifically, IPGA leverages the Q-Former pretrained solely on the first vision-language alignment stage, without LLM fine-tuning, which improves both attack effectiveness and transferability across diverse VLMs. Furthermore, we propose Residual Query Alignment (RQA) to preserve unrelated visual content, thereby yielding more controlled and precise adversarial manipulations. 
Extensive experiments show that our attack method consistently outperforms existing methods in both standard global image captioning tasks and fine-grained visual question-answering tasks in a black-box setting. Additionally, IPGA successfully transfers to multiple commercial VLMs, including Google Gemini and OpenAI GPT.[89] Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
Yeji Park,Minyoung Lee,Sanghyuk Chun,Junsuk Choe
Main category: cs.CV
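TL;DR: This paper proposes FOCUS, a decoding strategy that mitigates cross-image information leakage when large vision-language models process multi-image inputs, without requiring training or architectural changes.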
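The decoding procedure described in this entry's abstract (mask all but one image with noise, repeat per image, aggregate the logits, then contrast against a noise-only reference) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `logits_fn` is a hypothetical stand-in for one forward pass of an LVLM, images are flat lists of floats, aggregation is a plain mean, and the contrastive rule `a + alpha * (a - r)` with weight `alpha` is one common choice that the abstract does not specify.

```python
import random


def focus_logits(images, logits_fn, alpha=1.0, seed=0):
    """Toy sketch of noise-masked, contrastively refined decoding.

    images:    list of images, each a flat list of floats (assumed format)
    logits_fn: hypothetical callable mapping a list of images to a list of logits
    alpha:     assumed contrast strength; not a value from the paper
    """
    rng = random.Random(seed)
    # One random-noise replacement per image, same length as the original.
    noise = [[rng.gauss(0.0, 1.0) for _ in img] for img in images]

    per_image = []
    for i in range(len(images)):
        # Keep image i clean and replace every other image with noise.
        masked = [images[j] if j == i else noise[j] for j in range(len(images))]
        per_image.append(logits_fn(masked))

    # Assumed aggregation: element-wise mean over the per-image logits.
    n = len(per_image)
    aggregated = [sum(col) / n for col in zip(*per_image)]

    # Noise-only reference pass, used to subtract what the model
    # predicts without any clean visual evidence.
    reference = logits_fn(noise)
    return [a + alpha * (a - r) for a, r in zip(aggregated, reference)]
```

With `alpha=0.0` the function reduces to the plain average of the per-image logits, which makes the effect of the contrastive term easy to isolate in experiments.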
Details
Motivation: Large vision-language models perform well on single-image tasks, but their performance drops significantly on multi-image inputs because visual cues from different images become entangled in the output, a phenomenon called cross-image information leakage. Method: FOCUS is a training-free, architecture-agnostic decoding strategy. It sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. The process is repeated for every target image to obtain logits under partially masked contexts. The logits are then contrastively refined against a noise-only reference input, which suppresses leakage and improves output accuracy. Result: FOCUS consistently improves performance across four multi-image benchmarks and several LVLM families, demonstrating its effectiveness as a general and practical solution. Conclusion: The paper presents FOCUS as a general and practical way to improve multi-image reasoning in large vision-language models without additional training or architectural modifications. Abstract: Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.[90] MR6D: Benchmarking 6D Pose Estimation for Mobile Robots
Anas Gouda,Shrutarv Awasthi,Christian Blesing,Lokeshwaran Manohar,Frank Hoffmann,Alice Kirchheim
Main category: cs.CV
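TL;DR: MR6D is a new dataset for 6D pose estimation for mobile robots in industrial environments; it exposes the limitations of existing methods when facing challenges specific to mobile platforms.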
Details
Motivation: Existing 6D pose estimation datasets focus mainly on small household objects and overlook the main challenges mobile robots face in industrial environments, such as long-range perception, complex occlusion, and diverse camera perspectives. Method: The authors build a dataset of 92 real-world industrial scenes featuring 16 unique objects in static and dynamic interactions, capturing the challenges specific to mobile platforms. Result: Initial experiments show that current 6D pose estimation methods perform poorly on MR6D, with 2D segmentation posing a particular difficulty. Conclusion: MR6D provides a new evaluation benchmark for 6D pose estimation for mobile robots in industrial environments and reveals the shortcomings of existing methods in handling mobile-platform-specific challenges. Abstract: Existing 6D pose estimation datasets primarily focus on small household objects typically handled by robot arm manipulators, limiting their relevance to mobile robotics. Mobile platforms often operate without manipulators, interact with larger objects, and face challenges such as long-range perception, heavy self-occlusion, and diverse camera perspectives. While recent models generalize well to unseen objects, evaluations remain confined to household-like settings that overlook these factors. We introduce MR6D, a dataset designed for 6D pose estimation for mobile robots in industrial environments. It includes 92 real-world scenes featuring 16 unique objects across static and dynamic interactions. MR6D captures the challenges specific to mobile platforms, including distant viewpoints, varied object configurations, larger object sizes, and complex occlusion/self-occlusion patterns. Initial experiments reveal that current 6D pipelines underperform in these settings, with 2D segmentation being another hurdle. MR6D establishes a foundation for developing and evaluating pose estimation methods tailored to the demands of mobile robotics. The dataset is available at https://huggingface.co/datasets/anas-gouda/mr6d.[91] Shape-from-Template with Generalised Camera
Agniva Sengupta,Stefan Zachow
Main category: cs.CV
TL;DR: This paper introduces new methods for non-rigidly registering 3D shapes to 2D keypoints observed by multiple cameras, improving reconstruction accuracy by utilizing mutual constraints between multiple views of a deforming object. The methods are validated on synthetic and real data.
Details
Motivation: The motivation behind this research is that while SfT has been widely studied using single images, extending it to incorporate information from multiple cameras opens new possibilities for applications like 3D shape registration in medical imaging and hand-held cameras. The use of a generalised camera model allows for greater flexibility in capturing deforming objects from multiple perspectives. Method: The authors propose multiple approaches for SfT: a first approach where corresponded keypoints lie on a direction vector from a known 3D point in space, a second approach where the keypoints lie on a direction vector from an unknown 3D point with known orientation, and a third approach where the silhouette of the imaged object is also known. The correspondence-based approaches are solved using convex programming, while the silhouette-based approach involves iterative refinement. Result: The key result is that the proposed approaches for SfT with generalised cameras offer improved reconstruction accuracy by leveraging additional information from mutual constraints between multiple views of a deformed object. The methods are demonstrated to be accurate on both synthetic and real data. Conclusion: The paper concludes that their proposed approaches for Shape-from-Template (SfT) with generalised cameras provide improved reconstruction accuracy by estimating deformed shapes using additional information from mutual constraints between multiple views of a deformed object. Abstract: This article presents a new method for non-rigidly registering a 3D shape to 2D keypoints observed by a constellation of multiple cameras. 
Non-rigid registration of a 3D shape to observed 2D keypoints, i.e., Shape-from-Template (SfT), has been widely studied using single images, but SfT with information from multiple cameras jointly opens new directions for extending the scope of known use-cases such as 3D shape registration in medical imaging and registration from hand-held cameras, to name a few. We represent such a multi-camera setup with the generalised camera model; therefore any collection of perspective or orthographic cameras observing any deforming object can be registered. We propose multiple approaches for such SfT: the first approach where the corresponded keypoints lie on a direction vector from a known 3D point in space, the second approach where the corresponded keypoints lie on a direction vector from an unknown 3D point in space but with known orientation w.r.t. some local reference frame, and a third approach where, apart from correspondences, the silhouette of the imaged object is also known. Together, these form the first set of solutions to the SfT problem with generalised cameras. The key idea behind SfT with generalised camera is the improved reconstruction accuracy from estimating deformed shape while utilising the additional information from the mutual constraints between multiple views of a deformed object. The correspondence-based approaches are solved with convex programming while the silhouette-based approach is an iterative refinement of the results from the convex solutions. We demonstrate the accuracy of our proposed methods on many synthetic and real data.[92] VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization
Jiajing Lin,Shu Jiang,Qingyuan Zeng,Zhenzhong Wang,Min Jiang
Main category: cs.CV
TL;DR: VisionLaw is a new framework that effectively infers interpretable expressions of intrinsic dynamics from visual observations, outperforming current methods and showing strong generalization capabilities.
Details
Motivation: Existing methods face challenges in generalizing to complex scenarios or providing interpretable models of intrinsic dynamics, prompting the need for a more effective and interpretable approach. Method: VisionLaw uses a bilevel optimization framework with an LLMs-driven decoupled constitutive evolution strategy at the upper level and a vision-guided constitutive evaluation mechanism at the lower level. Result: Experiments show VisionLaw can infer interpretable intrinsic dynamics and significantly outperforms state-of-the-art methods. Conclusion: VisionLaw effectively infers interpretable intrinsic dynamics from visual observations, outperforming existing methods and showing strong generalization for interactive simulation in new scenarios. Abstract: The intrinsic dynamics of an object governs its physical behavior in the real world, playing a critical role in enabling physically plausible interactive simulation with 3D assets. Existing methods have attempted to infer the intrinsic dynamics of objects from visual observations, but generally face two major challenges: one line of work relies on manually defined constitutive priors, making it difficult to generalize to complex scenarios; the other models intrinsic dynamics using neural networks, resulting in limited interpretability and poor generalization. To address these challenges, we propose VisionLaw, a bilevel optimization framework that infers interpretable expressions of intrinsic dynamics from visual observations. At the upper level, we introduce an LLMs-driven decoupled constitutive evolution strategy, where LLMs are prompted as a knowledgeable physics expert to generate and revise constitutive laws, with a built-in decoupling mechanism that substantially reduces the search complexity of LLMs. 
At the lower level, we introduce a vision-guided constitutive evaluation mechanism, which utilizes visual simulation to evaluate the consistency between the generated constitutive law and the underlying intrinsic dynamics, thereby guiding the upper-level evolution. Experiments on both synthetic and real-world datasets demonstrate that VisionLaw can effectively infer interpretable intrinsic dynamics from visual observations. It significantly outperforms existing state-of-the-art methods and exhibits strong generalization for interactive simulation in novel scenarios.[93] A Fully Transformer Based Multimodal Framework for Explainable Cancer Image Segmentation Using Radiology Reports
Enobong Adahada,Isabel Sassoon,Kate Hone,Yongmin Li
Main category: cs.CV
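TL;DR: Med-CTX is a Transformer-based multimodal framework for explainable breast cancer ultrasound segmentation that incorporates clinical radiology reports to improve performance and interpretability.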
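The abstract for this entry reports segmentation quality as Dice and IoU. As a reminder of how these overlap metrics are usually computed for a pair of binary masks, here is a minimal sketch. The function name and the flat 0/1 mask format are illustrative assumptions and are not taken from Med-CTX.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    # Convention assumed here: two empty masks count as a perfect match.
    dice = 2 * inter / (pred_area + target_area) if (pred_area + target_area) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


# Predicted and reference masks overlap in one of three foreground pixels.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice)  # 0.5
print(iou)   # 0.3333333333333333
```

For a single mask pair the two metrics are tied by Dice = 2 * IoU / (1 + IoU); scores averaged over a dataset need not satisfy this identity exactly.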
Details
Motivation: To improve the performance and interpretability of breast cancer ultrasound segmentation, the paper introduces Med-CTX, a fully Transformer-based multimodal framework. Method: Med-CTX uses a dual-branch visual encoder that combines ViT and Swin transformers with uncertainty-aware fusion. Clinical language is encoded by BioClinicalBERT and combined with visual features through cross-modal attention. Result: On the BUS-BRA dataset, Med-CTX achieves a Dice score of 99% and an IoU of 95%. Clinical text plays a key role in segmentation accuracy and explanation quality: ablation studies show a 5.4% drop in Dice score and a 31% drop in CIDEr. Conclusion: Med-CTX is a Transformer-based multimodal framework that offers an explainable solution for breast cancer ultrasound segmentation, improving the trustworthiness and transparency of computer-aided diagnosis. Abstract: We introduce Med-CTX, a fully transformer based multimodal framework for explainable breast cancer ultrasound segmentation. We integrate clinical radiology reports to boost both performance and interpretability. Med-CTX achieves exact lesion delineation by using a dual-branch visual encoder that combines ViT and Swin transformers, as well as uncertainty aware fusion. Clinical language structured with BI-RADS semantics is encoded by BioClinicalBERT and combined with visual features utilising cross-modal attention, allowing the model to provide clinically grounded, model generated explanations. Our methodology generates segmentation masks, uncertainty maps, and diagnostic rationales all at once, increasing confidence and transparency in computer assisted diagnosis. On the BUS-BRA dataset, Med-CTX achieves a Dice score of 99% and an IoU of 95%, beating existing baselines U-Net, ViT, and Swin. Clinical text plays a key role in segmentation accuracy and explanation quality, as evidenced by ablation studies that show a 5.4% decline in Dice score and a 31% decline in CIDEr. Med-CTX achieves good multimodal alignment (CLIP score: 85%) and increased confidence calibration (ECE: 3.2%), setting a new bar for trustworthy, multimodal medical architecture.[94] Timestep-Compressed Attack on Spiking Neural Networks through Timestep-Level Backpropagation
Donghwa Kang,Doohyun Kim,Sang-Ki Ko,Jinkyu Lee,Hyeongboo Baek,Brent ByungHoon Kang
Main category: cs.CV
TL;DR: This paper proposes TCA, a novel framework for reducing attack latency in adversarial attacks on spiking neural networks while maintaining attack success rates.
Details
Motivation: SOTA gradient-based adversarial attacks on SNNs suffer from substantial attack latency due to multi-timestep processing, making them unsuitable for real-time applications. Method: The proposed TCA framework uses timestep-level backpropagation (TLBP) and adversarial membrane potential reuse (A-MPR) to reduce attack latency. Result: Experiments on VGG-11 and ResNet-17 with CIFAR-10/100 and CIFAR10-DVS datasets show that TCA reduces attack latency by up to 56.6% and 57.1% in white-box and black-box settings, respectively. Conclusion: TCA significantly reduces attack latency in SNNs while maintaining a comparable attack success rate to SOTA methods. Abstract: State-of-the-art (SOTA) gradient-based adversarial attacks on spiking neural networks (SNNs), which largely rely on extending FGSM and PGD frameworks, face a critical limitation: substantial attack latency from multi-timestep processing, rendering them infeasible for practical real-time applications. This inefficiency stems from their design as direct extensions of ANN paradigms, which fail to exploit key SNN properties. In this paper, we propose the timestep-compressed attack (TCA), a novel framework that significantly reduces attack latency. TCA introduces two components founded on key insights into SNN behavior. First, timestep-level backpropagation (TLBP) is based on our finding that global temporal information in backpropagation to generate perturbations is not critical for an attack's success, enabling per-timestep evaluation for early stopping. Second, adversarial membrane potential reuse (A-MPR) is motivated by the observation that initial timesteps are inefficiently spent accumulating membrane potential, a warm-up phase that can be pre-calculated and reused. 
Our experiments on VGG-11 and ResNet-17 with the CIFAR-10/100 and CIFAR10-DVS datasets show that TCA significantly reduces the required attack latency by up to 56.6% and 57.1% compared to SOTA methods in white-box and black-box settings, respectively, while maintaining a comparable attack success rate.[95] Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering
Diaa Addeen Abuhani,Marco Seccaroni,Martina Mazzarello,Imran Zualkernan,Fabio Duarte,Carlo Ratti
Main category: cs.CV
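TL;DR: This paper proposes a new unsupervised method that combines street-level imagery with spatial planting patterns to estimate urban tree biodiversity efficiently and at low cost, making it suitable for cities that lack detailed data.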
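The abstract for this entry evaluates results with the Shannon and Simpson diversity indices. A minimal sketch of both follows, computed from a list of per-tree group labels. It assumes the labels are genus names or, in a label-free setting, cluster IDs used as stand-ins; the Simpson index is given in its Gini-Simpson form (1 minus the sum of squared proportions), which is one of several conventions in use. Function names and the sample data are illustrative.

```python
import math
from collections import Counter


def shannon_index(labels):
    """Shannon diversity H = -sum(p_i * ln p_i) over group proportions p_i."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def simpson_index(labels):
    """Gini-Simpson diversity 1 - sum(p_i^2) over group proportions p_i."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Hypothetical street segment: two maples, one oak, one linden.
trees = ["Acer", "Acer", "Quercus", "Tilia"]
print(round(shannon_index(trees), 4))  # 1.0397
print(simpson_index(trees))            # 0.625
```

Both indices depend only on group proportions, so they can be computed from cluster assignments without knowing which genus each cluster corresponds to.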
Details
Motivation: Urban tree biodiversity is critical for climate resilience, ecological stability, and urban livability, yet most municipalities lack detailed knowledge of their canopies. Traditional field-based surveys are costly and time-consuming, while existing supervised AI methods require labeled data and generalize poorly across regions. Method: The approach integrates visual embeddings from street-level imagery with spatial planting patterns in an unsupervised clustering framework to estimate biodiversity. Result: In tests on eight North American cities, the method recovers genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for the Shannon and Simpson indices while preserving spatial autocorrelation. Conclusion: The paper presents a label-free clustering framework for estimating urban tree biodiversity, enabling biodiversity mapping in cities that lack detailed inventories and supporting equitable access to green space and adaptive management of urban ecosystems. Abstract: Urban tree biodiversity is critical for climate resilience, ecological stability, and livability in cities, yet most municipalities lack detailed knowledge of their canopies. Field-based inventories provide reliable estimates of Shannon and Simpson diversity but are costly and time-consuming, while supervised AI methods require labeled data that often fail to generalize across regions. We introduce an unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without labels. Applied to eight North American cities, the method recovers genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices and preserving spatial autocorrelation. This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and offers a pathway for continuous, low-cost monitoring to support equitable access to greenery and adaptive management of urban ecosystems.[96] Self-Aware Adaptive Alignment: Enabling Accurate Perception for Intelligent Transportation Systems
Tong Xiang,Hongxia Zhao,Fenghua Zhu,Yuanyuan Chen,Yisheng Lv
Main category: cs.CV
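TL;DR: This paper proposes SA3, a new cross-domain object detection method that significantly improves detection performance through an efficient alignment mechanism and recognition strategy.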
Details
Motivation: Cross-domain detection in intelligent transportation still faces many challenges, and more efficient methods are needed to improve performance. Method: The paper proposes Self-Aware Adaptive Alignment (SA3), which uses an attention-based alignment module and an instance-to-image level alignment module to achieve adaptive alignment between the source and target domains. Result: Experiments show that SA3 achieves better results than existing methods on popular cross-domain object detection benchmarks. Conclusion: SA3 performs well in cross-domain object detection and outperforms previous state-of-the-art methods. Abstract: Achieving top-notch performance in Intelligent Transportation detection is a critical research area. However, many challenges still need to be addressed when it comes to detecting in a cross-domain scenario. In this paper, we propose Self-Aware Adaptive Alignment (SA3), which leverages an efficient alignment mechanism and recognition strategy. Our proposed method employs a specified attention-based alignment module trained on source and target domain datasets to guide the image-level feature alignment process, enabling the local-global adaptive alignment between the source domain and target domain. Features from both domains, whose channel importance is re-weighted, are fed into the region proposal network, which facilitates the acquisition of salient region features. Also, we introduce an instance-to-image level alignment module specific to the target domain to adaptively mitigate the domain gap. To evaluate the proposed method, extensive experiments have been conducted on popular cross-domain object detection benchmarks. Experimental results show that SA3 achieves superior results to the previous state-of-the-art methods.[97] SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation
Paul Grimal,Michaël Soumm,Hervé Le Borgne,Olivier Ferret,Akihiro Sugimoto
Main category: cs.CV
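TL;DR: This study proposes an improved text-to-image generation method that explicitly models the signal component and offers fine-grained control, improving alignment with text prompts while remaining training-free and broadly compatible.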
Details
Motivation: Current text-to-image models produce visually impressive results but often struggle to align precisely with text prompts, leading to missing key elements or unintended blending of distinct concepts. Method: The method explicitly models the signal component during denoising, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts, and it supports additional conditioning modalities (such as bounding boxes) to enhance spatial alignment. Result: Extensive experiments show that the method outperforms current state-of-the-art methods. Conclusion: The study proposes a new text-to-image generation method that aligns better with text prompts while being training-free and broadly compatible. Abstract: State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities -- such as bounding boxes -- for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at https://github.com/grimalPaul/gsn-factory.[98] RED.AI Id-Pattern: First Results of Stone Deterioration Patterns with Multi-Agent Systems
Daniele Corradetti,José Delgado Rodrigues
Main category: cs.CV
TL;DR: The Id-Pattern system in the RED.AI project uses a multi-agent AI to automate and enhance the diagnosis of stone deterioration patterns, outperforming traditional methods.
Details
Motivation: Traditional methodologies for identifying stone deterioration patterns are accurate but time-consuming and resource-intensive, necessitating an automated solution. Method: The system uses a multi-agent AI approach based on a cognitive architecture that orchestrates five specialized agents: a lithologist, a pathologist, an environmental expert, a conservator-restorer, and a diagnostic coordinator. Result: The system demonstrated a significant improvement across all metrics when tested on 28 challenging images involving multiple deterioration patterns. Conclusion: The Id-Pattern system successfully simulates expert collaboration and significantly improves the diagnosis of stone deterioration patterns compared to the foundational model. Abstract: The Id-Pattern system within the RED.AI project (Reabilitação Estrutural Digital através da AI) consists of an agentic system designed to assist in the identification of stone deterioration patterns. Traditional methodologies, based on direct observation by expert teams, are accurate but costly in terms of time and resources. The system developed here introduces and evaluates a multi-agent artificial intelligence (AI) system, designed to simulate collaboration between experts and automate the diagnosis of stone pathologies from visual evidence. The approach is based on a cognitive architecture that orchestrates a team of specialized AI agents which, in this specific case, are limited to five: a lithologist, a pathologist, an environmental expert, a conservator-restorer, and a diagnostic coordinator. To evaluate the system we selected 28 difficult images involving multiple deterioration patterns. Our first results showed a huge boost on all metrics of our system compared to the foundational model.[99] RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection
Matthias Neuwirth-Trapp,Maarten Bieshaar,Danda Pani Paudel,Luc Van Gool
Main category: cs.CV
TL;DR: This paper introduces realistic benchmarks for Incremental Learning (IL) in object detection, highlighting the shortcomings of current IL methods and demonstrating that replaying previous data and individual training outperform these methods.
Details
Motivation: The motivation is to address the limitations of synthetic and simplified benchmarks in capturing real-world performance of Incremental Learning (IL) methods by introducing more realistic benchmarks. Method: The researchers introduced two Realistic Incremental Object Detection Benchmarks (RICO), namely Domain RICO (D-RICO) and Expanding-Classes RICO (EC-RICO), built from 14 diverse datasets, and conducted experiments comparing IL methods, replaying previous data, and individual training. Result: The experiments showed that all IL methods underperform in adaptability and retention, replaying a small amount of previous data outperforms all IL methods, and individual training on the data remains superior. Conclusion: The study concludes that while IL methods struggle with adaptability and retention, replaying previous data outperforms these methods, with individual training remaining superior, attributed to issues like weak teachers in distillation and insufficient plasticity. Abstract: Incremental Learning (IL) trains models sequentially on new data without full retraining, offering privacy, efficiency, and scalability. IL must balance adaptability to new data with retention of old knowledge. However, evaluations often rely on synthetic, simplified benchmarks, obscuring real-world IL performance. To address this, we introduce two Realistic Incremental Object Detection Benchmarks (RICO): Domain RICO (D-RICO) features domain shifts with a fixed class set, and Expanding-Classes RICO (EC-RICO) integrates new domains and classes per IL step. Built from 14 diverse datasets covering real and synthetic domains, varying conditions (e.g., weather, time of day), camera sensors, perspectives, and labeling policies, both benchmarks capture challenges absent in existing evaluations. Our experiments show that all IL methods underperform in adaptability and retention, while replaying a small amount of previous data already outperforms all methods. 
However, individual training on the data remains superior. We heuristically attribute this gap to weak teachers in distillation, single models' inability to manage diverse tasks, and insufficient plasticity. Our code will be made publicly available.[100] In-hoc Concept Representations to Regularise Deep Learning in Medical Imaging
Valentina Corbetta,Floris Six Dijkstra,Regina Beets-Tan,Hoel Kervadec,Kristoffer Wickstrøm,Wilson Silva
Main category: cs.CV
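TL;DR: LCRReg is a novel regularisation method that uses latent concept representations to improve the robustness and generalisation of deep learning models in medical imaging.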
Details
Motivation: Deep learning models in medical imaging often rely on spurious correlations rather than clinically meaningful features, leading to poor generalisation under distribution shift. Method: The paper introduces LCRReg, a regularisation method based on Latent Concept Representations (such as CAVs). It uses an auxiliary dataset to synthesise high-quality, disentangled concept examples and guides a CNN to activate within the corresponding latent subspaces. Result: Evaluated on synthetic and real-world medical tasks, including binary diabetic retinopathy classification, LCRReg significantly improves robustness to injected spurious correlations and out-of-distribution generalisation. Conclusion: LCRReg is a lightweight strategy that improves model robustness without dense concept supervision and applies across architectures. Abstract: Deep learning models in medical imaging often achieve strong in-distribution performance but struggle to generalise under distribution shifts, frequently relying on spurious correlations instead of clinically meaningful features. We introduce LCRReg, a novel regularisation approach that leverages Latent Concept Representations (LCRs) (e.g., Concept Activation Vectors (CAVs)) to guide models toward semantically grounded representations. LCRReg requires no concept labels in the main training set and instead uses a small auxiliary dataset to synthesise high-quality, disentangled concept examples. We extract LCRs for predefined relevant features, and incorporate a regularisation term that guides a Convolutional Neural Network (CNN) to activate within latent subspaces associated with those concepts. We evaluate LCRReg across synthetic and real-world medical tasks. On a controlled toy dataset, it significantly improves robustness to injected spurious correlations and remains effective even in multi-concept and multiclass settings. On the diabetic retinopathy binary classification task, LCRReg enhances performance under both synthetic spurious perturbations and out-of-distribution (OOD) generalisation. Compared to baselines, including multitask learning, linear probing, and post-hoc concept-based models, LCRReg offers a lightweight, architecture-agnostic strategy for improving model robustness without requiring dense concept supervision. Code is available at the following link: https://github.com/Trustworthy-AI-UU-NKI/lcr_regularization[101] Forecasting Smog Events Using ConvLSTM: A Spatio-Temporal Approach for Aerosol Index Prediction in South Asia
Taimur Khan
Main category: cs.CV
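TL;DR: This study uses satellite data and a deep learning model to forecast smog events in South Asia, although the model still has room for improvement.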
Details
Motivation: Over the past decade, crop residue burning, motor vehicle emissions, and changing weather patterns have intensified smog events in South Asia, creating a need for real-time forecasting systems at regional scale. Method: Sentinel-5P air constituent data from 2019-2023 and a Convolutional Long Short-Term Memory (ConvLSTM) neural network are used to forecast Aerosol Index events. Result: Using the Ultraviolet Aerosol Index at 340-380 nm as the predictor, the results show that the Aerosol Index can be forecast at five-day intervals, with a Mean Squared Error of about 0.0018, a loss of about 0.3995, and a Structural Similarity Index of about 0.74. Conclusion: The study concludes that a ConvLSTM neural network with Sentinel-5P air constituent data can effectively forecast aerosol events, but there is room for improvement by integrating additional data and refining the model architecture. Abstract: The South Asian Smog refers to the recurring annual air pollution events marked by high contaminant levels, reduced visibility, and significant socio-economic impacts, primarily affecting the Indo-Gangetic Plains (IGP) from November to February. Over the past decade, increased air pollution sources such as crop residue burning, motor vehicles, and changing weather patterns have intensified these smog events. However, real-time forecasting systems for increased particulate matter concentrations are still not established at regional scale. The Aerosol Index, closely tied to smog formation and a key component in calculating the Air Quality Index (AQI), reflects particulate matter concentrations. This study forecasts aerosol events using Sentinel-5P air constituent data (2019-2023) and a Convolutional Long Short-Term Memory (ConvLSTM) neural network, which captures spatial and temporal correlations more effectively than previous models. Using the Ultraviolet (UV) Aerosol Index at 340-380 nm as the predictor, results show the Aerosol Index can be forecasted at five-day intervals with a Mean Squared Error of ~0.0018, loss of ~0.3995, and Structural Similarity Index of ~0.74. While effective, the model can be improved by integrating additional data and refining its architecture.
[102] SCRNet: Spatial-Channel Regulation Network for Medical Ultrasound Image Segmentation
Weixin Xu,Ziliang Wang
Main category: cs.CV
TL;DR: The paper proposes a novel framework called SCRNet for medical ultrasound image segmentation, combining convolution and cross-attention mechanisms to better capture both long-range dependencies and local contextual information, resulting in superior performance compared to existing methods.
Details
Motivation: Medical ultrasound image segmentation is a challenging task in computer vision. Traditional methods based on CNNs and Transformers have limitations in capturing long-range dependencies and local contextual information, which the proposed method aims to overcome. Method: The method involves designing a Feature Aggregation Module (FAM) that processes two input features through a Convolution and Cross-Attention Parallel Module (CCAPM), and integrating FAM into the Spatial-Channel Regulation Module (SCRM) within the UNet architecture encoder block. Result: Extensive experiments show that the proposed SCRNet achieves state-of-the-art (SOTA) performance compared to existing medical image segmentation methods. Conclusion: The proposed SCRNet framework, incorporating the Feature Aggregation Module (FAM) and the Spatial-Channel Regulation Module (SCRM), demonstrates superior performance in medical ultrasound image segmentation. Abstract: Medical ultrasound image segmentation presents a formidable challenge in the realm of computer vision. Traditional approaches rely on Convolutional Neural Networks (CNNs) and Transformer-based methods to address the intricacies of medical image segmentation. Nevertheless, inherent limitations persist, as CNN-based methods tend to disregard long-range dependencies, while Transformer-based methods may overlook local contextual information. To address these deficiencies, we propose a novel Feature Aggregation Module (FAM) designed to process two input features from the preceding layer. These features are seamlessly directed into two branches of the Convolution and Cross-Attention Parallel Module (CCAPM) to endow them with different roles in each of the two branches to help establish a strong connection between the two input features. This strategy enables our module to focus concurrently on both long-range dependencies and local contextual information by judiciously merging convolution operations with cross-attention mechanisms. 
Moreover, by integrating FAM within our proposed Spatial-Channel Regulation Module (SCRM), the ability to discern salient regions and informative features warranting increased attention is enhanced. Furthermore, by incorporating the SCRM into the encoder block of the UNet architecture, we introduce a novel framework dubbed Spatial-Channel Regulation Network (SCRNet). The results of our extensive experiments demonstrate the superiority of SCRNet, which consistently achieves state-of-the-art (SOTA) performance compared to existing methods.
[103] PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
Chunji Lv,Zequn Chen,Donglin Di,Weinan Zhang,Hao Li,Wei Chen,Changsheng Li
Main category: cs.CV
TL;DR: PhysGM is a fast and effective framework for 4D motion synthesis, combining 3D Gaussian representations with physics prediction for high-fidelity results.
Details
Motivation: Current 3D motion synthesis methods rely on pre-reconstructed representations and either inflexible physical attributes or unstable video model guidance. Method: PhysGM jointly predicts 3D Gaussian representations and physical properties using a base model optimized with Direct Preference Optimization (DPO) and refined with reference videos. Result: PhysGM achieves high-fidelity 4D simulations from a single image in one minute, significantly faster than previous approaches. Conclusion: PhysGM is a new framework that enables fast and accurate 4D simulations from a single image by integrating physical properties into 3D Gaussian representations. Abstract: While physics-grounded 3D motion synthesis has seen significant progress, current methods face critical limitations. They typically rely on pre-reconstructed 3D Gaussian Splatting (3DGS) representations, while physics integration depends on either inflexible, manually defined physical attributes or unstable, optimization-heavy guidance from video models. To overcome these challenges, we introduce PhysGM, a feed-forward framework that jointly predicts a 3D Gaussian representation and its physical properties from a single image, enabling immediate, physical simulation and high-fidelity 4D rendering. We first establish a base model by jointly optimizing for Gaussian reconstruction and probabilistic physics prediction. The model is then refined with physically plausible reference videos to enhance both rendering fidelity and physics prediction accuracy. We adopt the Direct Preference Optimization (DPO) to align its simulations with reference videos, circumventing Score Distillation Sampling (SDS) optimization which needs back-propagating gradients through the complex differentiable simulation and rasterization. To facilitate the training, we introduce a new dataset PhysAssets of over 24,000 3D assets, annotated with physical properties and corresponding guiding videos. 
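The preference-alignment step mentioned in the abstract can be illustrated with the standard Direct Preference Optimization objective for a single preferred/rejected pair. This is a minimal sketch only: the abstract does not say how PhysGM scores a simulation against a reference video, so the log-probability inputs, the `beta` value, and the function name `dpo_loss` are all assumptions made for illustration.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair, given log-probabilities.

    logp_*     : log-probability under the model being trained (hypothetical inputs)
    ref_logp_* : log-probability under a frozen reference model
    """
    # Implicit reward margin: how much more strongly the trained model
    # prefers the winner over the loser than the reference model does.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# No preference learned yet: the loss equals log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# The model now favours the preferred sample, so the loss drops.
better = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

Because the loss depends only on likelihood ratios against a reference model, no gradient has to flow through a simulator or rasterizer, which is the property the abstract contrasts with Score Distillation Sampling.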
Experimental results demonstrate that our method effectively generates high-fidelity 4D simulations from a single image in one minute. This represents a significant speedup over prior works while delivering realistic rendering results. Our project page is at: https://hihixiaolv.github.io/PhysGM.github.io/
[104] DIME-Net: A Dual-Illumination Adaptive Enhancement Network Based on Retinex and Mixture-of-Experts
Ziang Wang,Xiaoqin Wang,Dingyi Wang,Qiang Li,Shushan Qiao
Main category: cs.CV
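TL;DR: This paper proposes DIME-Net, a dual-illumination enhancement framework that addresses image quality degradation under complex lighting conditions such as low-light and backlit scenes. DIME-Net uses a Mixture-of-Experts model with a sparse gating mechanism to adaptively select suitable S-curve expert networks, and combines this with Retinex theory for illumination estimation and enhancement. A damage restoration module is also designed to correct illumination-induced artifacts and color distortions. A hybrid illumination dataset, MixBL, is constructed, and the model is shown to perform well on both synthetic and real-world datasets.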
Details
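Motivation: Complex lighting conditions such as low-light and backlit scenes degrade image quality and affect downstream vision tasks. Most existing methods focus on a single type of illumination degradation and cannot handle multiple lighting conditions in a unified way. The authors therefore propose DIME-Net, which aims to address multiple kinds of illumination degradation within a unified framework and to improve image quality and model generalization. Method: The paper proposes DIME-Net, a dual-illumination enhancement framework whose core is a Mixture-of-Experts illumination estimator module that adaptively selects suitable S-curve expert networks through a sparse gating mechanism. Combined with Retinex theory, this module enhances both low-light and backlit images. A damage restoration module, equipped with Illumination-Aware Cross Attention and Sequential-State Global Attention mechanisms, is also designed to correct illumination-induced artifacts and color distortions. To train and evaluate the model, the authors construct a hybrid illumination dataset, MixBL. Result: Experiments show that DIME-Net performs well on both synthetic and real-world low-light and backlit datasets and adapts to different lighting conditions without retraining. The model shows good generalization and robustness on image enhancement tasks, demonstrating its potential for practical multimedia applications under complex lighting. Conclusion: DIME-Net is an effective dual-illumination enhancement framework that handles low-light and backlit degradation in a unified way. Through the design of the Mixture-of-Experts module and the damage restoration module, together with the constructed MixBL dataset, the model achieves good enhancement and generalization across lighting conditions and is suitable for practical multimedia applications. Abstract: Image degradation caused by complex lighting conditions such as low-light and backlit scenarios is commonly encountered in real-world environments, significantly affecting image quality and downstream vision tasks. Most existing methods focus on a single type of illumination degradation and lack the ability to handle diverse lighting conditions in a unified manner. To address this issue, we propose a dual-illumination enhancement framework called DIME-Net. The core of our method is a Mixture-of-Experts illumination estimator module, where a sparse gating mechanism adaptively selects suitable S-curve expert networks based on the illumination characteristics of the input image. By integrating Retinex theory, this module effectively performs enhancement tailored to both low-light and backlit images. To further correct illumination-induced artifacts and color distortions, we design a damage restoration module equipped with Illumination-Aware Cross Attention and Sequential-State Global Attention mechanisms. In addition, we construct a hybrid illumination dataset, MixBL, by integrating existing datasets, allowing our model to achieve robust illumination adaptability through a single training process.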
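As a rough illustration of how sparse gating over S-curve experts could combine with a Retinex decomposition, the sketch below processes a single pixel. Everything specific in it is assumed for illustration and is not taken from the paper: the logistic form of the S-curve, the three `(gain, midpoint)` expert settings, the top-k gate, and the function names.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def s_curve(illum, gain, mid):
    """Hypothetical S-shaped tone curve applied to illumination in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-gain * (illum - mid)))

# Hypothetical experts, each an S-curve with its own (gain, midpoint).
EXPERTS = [(6.0, 0.2), (8.0, 0.5), (6.0, 0.8)]

def sparse_gate(logits, k=1):
    """Keep the top-k gate logits, renormalise them, and zero out the rest."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    gate = [0.0] * len(logits)
    for i, w in zip(top, weights):
        gate[i] = w
    return gate

def enhance_pixel(pixel, illum, gate_logits, k=1, eps=1e-6):
    # Retinex model: pixel = reflectance * illumination.
    reflectance = pixel / max(illum, eps)
    gate = sparse_gate(gate_logits, k)
    # Only the selected experts are evaluated (the point of sparse gating).
    new_illum = sum(g * s_curve(illum, *EXPERTS[i])
                    for i, g in enumerate(gate) if g > 0.0)
    return min(1.0, reflectance * new_illum)

# A dark pixel routed to the first (low-midpoint) expert is brightened.
brightened = enhance_pixel(0.1, 0.2, [2.0, 0.5, -1.0])
```

In a trained network the gate logits would be predicted from the input image; here they are passed in by hand to keep the example self-contained.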
Experimental results show that DIME-Net achieves competitive performance on both synthetic and real-world low-light and backlit datasets without any retraining. These results demonstrate its generalization ability and potential for practical multimedia applications under diverse and complex illumination conditions.
[105] ViT-FIQA: Assessing Face Image Quality using Vision Transformers
Andrea Atzori,Fadi Boutros,Naser Damer
Main category: cs.CV
TL;DR: This paper introduces ViT-FIQA, a novel approach using Vision Transformers for Face Image Quality Assessment, which demonstrates superior performance compared to existing methods.
Details
Motivation: The motivation is to explore the potential of Vision Transformer architectures for Face Image Quality Assessment, as existing methods mainly rely on convolutional neural networks. Method: The paper proposes ViT-FIQA, which extends standard Vision Transformer backbones with a learnable quality token to predict the utility of face images. It uses global self-attention to aggregate contextual information and branches into two heads for face representation learning and utility prediction. Result: Extensive experiments show that ViT-FIQA consistently achieves top-tier performance across challenging benchmarks and various face recognition models, including both CNN- and ViT-based architectures. Conclusion: The paper concludes that ViT-FIQA is an effective method for face image quality assessment, highlighting the potential of Vision Transformer architectures in this field. Abstract: Face Image Quality Assessment (FIQA) aims to predict the utility of a face image for face recognition (FR) systems. State-of-the-art FIQA methods mainly rely on convolutional neural networks (CNNs), leaving the potential of Vision Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a novel approach that extends standard ViT backbones, originally optimized for FR, through a learnable quality token designed to predict a scalar utility score for any given face image. The learnable quality token is concatenated with the standard image patch tokens, and the whole sequence is processed via global self-attention by the ViT encoders to aggregate contextual information across all patches. At the output of the backbone, ViT-FIQA branches into two heads: (1) the patch tokens are passed through a fully connected layer to learn discriminative face representations via a margin-penalty softmax loss, and (2) the quality token is fed into a regression head to learn to predict the face sample's utility. 
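The quality-token mechanism can be sketched in a few lines: an extra token joins the patch sequence, self-attention lets it aggregate information from every patch, and a regression head reads a scalar from it. This is a toy sketch under stated assumptions, not the authors' architecture: attention is single-head with identity projections, the token is prepended (the abstract only says concatenated), the regression head is a single linear layer, and the recognition head with its margin-penalty softmax loss is omitted.

```python
import math

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out

def forward(patch_tokens, quality_token, reg_weights, reg_bias):
    # Join the learnable quality token with the image patch tokens.
    sequence = [quality_token] + patch_tokens
    encoded = self_attention(sequence)
    quality_out, patch_out = encoded[0], encoded[1:]
    # Regression head: scalar utility score read from the quality token only.
    score = sum(w * x for w, x in zip(reg_weights, quality_out)) + reg_bias
    # patch_out would feed the face-representation head (not modelled here).
    return patch_out, score

patches = [[1.0, 0.0], [0.0, 1.0]]
patch_out, score = forward(patches, [0.0, 0.0], reg_weights=[1.0, 1.0], reg_bias=0.0)
```

With an all-zero quality token the attention weights are uniform, so its output is the mean of the three tokens, which makes the aggregation across patches easy to see.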
Extensive experiments on challenging benchmarks and several FR models, including both CNN- and ViT-based architectures, demonstrate that ViT-FIQA consistently achieves top-tier performance. These results underscore the effectiveness of transformer-based architectures in modeling face image utility and highlight the potential of ViTs as a scalable foundation for future FIQA research. https://cutt.ly/irHlzXUC
[106] ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving
Xianda Guo,Ruijun Zhang,Yiqun Duan,Ruilin Wang,Keyuan Zhou,Wenzhao Zheng,Wenke Huang,Gangwei Xu,Mike Horton,Yuan Si,Hao Zhao,Long Chen
Main category: cs.CV
TL;DR: This paper introduces a new, large-scale, diverse depth estimation dataset for dynamic outdoor driving environments, addressing limitations in existing datasets and providing a platform for advancing research.
Details
Motivation: The motivation stems from the limitations of existing depth datasets, which are approaching performance saturation and lack diversity and scalability needed for foundation models and multi-modal learning. Method: The authors propose a lightweight acquisition pipeline to create a large-scale, diverse dataset with sparse but statistically sufficient ground truth for depth estimation. Result: The dataset comprises 20K video frames, offering broad scene coverage at low cost, with benchmark experiments validating its utility and highlighting performance gaps in challenging conditions. Conclusion: The paper concludes that the introduced dataset offers greater diversity and challenges for depth estimation, establishing a new platform for advancing research in this area. Abstract: Depth estimation is a fundamental task for 3D scene understanding in autonomous driving, robotics, and augmented reality. Existing depth datasets, such as KITTI, nuScenes, and DDAD, have advanced the field but suffer from limitations in diversity and scalability. As benchmark performance on these datasets approaches saturation, there is an increasing need for a new generation of large-scale, diverse, and cost-efficient datasets to support the era of foundation models and multi-modal learning. To address these challenges, we introduce a large-scale, diverse, frame-wise continuous dataset for depth estimation in dynamic outdoor driving environments, comprising 20K video frames to evaluate existing methods. Our lightweight acquisition pipeline ensures broad scene coverage at low cost, while sparse yet statistically sufficient ground truth enables robust training. Compared to existing datasets, ours presents greater diversity in driving scenarios and lower depth density, creating new challenges for generalization. 
Benchmark experiments with standard monocular depth estimation models validate the dataset's utility and highlight substantial performance gaps in challenging conditions, establishing a new platform for advancing depth estimation research.
[107] OmViD: Omni-supervised active learning for video action detection
Aayush Rana,Akash Kumar,Vibhav Vineet,Yogesh S Rawat
Main category: cs.CV
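TL;DR: This paper proposes a video action detection method that uses an active learning strategy to select the appropriate annotation type for each video and a spatio-temporal 3D-superpixel approach to generate pseudo-labels, effectively reducing annotation cost.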
Details
Motivation: Video action detection requires dense spatio-temporal annotations, which are both challenging and expensive to obtain. It is therefore important to analyze the appropriate annotation type for each sample and its impact on detection performance. Method: The study explores annotation types ranging from video-level tags to pixel-level masks, and proposes an active learning strategy to estimate the annotation type needed for each video, together with a novel spatio-temporal 3D-superpixel approach to generate pseudo-labels. Result: Validation on the UCF101-24 and JHMDB-21 datasets shows that the method significantly cuts annotation costs with minimal performance loss. Conclusion: The paper proposes an effective active learning strategy and a spatio-temporal 3D-superpixel approach that reduce annotation cost in video action detection while maintaining performance. Abstract: Video action detection requires dense spatio-temporal annotations, which are both challenging and expensive to obtain. However, real-world videos often vary in difficulty and may not require the same level of annotation. This paper analyzes the appropriate annotation types for each sample and their impact on spatio-temporal video action detection. It focuses on two key aspects: 1) how to obtain varying levels of annotation for videos, and 2) how to learn action detection from different annotation types. The study explores video-level tags, points, scribbles, bounding boxes, and pixel-level masks. First, a simple active learning strategy is proposed to estimate the necessary annotation type for each video. Then, a novel spatio-temporal 3D-superpixel approach is introduced to generate pseudo-labels from these annotations, enabling effective training. The approach is validated on UCF101-24 and JHMDB-21 datasets, significantly cutting annotation costs with minimal performance loss.
[108] Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment
Samuel Seligardi,Pietro Musoni,Eleonora Iotti,Gianluca Contesso,Alessandro Dal Palù
Main category: cs.CV
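TL;DR: This study develops a novel physics simulation system and a deep learning model to improve the safety of pallet transportation while reducing cost and environmental impact.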
Details
Motivation: With rising demand in the logistics sector, the development of automated systems has become crucial; at the same time, the widespread use of plastic wrapping has motivated researchers to explore eco-friendly alternatives that still meet safety standards. Method: A simulation system with a 3D graphics-based virtual environment that supports a wide range of configurations is developed, and a deep neural network is trained to evaluate the simulated videos as a crash-test predictor for pallet configurations. Result: A controllable and accurate physical simulation system that replicates the behavior of moving pallets was created, and the deep learning model enhances its safety analysis capability. Conclusion: The paper presents a physics-based simulation system and a deep neural network model that reduce the need for physical testing, cutting costs and environmental impact while improving the accuracy of safety analysis. Abstract: The design and analysis of pallet setups are essential for ensuring the safety of package transportation. With rising demands in the logistics sector, the development of automated systems utilizing advanced technologies has become increasingly crucial. Moreover, the widespread use of plastic wrapping has motivated researchers to investigate eco-friendly alternatives that still adhere to safety standards. We present a fully controllable and accurate physical simulation system capable of replicating the behavior of moving pallets. It features a 3D graphics-based virtual environment that supports a wide range of configurations, including variable package layouts, different wrapping materials, and diverse dynamic conditions. This innovative approach reduces the need for physical testing, cutting costs and environmental impact while improving measurement accuracy for analyzing pallet dynamics. Additionally, we train a deep neural network to evaluate the rendered videos generated by our simulator, as a crash-test predictor for pallet configurations, further enhancing the system's utility in safety analysis.
[109] Self-Supervised Sparse Sensor Fusion for Long Range Perception
Edoardo Palladin,Samuel Brucker,Filippo Ghilotti,Praveen Narayanan,Mario Bijelic,Felix Heide
Main category: cs.CV
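TL;DR: This paper proposes a new long-range perception method for autonomous driving that significantly extends perception distance and improves performance.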
Details
Motivation: Autonomous driving outside cities requires safe, long-distance travel at high speed on highways, which places higher demands on perception distance. Conventional methods are limited by memory and compute costs at long range, so more efficient perception methods are needed. Method: A sparse representation is used, with an efficient 3D encoding and a self-supervised pre-training scheme, to enable long-range perception. Result: Perception distance is extended to 250 meters, object detection mAP improves by 26.6%, and Chamfer Distance in LiDAR forecasting decreases by 30.5%. Conclusion: The paper proposes an efficient 3D encoding of multi-modal and temporal features built on a sparse representation, combined with a self-supervised pre-training scheme; it achieves a perception range of 250 meters and significantly improves object detection and LiDAR forecasting performance. Abstract: Outside of urban hubs, autonomous cars and trucks have to master driving on intercity highways. Safe, long-distance highway travel at speeds exceeding 100 km/h demands perception distances of at least 250 m, which is about five times the 50-100 m typically addressed in city driving, to allow sufficient planning and braking margins. Increasing the perception ranges also allows autonomy to be extended from light two-ton passenger vehicles to large-scale forty-ton trucks, which need a longer planning horizon due to their high inertia. However, most existing perception approaches focus on shorter ranges and rely on Bird's Eye View (BEV) representations, which incur quadratic increases in memory and compute costs as distance grows. To overcome this limitation, we built on top of a sparse representation and introduced an efficient 3D encoding of multi-modal and temporal features, along with a novel self-supervised pre-training scheme that enables large-scale learning from unlabeled camera-LiDAR data. Our approach extends perception distances to 250 meters and achieves a 26.6% improvement in mAP in object detection and a decrease of 30.5% in Chamfer Distance in LiDAR forecasting compared to existing methods, reaching distances up to 250 meters. Project Page: https://light.princeton.edu/lrs4fusion/
[110] ResPlan: A Large-Scale Vector-Graph Dataset of 17,000 Residential Floor Plans
Mohamed Abouagour,Eleftherios Garyfallidis
Main category: cs.CV
TL;DR: ResPlan is a new large-scale dataset of 17,000 realistic residential floor plans designed to advance spatial AI research, offering improved visual fidelity, structural diversity, and versatility for various applications.
Details
Motivation: The motivation is to advance spatial AI research by addressing key limitations of existing datasets, such as RPLAN and MSD, by offering enhanced visual fidelity and greater structural diversity in residential layouts. Method: The paper introduces ResPlan, a large-scale dataset of 17,000 structurally rich and realistic residential floor plans, with annotations of architectural elements and functional spaces. The dataset is provided in geometric and graph-based formats and includes an open-source pipeline for geometry cleaning, alignment, and annotation refinement. Result: ResPlan includes 17,000 detailed residential floor plans with precise annotations, structured representations of room connectivity, and support for a wide range of applications including robotics, reinforcement learning, generative AI, virtual and augmented reality, simulations, and game development. Conclusion: ResPlan offers a significant advance in scale, realism, and usability, providing a robust foundation for developing and benchmarking next-generation spatial intelligence systems. Abstract: We introduce ResPlan, a large-scale dataset of 17,000 detailed, structurally rich, and realistic residential floor plans, created to advance spatial AI research. Each plan includes precise annotations of architectural elements (walls, doors, windows, balconies) and functional spaces (such as kitchens, bedrooms, and bathrooms). ResPlan addresses key limitations of existing datasets such as RPLAN (Wu et al., 2019) and MSD (van Engelenburg et al., 2024) by offering enhanced visual fidelity and greater structural diversity, reflecting realistic and non-idealized residential layouts. Designed as a versatile, general-purpose resource, ResPlan supports a wide range of applications including robotics, reinforcement learning, generative AI, virtual and augmented reality, simulations, and game development. 
Plans are provided in both geometric and graph-based formats, enabling direct integration into simulation engines and fast 3D conversion. A key contribution is an open-source pipeline for geometry cleaning, alignment, and annotation refinement. Additionally, ResPlan includes structured representations of room connectivity, supporting graph-based spatial reasoning tasks. Finally, we present comparative analyses with existing benchmarks and outline several open benchmark tasks enabled by ResPlan. Ultimately, ResPlan offers a significant advance in scale, realism, and usability, providing a robust foundation for developing and benchmarking next-generation spatial intelligence systems.
[111] Online 3D Gaussian Splatting Modeling with Novel View Selection
Byeonggwon Lee,Junkyu Park,Khang Truong Giang,Soohwan Song
Main category: cs.CV
TL;DR: This study proposes a novel method for online 3D Gaussian Splatting modeling that improves reconstruction completeness by adaptively selecting optimal non-keyframes and integrating them with keyframes, outperforming existing methods in complex outdoor environments.
Details
Motivation: Existing methods rely solely on keyframes, which leads to incomplete reconstructions. Incorporating frames from diverse viewpoints is necessary for building a generalizable model, but online processing imposes restrictions on the number of frames and training iterations. Method: The method uses adaptive view selection to identify optimal non-keyframes for additional training, integrating both keyframes and selected non-keyframes. It also incorporates an online multi-view stereo approach to ensure consistency in 3D information during modeling. Result: Experimental results show that the proposed method surpasses state-of-the-art approaches, particularly in handling complex outdoor scenes. Conclusion: The proposed method for 3D Gaussian Splatting modeling effectively improves model completeness and outperforms existing methods, especially in complex outdoor scenes. Abstract: This study addresses the challenge of generating online 3D Gaussian Splatting (3DGS) models from RGB-only frames. Previous studies have employed dense SLAM techniques to estimate 3D scenes from keyframes for 3DGS model construction. However, these methods are limited by their reliance solely on keyframes, which are insufficient to capture an entire scene, resulting in incomplete reconstructions. Moreover, building a generalizable model requires incorporating frames from diverse viewpoints to achieve broader scene coverage. However, online processing restricts the use of many frames or extensive training iterations. Therefore, we propose a novel method for high-quality 3DGS modeling that improves model completeness through adaptive view selection. By analyzing reconstruction quality online, our approach selects optimal non-keyframes for additional training. By integrating both keyframes and selected non-keyframes, the method refines incomplete regions from diverse viewpoints, significantly enhancing completeness. 
We also present a framework that incorporates an online multi-view stereo approach, ensuring consistency in 3D information throughout the 3DGS modeling process. Experimental results demonstrate that our method outperforms state-of-the-art methods, delivering exceptional performance in complex outdoor scenes.
[112] Backdooring Self-Supervised Contrastive Learning by Noisy Alignment
Tuo Chen,Jie Gui,Minjing Dong,Ju Jia,Lanting Fang,Jian Liu
Main category: cs.CV
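TL;DR: Noisy Alignment (NA) is a new data-poisoning backdoor attack that explicitly suppresses noise components in poisoned images, enabling effective attacks on encoders pre-trained with contrastive learning.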
Details
Motivation: Existing data-poisoning attacks on contrastive learning achieve limited efficacy because they depend on fragile implicit co-occurrence between the backdoor and the target object and suppress discriminative features in poisoned images inadequately. Method: Noisy Alignment strategically manipulates the random cropping mechanism of contrastive learning, formulating the noisy alignment process as an image layout optimization problem solved with theoretically derived optimal parameters. Result: Noisy Alignment achieves state-of-the-art performance among existing data-poisoning attacks on contrastive learning while maintaining clean-data accuracy, and it is robust against common backdoor defenses. Conclusion: Noisy Alignment is a simple yet effective data-poisoning method that achieves efficient backdoor attacks on contrastive learning. Abstract: Self-supervised contrastive learning (CL) effectively learns transferable representations from unlabeled data containing images or image-text pairs but suffers vulnerability to data poisoning backdoor attacks (DPCLs). An adversary can inject poisoned images into pretraining datasets, causing compromised CL encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs, however, achieve limited efficacy due to their dependence on fragile implicit co-occurrence between backdoor and target object and inadequate suppression of discriminative features in backdoored images. We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. Inspired by powerful training-controllable CL attacks, we identify and extract the critical objective of noisy alignment, adapting it effectively into data-poisoning scenarios. Our method implements noisy alignment by strategically manipulating contrastive learning's random cropping mechanism, formulating this process as an image layout optimization problem with theoretically derived optimal parameters. The resulting method is simple yet effective, achieving state-of-the-art performance compared to existing DPCLs, while maintaining clean-data accuracy. Furthermore, Noisy Alignment demonstrates robustness against common backdoor defenses. Codes can be found at https://github.com/jsrdcht/Noisy-Alignment.
[113] InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing
Shaoshu Yang,Zhe Kong,Feng Gao,Meng Cheng,Xiangyu Liu,Yong Zhang,Zhuoliang Kang,Wenhan Luo,Xunliang Cai,Ran He,Xiaoming Wei
Main category: cs.CV
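TL;DR: This paper proposes a new video dubbing approach that preserves keyframes and introduces a new generative architecture, addressing the discordant facial expressions and body movements produced by conventional techniques.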
Details
Motivation: Conventional video dubbing techniques are limited to editing the mouth region, which results in discordant facial expressions and body gestures and compromises immersion. Method: A sparse-frame video dubbing paradigm is proposed together with the InfiniteTalk architecture, which uses temporal context frames and a sampling strategy to achieve seamless transitions and fine-grained control. Result: State-of-the-art performance is achieved on the HDTF, CelebV-HQ, and EMTD datasets, with quantitative metrics confirming advantages in visual realism, emotional coherence, and full-body motion synchronization. Conclusion: InfiniteTalk enables infinite-length audio-driven video dubbing; by preserving keyframes and optimizing the control strategy, it significantly improves visual realism and emotional coherence. Abstract: Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.
[114] GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
Ken Deng,Yunhan Yang,Jingxiang Sun,Xihui Liu,Yebin Liu,Ding Liang,Yan-Pei Cao
Main category: cs.CV
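TL;DR: DetailGen3D enhances the geometric detail of 3D shapes through data-dependent flows and a token matching strategy while maintaining training efficiency.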
Details
Motivation: Modern 3D generation methods often produce outputs that lack geometric detail due to computational constraints. Method: A token matching strategy and data-dependent flows are introduced to model the coarse-to-fine transformation directly in latent space. Result: Experiments show that DetailGen3D achieves high-fidelity geometric detail synthesis and is applicable to a variety of 3D generation and reconstruction methods. Conclusion: DetailGen3D effectively enhances the geometric detail of shapes produced by 3D generation and reconstruction methods while maintaining training efficiency. Abstract: Modern 3D generation methods can rapidly create shapes from sparse or single views, but their outputs often lack geometric detail due to computational constraints. We present DetailGen3D, a generative approach specifically designed to enhance these generated 3D shapes. Our key insight is to model the coarse-to-fine transformation directly through data-dependent flows in latent space, avoiding the computational overhead of large-scale 3D generative models. We introduce a token matching strategy that ensures accurate spatial correspondence during refinement, enabling local detail synthesis while preserving global structure. By carefully designing our training data to match the characteristics of synthesized coarse shapes, our method can effectively enhance shapes produced by various 3D generation and reconstruction approaches, from single-view to sparse multi-view inputs. Extensive experiments demonstrate that DetailGen3D achieves high-fidelity geometric detail synthesis while maintaining efficiency in training.
[115] Distilled-3DGS: Distilled 3D Gaussian Splatting
Lintao Xiang,Xinkai Chen,Jianhuang Lai,Guangcong Wang
Main category: cs.CV
TL;DR: This paper introduces Distilled-3DGS, a knowledge distillation framework that reduces memory usage in 3D Gaussian Splatting while maintaining rendering quality.
Details
Motivation: 3D Gaussian Splatting (3DGS) requires a large number of Gaussians for high-fidelity rendering, leading to high memory and storage costs. This work aims to reduce these costs while maintaining quality. Method: A knowledge distillation framework is proposed, using multiple teacher models (e.g., vanilla 3DGS, noise-augmented, and dropout-regularized versions) to guide a lightweight student model. A structural similarity loss is introduced to align geometric distributions. Result: The proposed Distilled-3DGS achieves competitive rendering results with improved storage efficiency across various datasets. Conclusion: Distilled-3DGS provides a lightweight solution for 3D Gaussian Splatting by using a knowledge distillation framework, achieving high rendering quality and storage efficiency. Abstract: 3D Gaussian Splatting (3DGS) has exhibited remarkable efficacy in novel view synthesis (NVS). However, it suffers from a significant drawback: achieving high-fidelity rendering typically necessitates a large number of 3D Gaussians, resulting in substantial memory consumption and storage requirements. To address this challenge, we propose the first knowledge distillation framework for 3DGS, featuring various teacher models, including vanilla 3DGS, noise-augmented variants, and dropout-regularized versions. The outputs of these teachers are aggregated to guide the optimization of a lightweight student model. To distill the hidden geometric structure, we propose a structural similarity loss to boost the consistency of spatial geometric distributions between the student and teacher model. Through comprehensive quantitative and qualitative evaluations across diverse datasets, the proposed Distilled-3DGS, a simple yet effective framework without bells and whistles, achieves promising rendering results in both rendering quality and storage efficiency compared to state-of-the-art methods. Project page: https://distilled3dgs.github.io . 
Code: https://github.com/lt-xiang/Distilled-3DGS .
[116] Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
Omkar Thawakar,Dmitry Demidov,Ritesh Thawkar,Rao Muhammad Anwer,Mubarak Shah,Fahad Shahbaz Khan,Salman Khan
Main category: cs.CV
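TL;DR: This paper proposes a new video retrieval model and dataset that improve fine-grained video retrieval performance through cross-attention fusion.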
Details
Motivation: Standard retrieval frameworks struggle with fine-grained compositional queries and variations in temporal understanding; this work addresses those shortcomings to improve retrieval ability. Method: A new model is developed that integrates visual and textual information through Cross-Attention (CA) fusion with a grounded text encoder. Result: The proposed model achieves 71.3% Recall@1 in the visual+text setting, outperforming the previous state of the art by 3.4%. Conclusion: Through cross-attention fusion, the proposed model aligns visual and textual information precisely and achieves state-of-the-art performance, surpassing existing methods on all metrics. Abstract: Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results, surpassing existing methods on all metrics. Notably, it achieves 71.3% Recall@1 in the visual+text setting and outperforms the state-of-the-art by 3.4%, highlighting its efficacy in terms of leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model are available at: https://github.com/OmkarThawakar/BSE-CoVR
[117] LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
Chin-Yang Lin,Cheng Sun,Fu-En Yang,Min-Hung Chen,Yen-Yu Lin,Yu-Lun Liu
Main category: cs.CV
TL;DR: LongSplat is a novel framework for robust novel view synthesis from long, casually captured videos, effectively addressing challenges like pose drift, inaccurate geometry initialization, and memory limitations.