cs.CL

[1] Bilingual Word Level Language Identification for Omotic Languages

Mesay Gemeda Yigezu,Girma Yohannis Bade,Atnafu Lambebo Tonja,Olga Kolesnikova,Grigori Sidorov,Alexander Gelbukh

Main category: cs.CL

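TL;DR: This paper proposes an approach combining BERT and LSTM for bilingual language identification between Wolaita and Gofa, two languages spoken in southern Ethiopia.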

Details Motivation: Language identification is challenging in real-world settings, especially in multilingual communities. The lexical similarities and differences between Wolaita and Gofa make the task harder still. Method: The paper experiments with a BERT-based pretrained language model and an LSTM approach, and tests various combinations of approaches. Result: The combination of the BERT-based pretrained language model and the LSTM approach achieved an F1 score of 0.72 on the test set, outperforming the other approaches. Conclusion: The paper proposes combining a BERT-based pretrained language model with an LSTM approach to address bilingual language identification for Wolaita and Gofa in southern Ethiopia. Abstract: Language identification is the task of determining the languages for a given text. In many real world scenarios, text may contain more than one language, particularly in multilingual communities. Bilingual Language Identification (BLID) is the task of identifying and distinguishing between two languages in a given text. This paper presents BLID for languages spoken in the southern part of Ethiopia, namely Wolaita and Gofa. The presence of word similarities and differences between the two languages makes the language identification task challenging. To overcome this challenge, we ran experiments with various approaches. The combination of the BERT-based pretrained language model and the LSTM approach performed best, with an F1 score of 0.72 on the test set. As a result, the work will be effective in tackling unwanted social media issues and providing a foundation for further research in this area.

[2] AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs

Debdeep Sanyal,Manodeep Ray,Murari Mandal

Main category: cs.CL

TL;DR: AntiDote is a new method for training open-weight large language models to resist malicious tampering, maintaining safety and utility while significantly improving robustness against adversarial attacks.

Details Motivation: The motivation is to address the tension between advancing accessible research with open-weight LLMs and preventing their misuse through malicious fine-tuning that removes safety safeguards. Method: The study introduces AntiDote, a bi-level optimization approach that trains LLMs to resist tampering by nullifying the effects of adversarial weight additions. An auxiliary adversary hypernetwork generates malicious LoRA weights, and the defender model is trained to counteract them. Result: AntiDote demonstrated up to 27.4% greater robustness against adversarial attacks compared to existing methods, with less than 0.5% performance degradation across benchmarks like MMLU, HellaSwag, and GSM8K. Conclusion: The study concludes that AntiDote is an effective and computationally efficient method for enhancing the tamper-resistance of open-weight LLMs while maintaining their utility. Abstract: The release of open-weight large language models (LLMs) creates a tension between advancing accessible research and preventing misuse, such as malicious fine-tuning to elicit harmful content. Current safety measures struggle to preserve the general capabilities of the LLM while resisting a determined adversary with full access to the model's weights and architecture, who can use full-parameter fine-tuning to erase existing safeguards. To address this, we introduce AntiDote, a bi-level optimization procedure for training LLMs to be resistant to such tampering. AntiDote involves an auxiliary adversary hypernetwork that learns to generate malicious Low-Rank Adaptation (LoRA) weights conditioned on the defender model's internal activations. The defender LLM is then trained with an objective to nullify the effect of these adversarial weight additions, forcing it to maintain its safety alignment. We validate this approach against a diverse suite of 52 red-teaming attacks, including jailbreak prompting, latent space manipulation, and direct weight-space attacks. 
AntiDote is up to 27.4% more robust against adversarial attacks compared to both tamper-resistance and unlearning baselines. Crucially, this robustness is achieved with a minimal trade-off in utility, incurring a performance degradation of less than 0.5% across capability benchmarks including MMLU, HellaSwag, and GSM8K. Our work offers a practical and compute-efficient methodology for building open-weight models where safety is a more integral and resilient property.

[3] MVPBench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values

Yao Liang,Dongcheng Zhao,Feifei Zhao,Guobin Shen,Yuwei Wang,Dongqi Liang,Yi Zeng

Main category: cs.CL

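TL;DR: MVPBench is a new benchmark for evaluating how well large language models align with multi-dimensional value preferences across many countries. It reveals geographic and demographic disparities in alignment performance and shows that lightweight fine-tuning methods can significantly improve value alignment.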

Details Motivation: Existing benchmarks often neglect cultural and demographic diversity, leaving a limited understanding of how value alignment generalizes globally. Method: Introduces MVPBench, a new benchmark that systematically evaluates LLMs' alignment with multi-dimensional human value preferences across 75 countries. Result: Reveals substantial disparities in alignment performance across geographic and demographic lines among several state-of-the-art LLMs, and further shows that lightweight fine-tuning methods can significantly enhance value alignment. Conclusion: MVPBench is a practical foundation for global alignment, personalized value modeling, and equitable AI development. Abstract: The alignment of large language models (LLMs) with human values is critical for their safe and effective deployment across diverse user populations. However, existing benchmarks often neglect cultural and demographic diversity, leading to limited understanding of how value alignment generalizes globally. In this work, we introduce MVPBench, a novel benchmark that systematically evaluates LLMs' alignment with multi-dimensional human value preferences across 75 countries. MVPBench contains 24,020 high-quality instances annotated with fine-grained value labels, personalized questions, and rich demographic metadata, making it the most comprehensive resource of its kind to date. Using MVPBench, we conduct an in-depth analysis of several state-of-the-art LLMs, revealing substantial disparities in alignment performance across geographic and demographic lines. We further demonstrate that lightweight fine-tuning methods, such as Low-Rank Adaptation (LoRA) and Direct Preference Optimization (DPO), can significantly enhance value alignment in both in-domain and out-of-domain settings. Our findings underscore the necessity for population-aware alignment evaluation and provide actionable insights for building culturally adaptive and value-sensitive LLMs. MVPBench serves as a practical foundation for future research on global alignment, personalized value modeling, and equitable AI development.

Hoang-Trung Nguyen,Tan-Minh Nguyen,Xuan-Bach Le,Tuan-Kiet Le,Khanh-Huyen Nguyen,Ha-Thanh Nguyen,Thi-Hai-Yen Vuong,Le-Minh Nguyen

Main category: cs.CL

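TL;DR: In the COLIEE 2025 competition, the NOWJ team combined traditional information retrieval techniques with modern generative models and achieved strong results in the Legal Case Entailment task and the other tasks.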

Details Motivation: To present the NOWJ team's methodology and results across all five tasks of the COLIEE 2025 competition, with emphasis on advances in the Legal Case Entailment task (Task 2). Method: Combines traditional information retrieval techniques with modern generative models, including pre-ranking models, embedding-based semantic representations, and large language models. Result: In Task 2, the system achieved an F1 score of 0.3195 and ranked first; it also showed robust performance in the other tasks. Conclusion: Hybrid models show great potential for legal information processing and provide a valuable reference for future work. Abstract: This paper presents the methodologies and results of the NOWJ team's participation across all five tasks at the COLIEE 2025 competition, emphasizing advancements in the Legal Case Entailment task (Task 2). Our comprehensive approach systematically integrates pre-ranking models (BM25, BERT, monoT5), embedding-based semantic representations (BGE-m3, LLM2Vec), and advanced Large Language Models (Qwen-2, QwQ-32B, DeepSeek-V3) for summarization, relevance scoring, and contextual re-ranking. Specifically, in Task 2, our two-stage retrieval system combined lexical-semantic filtering with contextualized LLM analysis, achieving first place with an F1 score of 0.3195. Additionally, in other tasks--including Legal Case Retrieval, Statute Law Retrieval, Legal Textual Entailment, and Legal Judgment Prediction--we demonstrated robust performance through carefully engineered ensembles and effective prompt-based reasoning strategies. Our findings highlight the potential of hybrid models integrating traditional IR techniques with contemporary generative models, providing a valuable reference for future advancements in legal information processing.
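
The two-stage idea in this abstract (lexical filtering first, then a more expensive re-ranking pass) can be illustrated with a minimal sketch. This is not the NOWJ system: the BM25 parameters `k1` and `b` are common defaults, and `overlap_reranker` is a hypothetical stand-in for the LLM relevance scoring the abstract mentions. The toy documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def two_stage_retrieve(query, docs, rerank_fn, top_k=2):
    """Stage 1: keep the top_k BM25 candidates. Stage 2: reorder them with rerank_fn."""
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=lambda i: rerank_fn(query, docs[i]), reverse=True)

def overlap_reranker(query, doc):
    # Hypothetical placeholder for an LLM relevance score:
    # the fraction of distinct query terms that appear in the document.
    return len(set(query) & set(doc)) / len(set(query))

docs = [
    "the court dismissed the appeal".split(),
    "the tenant breached the lease agreement".split(),
    "breach of lease entitles the landlord to damages".split(),
]
query = "lease breach damages".split()
ranking = two_stage_retrieve(query, docs, overlap_reranker)
print(ranking)
```

The cheap first stage bounds how many candidates the expensive second stage has to score, which is the usual reason for splitting retrieval this way.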

[5] SciGPT: A Large Language Model for Scientific Literature Understanding and Knowledge Discovery

Fengyu She,Nan Wang,Hongfei Wu,Ziyi Wan,Jingmian Wang,Chang Wang

Main category: cs.CL

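TL;DR: SciGPT is a large language model designed for scientific literature. It uses domain adaptation techniques to improve performance on scientific tasks and aims to ease the bottleneck in synthesizing scientific knowledge.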

Details Motivation: The rapid growth of scientific literature makes it hard for researchers to synthesize knowledge efficiently, and general-purpose LLMs struggle to capture domain-specific details in science, which limits their use in interdisciplinary research. Method: SciGPT is built on the Qwen3 architecture and introduces three key techniques: low-cost domain distillation, a Sparse Mixture-of-Experts attention mechanism, and knowledge-aware adaptation that integrates domain ontologies. Result: In experiments on ScienceBench, SciGPT outperforms GPT-4o on core scientific tasks including sequence labeling, generation, and inference, and shows strong robustness on unseen scientific tasks. Conclusion: SciGPT is a domain-adapted foundation model for scientific literature understanding that addresses the shortcomings of general-purpose LLMs in scientific domains and has the potential to facilitate AI-augmented scientific discovery. Abstract: Scientific literature is growing exponentially, creating a critical bottleneck for researchers to efficiently synthesize knowledge. While general-purpose Large Language Models (LLMs) show potential in text processing, they often fail to capture scientific domain-specific nuances (e.g., technical jargon, methodological rigor) and struggle with complex scientific tasks, limiting their utility for interdisciplinary research. To address these gaps, this paper presents SciGPT, a domain-adapted foundation model for scientific literature understanding, and ScienceBench, an open source benchmark tailored to evaluate scientific LLMs. Built on the Qwen3 architecture, SciGPT incorporates three key innovations: (1) low-cost domain distillation via a two-stage pipeline to balance performance and efficiency; (2) a Sparse Mixture-of-Experts (SMoE) attention mechanism that cuts memory consumption by 55% for 32,000-token long-document reasoning; and (3) knowledge-aware adaptation integrating domain ontologies to bridge interdisciplinary knowledge gaps. Experimental results on ScienceBench show that SciGPT outperforms GPT-4o in core scientific tasks including sequence labeling, generation, and inference. It also exhibits strong robustness in unseen scientific tasks, validating its potential to facilitate AI-augmented scientific discovery.

[6] No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models

Flor Miriam Plaza-del-Arco,Paul Röttger,Nino Scherrer,Emanuele Borgonovo,Elmar Plischke,Dirk Hovy

Main category: cs.CL

TL;DR: The study finds that persona prompting's effect on false refusals in LLMs is overestimated, with model and task choices being more influential factors.

Details Motivation: To quantify the extent to which LLM personalization through persona prompting leads to unintended false refusals, as prior work suggested but did not fully measure. Method: A Monte Carlo-based method was used to evaluate the impact of 15 sociodemographic personas across 16 models, 3 tasks, and nine prompt paraphrases on false refusal rates. Result: More capable models show less impact of personas on false refusal rates. Some sociodemographic personas increased false refusals, indicating potential biases in alignment strategies or safety mechanisms, but model and task choices were more influential. Conclusion: Persona effects on false refusal rates in LLMs have been overestimated and may be influenced more by other factors such as model choice and task type rather than sociodemographic personas alone. Abstract: Large language models (LLMs) are increasingly integrated into our daily lives and personalized. However, LLM personalization might also increase unintended side effects. Recent work suggests that persona prompting can lead models to falsely refuse user requests. However, no work has fully quantified the extent of this issue. To address this gap, we measure the impact of 15 sociodemographic personas (based on gender, race, religion, and disability) on false refusal. To control for other factors, we also test 16 different models, 3 tasks (Natural Language Inference, politeness, and offensiveness classification), and nine prompt paraphrases. We propose a Monte Carlo-based method to quantify this issue in a sample-efficient manner. Our results show that as models become more capable, personas impact the refusal rate less and less. Certain sociodemographic personas increase false refusal in some models, which suggests underlying biases in the alignment strategies or safety mechanisms. However, we find that the model choice and task significantly influence false refusals, especially in sensitive content tasks. 
Our findings suggest that persona effects have been overestimated, and might be due to other factors.
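
The experimental grid described above has 15 x 16 x 3 x 9 = 6,480 persona, model, task, and paraphrase combinations, which is why a sample-based estimate is attractive. The sketch below uses plain uniform Monte Carlo sampling, which is an assumption: the abstract does not spell out the authors' estimator. `is_false_refusal` is a hypothetical deterministic stub standing in for "query the model and check whether it refused".

```python
import itertools
import random

N_PERSONAS, N_MODELS, N_TASKS, N_PARAPHRASES = 15, 16, 3, 9

def is_false_refusal(persona, model, task, paraphrase):
    # Hypothetical stub: a real study would call a model here and classify its reply.
    return (persona + model) % 10 == 0

def full_grid():
    """Every (persona, model, task, paraphrase) configuration."""
    return itertools.product(
        range(N_PERSONAS), range(N_MODELS), range(N_TASKS), range(N_PARAPHRASES)
    )

def monte_carlo_refusal_rate(n_samples, seed=0):
    """Estimate the false-refusal rate from uniformly sampled configurations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        config = (
            rng.randrange(N_PERSONAS),
            rng.randrange(N_MODELS),
            rng.randrange(N_TASKS),
            rng.randrange(N_PARAPHRASES),
        )
        hits += is_false_refusal(*config)
    return hits / n_samples

grid_size = N_PERSONAS * N_MODELS * N_TASKS * N_PARAPHRASES
exact_rate = sum(is_false_refusal(*c) for c in full_grid()) / grid_size
estimate = monte_carlo_refusal_rate(2000)
print(grid_size, round(exact_rate, 4), round(estimate, 4))
```

With a stub this cheap the exact rate can be enumerated for comparison; with real model calls only the sampled estimate would be affordable.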

[7] Culturally transmitted color categories in LLMs reflect a learning bias toward efficient compression

Nathaniel Imel,Noga Zaslavsky

Main category: cs.CL

TL;DR: LLMs can evolve human-like semantic systems, particularly in color categorization, aligning with the principles of IB-efficiency observed in human languages.

Details Motivation: Converging evidence suggests that systems of semantic categories across human languages achieve near-optimal compression via the Information Bottleneck (IB) complexity-accuracy principle. LLMs are not trained for this objective, which raises the question of their capability to evolve efficient human-like semantic systems. Method: Replicated two influential human behavioral studies with LLMs (Gemini 2.0-flash and Llama 3.3-70B-Instruct) focusing on color as a testbed. Conducted an English color-naming study and simulated cultural evolution of pseudo color-naming systems via iterated in-context language learning. Result: Gemini aligned well with the naming patterns of native English speakers and achieved a significantly high IB-efficiency score, while Llama exhibited an efficient but lower complexity system compared to English. LLMs restructured initially random systems towards greater IB-efficiency and alignment with patterns observed across the world's languages. Conclusion: LLMs are capable of evolving perceptually grounded, human-like semantic systems, driven by the same fundamental principle that governs semantic efficiency across human languages. Abstract: Converging evidence suggests that systems of semantic categories across human languages achieve near-optimal compression via the Information Bottleneck (IB) complexity-accuracy principle. Large language models (LLMs) are not trained for this objective, which raises the question: are LLMs capable of evolving efficient human-like semantic systems? To address this question, we focus on the domain of color as a key testbed of cognitive theories of categorization and replicate with LLMs (Gemini 2.0-flash and Llama 3.3-70B-Instruct) two influential human behavioral studies. 
First, we conduct an English color-naming study, showing that Gemini aligns well with the naming patterns of native English speakers and achieves a significantly high IB-efficiency score, while Llama exhibits an efficient but lower complexity system compared to English. Second, to test whether LLMs simply mimic patterns in their training data or actually exhibit a human-like inductive bias toward IB-efficiency, we simulate cultural evolution of pseudo color-naming systems in LLMs via iterated in-context language learning. We find that akin to humans, LLMs iteratively restructure initially random systems towards greater IB-efficiency and increased alignment with patterns observed across the world's languages. These findings demonstrate that LLMs are capable of evolving perceptually grounded, human-like semantic systems, driven by the same fundamental principle that governs semantic efficiency across human languages.
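
For readers unfamiliar with the complexity-accuracy tradeoff mentioned above, the IB objective is commonly written as follows in the color-naming literature. The symbols are assumptions not defined in this digest: $M$ is the speaker's intended meaning, $W$ the word chosen, $U$ the underlying color, and $q(w \mid m)$ the naming system.

```latex
\mathcal{F}_\beta\big[q(w \mid m)\big] \;=\; I_q(M; W) \;-\; \beta \, I_q(W; U), \qquad \beta \ge 1
```

Here $I_q(M;W)$ measures complexity and $I_q(W;U)$ measures accuracy. A naming system is called IB-efficient when its value of $\mathcal{F}_\beta$ is close to the minimum attainable for some $\beta$.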

[8] MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder and LLM Fusion

Kosei Uemura,David Guzmán,Quang Phuoc Nguyen,Jesujoba Oluwadara Alabi,En-shiun Annie Lee,David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: MERLIN enhances complex reasoning in low-resource languages through a novel two-stage model-stacking approach, outperforming current methods and models.

Details Motivation: The motivation is to overcome the limitations of existing methods in handling complex reasoning tasks in low-resource languages. Method: MERLIN uses a two-stage model-stacking framework with a curriculum learning strategy and adapts only a small set of DoRA weights. Result: On the AfriMGSM benchmark, MERLIN improves exact-match accuracy by +12.9 percentage points over MindMerger and outperforms GPT-4o-mini. It also yields consistent gains on MGSM and MSVAMP. Conclusion: MERLIN is effective in improving the performance of low-resource languages in complex reasoning tasks, surpassing existing methods and models like GPT-4o-mini. Abstract: Large language models excel in English but still struggle with complex reasoning in many low-resource languages (LRLs). Existing encoder-plus-decoder methods such as LangBridge and MindMerger raise accuracy on mid and high-resource languages, yet they leave a large gap on LRLs. We present MERLIN, a two-stage model-stacking framework that applies a curriculum learning strategy -- from general bilingual bitext to task-specific data -- and adapts only a small set of DoRA weights. On the AfriMGSM benchmark MERLIN improves exact-match accuracy by +12.9 pp over MindMerger and outperforms GPT-4o-mini. It also yields consistent gains on MGSM and MSVAMP (+0.9 and +2.8 pp), demonstrating effectiveness across both low and high-resource settings.

[9] Bias after Prompting: Persistent Discrimination in Large Language Models

Nivedha Sivakumar,Natalie Mackraz,Samira Khorshidi,Krishna Patel,Barry-John Theobald,Luca Zappella,Nicholas Apostoloff

Main category: cs.CL

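TL;DR: This paper studies whether biases in pre-trained large language models (LLMs) transfer to adapted models under prompt-based adaptation. It finds that biases do transfer through prompting and that existing prompt-based debiasing methods do not consistently prevent the transfer.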

Details Motivation: Prior work on the bias transfer hypothesis (BTH) may wrongly suggest that biases do not transfer from pre-trained models to adapted models; this paper tests that assumption under prompt adaptation. Method: Studies the BTH in causal models under prompt adaptation, analyzes bias correlations across demographics and tasks, and evaluates several prompt-based debiasing strategies. Result: Biases transfer through prompting and remain moderately to strongly correlated across tasks and demographics, for example gender (rho >= 0.94), age (rho >= 0.98), and religion (rho >= 0.69); prompt-based debiasing has limited effect and does not consistently reduce bias transfer. Conclusion: Correcting bias in the pre-trained model may help prevent its propagation to downstream tasks. Abstract: A dangerous assumption that can be made from prior work on the bias transfer hypothesis (BTH) is that biases do not transfer from pre-trained large language models (LLMs) to adapted models. We invalidate this assumption by studying the BTH in causal models under prompt adaptations, as prompting is an extremely popular and accessible adaptation strategy used in real-world applications. In contrast to prior work, we find that biases can transfer through prompting and that popular prompt-based mitigation methods do not consistently prevent biases from transferring. Specifically, the correlation between intrinsic biases and those after prompt adaptation remains moderate to strong across demographics and tasks -- for example, gender (rho >= 0.94) in co-reference resolution, and age (rho >= 0.98) and religion (rho >= 0.69) in question answering. Further, we find that biases remain strongly correlated when varying few-shot composition parameters, such as sample size, stereotypical content, occupational distribution and representational balance (rho >= 0.90). We evaluate several prompt-based debiasing strategies and find that different approaches have distinct strengths, but none consistently reduce bias transfer across models, tasks or demographics. These results demonstrate that correcting bias, and potentially improving reasoning ability, in intrinsic models may prevent propagation of biases to downstream tasks.
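
The rho values above measure how closely bias before prompting tracks bias after prompting. A minimal sketch of such a correlation follows, assuming rho denotes Spearman's rank correlation (the abstract does not say which coefficient is used). The bias scores are made-up illustrative numbers, not results from the paper.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-model bias scores: intrinsic vs. after prompt adaptation.
intrinsic_bias = [0.10, 0.25, 0.40, 0.55, 0.70]
prompted_bias = [0.12, 0.30, 0.28, 0.60, 0.65]
rho = spearman_rho(intrinsic_bias, prompted_bias)
print(round(rho, 3))
```

A rho near 1 means models that rank as more biased intrinsically also rank as more biased after prompting, which is the pattern the abstract reports.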

[10] Verbalized Algorithms

Supriya Lall,Christian Farrell,Hari Pathanjaly,Marko Pavic,Sarvesh Chezhian,Masataro Asai

Main category: cs.CL

TL;DR: This paper proposes verbalized algorithms (VAs), which use classical algorithms and limit LLMs to simple tasks, improving reliability in reasoning tasks like sorting and clustering.

Details Motivation: The motivation is to overcome the limitations of one-shot querying of LLMs for reasoning tasks by incorporating classical algorithms with established theoretical understanding, ensuring more reliable performance. Method: The paper introduces the verbalized algorithms (VAs) paradigm, which decomposes tasks into simple operations on natural language strings, using LLMs as a binary comparison oracle within well-known algorithms, such as bitonic sorting network. Result: The approach demonstrates effectiveness on sorting and clustering tasks, showing that VAs can reliably perform these tasks by limiting the scope of LLMs to simple operations. Conclusion: The paper concludes that the proposed verbalized algorithms (VAs) paradigm effectively leverages classical algorithms and improves the reliability of LLMs in performing reasoning tasks. Abstract: Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call verbalized algorithms (VAs), which leverage classical algorithms with established theoretical understanding. VAs decompose a task into simple elementary operations on natural language strings that they should be able to answer reliably, and limit the scope of LLMs to only those simple tasks. For example, for sorting a series of natural language strings, verbalized sorting uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., bitonic sorting network). We demonstrate the effectiveness of this approach on sorting and clustering tasks.
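
The verbalized-sorting example can be sketched as a bitonic sorting network whose only data-dependent step is a pairwise comparison. In the sketch below, `length_oracle` is a hypothetical stand-in for the LLM comparison call (it compares string lengths), and the input length must be a power of two. This is an illustration of the idea, not the authors' implementation.

```python
def bitonic_sort(items, less_than):
    """Sort with a bitonic network; less_than(a, b) is the only comparison used."""
    n = len(items)
    if n & (n - 1) != 0:
        raise ValueError("bitonic network needs a power-of-two input length")
    a = list(items)
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:           # distance between compared positions
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    lo, hi = (i, partner) if ascending else (partner, i)
                    # Swap when the pair is out of order for this direction.
                    if less_than(a[hi], a[lo]):
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

calls = []

def length_oracle(x, y):
    # Hypothetical oracle: a real system would ask an LLM "does x come before y?".
    calls.append((x, y))
    return len(x) < len(y)

fruits = ["banana", "fig", "watermelon", "kiwi", "nectarine", "mango", "honeydew", "apricot"]
result = bitonic_sort(fruits, length_oracle)
print(result, len(calls))
```

Because the network's comparison schedule is fixed in advance, the number of oracle calls depends only on the input length (24 for eight items), and the comparisons within one stage are independent of each other.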

[11] Balancing Quality and Variation: Spam Filtering Distorts Data Label Distributions

Eve Fleisig,Matthias Orlikowski,Philipp Cimiano,Dan Klein

Main category: cs.CL

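TL;DR: On tasks that require preserving label variation, existing spam filtering methods can backfire: spam annotators tend to be less random than non-spam annotators, so spam removal methods that account for label diversity are needed.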

Details Motivation: For machine learning datasets to accurately represent diverse opinions in a population, they must preserve variation in data labels while filtering out spam or low-quality responses. Method: Empirically evaluates how a range of annotator-filtering heuristics affect the preservation of variation on subjective tasks, and analyzes performance on synthetic spam. Result: Methods designed for settings where variation from a single ground-truth label is treated as noise often remove annotators who disagree rather than spam annotators, producing suboptimal tradeoffs between accuracy and label diversity. Conservative annotator-removal settings (<5%) work best; beyond that threshold, all tested methods increase the mean absolute error from the true average label. In addition, most spam annotators are distributionally indistinguishable from real annotators, and the distinguishable minority tend to give fixed answers rather than random ones. Conclusion: The intuition behind existing spam filtering methods can backfire when variation must be preserved: spam annotators tend to be less random than non-spam annotators, so spam removal methods need to account for label diversity. Abstract: For machine learning datasets to accurately represent diverse opinions in a population, they must preserve variation in data labels while filtering out spam or low-quality responses. How can we balance annotator reliability and representation? We empirically evaluate how a range of heuristics for annotator filtering affect the preservation of variation on subjective tasks. We find that these methods, designed for contexts in which variation from a single ground-truth label is considered noise, often remove annotators who disagree instead of spam annotators, introducing suboptimal tradeoffs between accuracy and label diversity. We find that conservative settings for annotator removal (<5%) are best, after which all tested methods increase the mean absolute error from the true average label. We analyze performance on synthetic spam to observe that these methods often assume spam annotators are more random than real spammers tend to be: most spammers are distributionally indistinguishable from real annotators, and the minority that are distinguishable tend to give fixed answers, not random ones. Thus, tasks requiring the preservation of variation reverse the intuition of existing spam filtering methods: spammers tend to be less random than non-spammers, so metrics that assume variation is spam fare worse. These results highlight the need for spam removal methods that account for label diversity.
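
The "mean absolute error from the true average label" can be made concrete with a small sketch. The ratings below are made up: annotator "d" is a genuine annotator who disagrees with the majority, and the sketch shows how removing that annotator shifts the per-item averages. The function names are illustrative and not taken from the paper.

```python
def item_means(labels_by_annotator, keep):
    """Mean label per item, using only the annotators listed in `keep`."""
    n_items = len(next(iter(labels_by_annotator.values())))
    means = []
    for item in range(n_items):
        votes = [labels_by_annotator[name][item] for name in keep]
        means.append(sum(votes) / len(votes))
    return means

def mean_absolute_error(estimate, truth):
    """Average absolute gap between estimated and reference per-item means."""
    return sum(abs(e - t) for e, t in zip(estimate, truth)) / len(truth)

# Hypothetical 1-5 ratings for three items from four genuine annotators.
labels = {
    "a": [1, 2, 5],
    "b": [1, 2, 4],
    "c": [2, 1, 5],
    "d": [5, 4, 1],  # minority view, not spam
}

true_means = item_means(labels, keep=["a", "b", "c", "d"])
filtered_means = item_means(labels, keep=["a", "b", "c"])  # "d" filtered as an outlier
mae_after_filtering = mean_absolute_error(filtered_means, true_means)
print(round(mae_after_filtering, 4))
```

A filter that treats disagreement as noise removes exactly the annotator who carries the minority opinion, so the filtered averages drift away from the population average.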

[12] Towards Knowledge-Aware Document Systems: Modeling Semantic Coverage Relations via Answerability Detection

Yehudit Aperstein,Alon Gottlib,Gal Benita,Alexander Apartsin

Main category: cs.CL

TL;DR: This paper introduces a framework for modeling Semantic Coverage Relations (SCR) using a QA-based approach, showing that discriminative models outperform generative ones in capturing semantic alignment across documents.

Details Motivation: Understanding how information aligns across documents is crucial for tasks like information retrieval and content alignment, regardless of stylistic or format differences. Method: A question answering (QA)-based approach was used to model Semantic Coverage Relations (SCR), involving the creation of a synthetic dataset from SQuAD through paraphrasing and selective omission of information. Result: RoBERTa-base achieved the highest accuracy at 61.4%, while the Random Forest-based model showed the best balance with a macro-F1 score of 52.9%. Conclusion: The study concludes that discriminative models outperform generative approaches in capturing semantic coverage relations, with RoBERTa-base and Random Forest models performing best in accuracy and balance, respectively. Abstract: Understanding how information is shared across documents, regardless of the format in which it is expressed, is critical for tasks such as information retrieval, summarization, and content alignment. In this work, we introduce a novel framework for modelling Semantic Coverage Relations (SCR), which classifies document pairs based on how their informational content aligns. We define three core relation types: equivalence, where both texts convey the same information using different textual forms or styles; inclusion, where one document fully contains the information of another and adds more; and semantic overlap, where each document presents partially overlapping content. To capture these relations, we adopt a question answering (QA)-based approach, using the answerability of shared questions across documents as an indicator of semantic coverage. We construct a synthetic dataset derived from the SQuAD corpus by paraphrasing source passages and selectively omitting information, enabling precise control over content overlap. This dataset allows us to benchmark generative language models and train transformer-based classifiers for SCR prediction. 
Our findings demonstrate that discriminative models significantly outperform generative approaches, with the RoBERTa-base model achieving the highest accuracy of 61.4% and the Random Forest-based model showing the best balance with a macro-F1 score of 52.9%. The results show that QA provides an effective lens for assessing semantic relations across stylistically diverse texts, offering insights into the capacity of current models to reason about information beyond surface similarity. The dataset and code developed in this study are publicly available to support reproducibility.
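
The three relation types can be read as set relations over the questions each document is able to answer. The sketch below is a deliberately simplified rule-based reading of that idea, not the trained classifiers evaluated in the paper. The "unrelated" fallback and the question identifiers are assumptions added for illustration.

```python
def coverage_relation(answerable_a, answerable_b):
    """Classify a document pair from the sets of questions each one can answer."""
    a, b = set(answerable_a), set(answerable_b)
    if a == b:
        return "equivalence"        # same information, possibly different wording
    if a < b or b < a:
        return "inclusion"          # one document covers the other and adds more
    if a & b:
        return "semantic overlap"   # partially shared content
    return "unrelated"              # fallback not among the paper's three types

print(coverage_relation({"q1", "q2"}, {"q1", "q2"}))
print(coverage_relation({"q1"}, {"q1", "q2"}))
print(coverage_relation({"q1", "q2"}, {"q2", "q3"}))
```

In practice answerability is itself predicted by a model and is noisy, which is presumably why the paper trains classifiers on these signals instead of applying exact set rules.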

[13] Toward Subtrait-Level Model Explainability in Automated Writing Evaluation

Alejandro Andrade-Lotero,Lee Becker,Joshua Southerland,Scott Hellman

Main category: cs.CL

TL;DR: The paper explores subtrait assessment using generative language models to improve the transparency of automated writing scores, showing modest correlations between human and automated scores.

Details Motivation: The motivation is to increase the transparency of automated writing scores by providing more detailed and explainable scoring mechanisms. Method: The study prototypes explainability and subtrait scoring using generative language models and evaluates correlations between human and automated subtrait and trait scores. Result: The study found a modest correlation between human subtrait and trait scores, as well as between automated and human subtrait scores. Conclusion: Subtrait assessment enhances the transparency of automated writing scores, offering detailed insights for educators and students. Abstract: Subtrait (latent-trait components) assessment presents a promising path toward enhancing transparency of automated writing scores. We prototype explainability and subtrait scoring with generative language models and show modest correlation between human subtrait and trait scores, and between automated and human subtrait scores. Our approach provides details to demystify scores for educators and students.

[14] Automatic Detection of Inauthentic Templated Responses in English Language Assessments

Yashad Samant,Lee Becker,Scott Hellman,Bradley Behan,Sarah Hughes,Joshua Southerland

Main category: cs.CL

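TL;DR: This paper introduces a new task, AuDITR, addresses it with a machine learning approach, and stresses the importance of continually updating the models.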

Details Motivation: In high-stakes English language assessments, low-skill test takers may use memorized material (templates) to fool automated scoring systems. Method: Describes a machine learning-based approach to the AuDITR task. Result: Introduces a new task: the automated detection of inauthentic, templated responses (AuDITR). Conclusion: Regularly updating models in production is essential for detecting templated responses. Abstract: In high-stakes English Language Assessments, low-skill test takers may employ memorized materials called "templates" on essay questions to "game" or fool the automated scoring system. In this study, we introduce the automated detection of inauthentic, templated responses (AuDITR) task, describe a machine learning-based approach to this task and illustrate the importance of regularly updating these models in production.

[15] So let's replace this phrase with insult... Lessons learned from generation of toxic texts with LLMs

Sergey Pletenev,Daniil Moskovskiy,Alexander Panchenko

Main category: cs.CL

TL;DR: This paper finds that LLM-generated synthetic toxic data, while promising, currently falls short of human-generated data in training effective detoxification models due to limited lexical diversity.

Details Motivation: The motivation stems from the need to explore alternatives to human-generated data for training detoxification models, especially considering the potential of modern LLMs in generating synthetic data. Method: Synthetic toxic data was generated using Llama 3 and Qwen activation-patched models based on neutral texts from ParaDetox and SST-2 datasets. The performance of models trained on this synthetic data was compared against those trained on human-generated data. Result: Models fine-tuned on synthetic data performed consistently worse than those trained on human data, with a performance drop of up to 30% in joint metrics. The main issue identified was a lack of lexical diversity in the synthetic data. Conclusion: The study concludes that while LLMs can generate synthetic toxic data, their limited lexical diversity results in subpar performance when such data is used for training detoxification models, underscoring the importance of human-generated data. Abstract: Modern Large Language Models (LLMs) are excellent at generating synthetic data. However, their performance in sensitive domains such as text detoxification has not received proper attention from the scientific community. This paper explores the possibility of using LLM-generated synthetic toxic data as an alternative to human-generated data for training models for detoxification. Using Llama 3 and Qwen activation-patched models, we generated synthetic toxic counterparts for neutral texts from ParaDetox and SST-2 datasets. Our experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data, with a drop in performance of up to 30% in joint metrics. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity. 
These findings highlight the limitations of current LLMs in this domain and emphasize the continued importance of diverse, human-annotated data for building robust detoxification systems.
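
One simple way to quantify a lexical diversity gap like the one described above is the type-token ratio: distinct words divided by total words. The abstract does not say which diversity measure the authors used, so this is an assumed illustration, and the two tiny corpora are made-up mild examples.

```python
def type_token_ratio(texts):
    """Distinct lowercase whitespace tokens divided by total tokens."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens)

# Hypothetical corpora: repetitive synthetic phrasing vs. more varied human phrasing.
synthetic = ["you are a fool", "you are a fool", "he is a fool"]
human = ["what a clown", "utterly clueless take", "you absolute muppet"]

ttr_synthetic = type_token_ratio(synthetic)
ttr_human = type_token_ratio(human)
print(ttr_synthetic, ttr_human)
```

Type-token ratio is sensitive to corpus size, so a comparison like this is only meaningful between samples of similar length.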

[16] Low-Resource Fine-Tuning for Multi-Task Structured Information Extraction with a Billion-Parameter Instruction-Tuned Model

Yu Cheng Chih,Yong Hao Hou

Main category: cs.CL

TL;DR: This paper introduces ETLCH, a small, fine-tuned language model that achieves strong performance on structured data extraction tasks with limited data and computational resources.

Details Motivation: Large language models are often impractical for smaller teams due to high computational costs and data requirements. This work investigates whether smaller models can perform reliably under low-resource, multi-task conditions. Method: The authors fine-tune a billion-parameter LLaMA-based model using low-rank adaptation on a small number of samples (a few hundred to one thousand per task) across JSON extraction, knowledge graph extraction, and named entity recognition. Result: ETLCH outperforms strong baselines on most evaluation metrics, with particularly strong performance at the lowest data scales. Conclusion: Well-tuned small models like ETLCH can provide stable, accurate structured outputs at a fraction of the computational cost, making them suitable for resource-constrained environments. Abstract: Deploying large language models (LLMs) for structured data extraction in domains such as financial compliance reporting, legal document analytics, and multilingual knowledge base construction is often impractical for smaller teams due to the high cost of running large architectures and the difficulty of preparing large, high-quality datasets. Most recent instruction-tuning studies focus on seven-billion-parameter or larger models, leaving limited evidence on whether much smaller models can work reliably under low-resource, multi-task conditions. This work presents ETLCH, a billion-parameter LLaMA-based model fine-tuned with low-rank adaptation on only a few hundred to one thousand samples per task for JSON extraction, knowledge graph extraction, and named entity recognition. Despite its small scale, ETLCH outperforms strong baselines across most evaluation metrics, with substantial gains observed even at the lowest data scale. 
These findings demonstrate that well-tuned small models can deliver stable and accurate structured outputs at a fraction of the computational cost, enabling cost-effective and reliable information extraction pipelines in resource-constrained environments.
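For readers unfamiliar with low-rank adaptation, the sketch below shows the general idea in NumPy: the pretrained weight `W` stays frozen and only two small matrices `A` and `B` are trained. It is a generic illustration, not ETLCH's training code. The dimensions, the scaling factor `alpha`, and the function name `lora_forward` are all assumptions chosen for the example.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Linear layer with a low-rank update: y = x W^T + (alpha / r) * x A^T B^T.

    W (d_out x d_in) is the frozen pretrained weight; A (r x d_in) and
    B (d_out x r) are the only trainable parameters.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# Hypothetical toy sizes, far smaller than any real model.
rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_out, d_in))       # frozen
A = rng.normal(size=(r, d_in)) * 0.01    # small random init
B = np.zeros((d_out, r))                 # zero init: the update starts as a no-op
x = rng.normal(size=(3, d_in))

full_params = d_out * d_in               # parameters in W
lora_params = r * (d_in + d_out)         # parameters in A and B
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly; training then moves only `A` and `B`, which is what keeps the fine-tuning cost low.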

[17] CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework

Jinzhong Ning,Paerhati Tulajiang,Yingying Le,Yijia Zhang,Yuanyuan Sun,Hongfei Lin,Haifeng Liu

Main category: cs.CL

TL;DR: This paper introduces RPG-MoGe and the CommonVoice-SpeechRE dataset to improve speech relation extraction, achieving superior performance by leveraging real human speech and a novel multi-order generation approach.

Details Motivation: Existing SpeechRE datasets rely on synthetic data and lack real human speech diversity. Models also suffer from rigid templates and weak semantic alignment, limiting performance. Method: The authors introduced CommonVoice-SpeechRE, a large-scale dataset with real-human speech, and proposed RPG-MoGe, a novel framework with multi-order triplet generation and CNN-based relation prediction. Result: Experiments demonstrated that the RPG-MoGe framework outperforms state-of-the-art methods on SpeechRE tasks. Conclusion: The proposed RPG-MoGe framework, along with the new CommonVoice-SpeechRE dataset, provides an effective solution for real-world SpeechRE, outperforming state-of-the-art methods. Abstract: Speech Relation Extraction (SpeechRE) aims to extract relation triplets directly from speech. However, existing benchmark datasets rely heavily on synthetic data, lacking sufficient quantity and diversity of real human speech. Moreover, existing models also suffer from rigid single-order generation templates and weak semantic alignment, substantially limiting their performance. To address these challenges, we introduce CommonVoice-SpeechRE, a large-scale dataset comprising nearly 20,000 real-human speech samples from diverse speakers, establishing a new benchmark for SpeechRE research. Furthermore, we propose the Relation Prompt-Guided Multi-Order Generative Ensemble (RPG-MoGe), a novel framework that features: (1) a multi-order triplet generation ensemble strategy, leveraging data diversity through diverse element orders during both training and inference, and (2) CNN-based latent relation prediction heads that generate explicit relation prompts to guide cross-modal alignment and accurate triplet generation. Experiments show our approach outperforms state-of-the-art methods, providing both a benchmark dataset and an effective solution for real-world SpeechRE. 
The source code and dataset are publicly available at https://github.com/NingJinzhong/SpeechRE_RPG_MoGe.

[18] Adversarial Attacks Against Automated Fact-Checking: A Survey

Fanzhen Liu,Alsharif Abuadbba,Kristen Moore,Surya Nepal,Cecile Paris,Jia Wu,Jian Yang,Quan Z. Sheng

Main category: cs.CL

TL;DR: This paper surveys adversarial attacks on automated fact-checking systems, categorizes attack methods, evaluates their impact, and highlights the need for more robust defenses.

Details Motivation: Misinformation spreads freely, and while automated fact-checking has advanced, these systems are vulnerable to adversarial attacks that manipulate claims, evidence, or claim-evidence pairs. This undermines the reliability of fact-checking models, and a comprehensive overview of the challenges is needed. Method: The paper provides a systematic review and categorization of adversarial attack methodologies targeting fact-checking systems, evaluates their impact on current models, and analyzes recent defense strategies. Result: The survey identifies key challenges such as understanding attack strategies, assessing model resilience, and improving robustness. It also highlights recent advancements in adversary-aware defenses and outlines open research questions. Conclusion: There is an urgent need to develop resilient fact-checking frameworks that can withstand adversarial manipulations to maintain high verification accuracy. Abstract: In an era where misinformation spreads freely, fact-checking (FC) plays a crucial role in verifying claims and promoting reliable information. While automated fact-checking (AFC) has advanced significantly, existing systems remain vulnerable to adversarial attacks that manipulate or generate claims, evidence, or claim-evidence pairs. These attacks can distort the truth, mislead decision-makers, and ultimately undermine the reliability of FC models. Despite growing research interest in adversarial attacks against AFC systems, a comprehensive, holistic overview of key challenges remains lacking. These challenges include understanding attack strategies, assessing the resilience of current models, and identifying ways to enhance robustness. This survey provides the first in-depth review of adversarial attacks targeting FC, categorizing existing attack methodologies and evaluating their impact on AFC systems. 
Additionally, we examine recent advancements in adversary-aware defenses and highlight open research questions that require further exploration. Our findings underscore the urgent need for resilient FC frameworks capable of withstanding adversarial manipulations in pursuit of preserving high verification accuracy.

[19] Acquiescence Bias in Large Language Models

Daniel Braun

Main category: cs.CL

TL;DR: This paper explores whether LLMs exhibit acquiescence bias like humans and finds that instead of agreeing, LLMs tend to answer 'no' across various models, tasks, and languages.

Details Motivation: Given that LLMs are influenceable by small changes in input and are trained on human-generated data, it is reasonable to assume they might exhibit a similar acquiescence bias as humans. Method: The study investigates the presence of acquiescence bias in LLMs across different models, tasks, and languages (English, German, and Polish). Result: Results indicate that LLMs have a tendency to answer 'no', deviating from the human tendency of agreeing with statements in surveys irrespective of their actual beliefs. Conclusion: Large Language Models (LLMs) display a bias towards answering 'no' regardless of whether it indicates agreement or disagreement, contrary to the acquiescence bias observed in humans. Abstract: Acquiescence bias, i.e. the tendency of humans to agree with statements in surveys, independent of their actual beliefs, is well researched and documented. Since Large Language Models (LLMs) have been shown to be very influenceable by relatively small changes in input and are trained on human-generated data, it is reasonable to assume that they could show a similar tendency. We present a study investigating the presence of acquiescence bias in LLMs across different models, tasks, and languages (English, German, and Polish). Our results indicate that, contrary to humans, LLMs display a bias towards answering no, regardless of whether it indicates agreement or disagreement.

[20] Simulating Identity, Propagating Bias: Abstraction and Stereotypes in LLM-Generated Text

Pia Sommerauer,Giulia Rambelli,Tommaso Caselli

Main category: cs.CL

TL;DR: This paper investigates how persona-prompting affects LLMs' linguistic representation of social groups, finding that it has limited impact on reducing stereotyping and may inadvertently propagate biases.

Details Motivation: The motivation is to explore how persona-prompting affects the representation of social groups by LLMs, particularly focusing on linguistic abstraction as a marker of stereotyping. Method: Using the Linguistic Expectancy Bias framework, the paper analyzes outputs from six open-weight LLMs under three prompting conditions, comparing 11 persona-driven responses to those of a generic AI assistant. The study introduces the Self-Stereo dataset from Reddit and measures abstraction through concreteness, specificity, and negation. Result: The results indicate that persona-prompting does not effectively modulate linguistic abstraction, highlighting concerns about its ecological validity and potential to perpetuate stereotypes. Conclusion: The study concludes that persona-prompting has limitations in changing linguistic abstraction in LLMs, raising concerns about its potential to propagate stereotypes even when representing marginalized groups. Abstract: Persona-prompting is a growing strategy to steer LLMs toward simulating particular perspectives or linguistic styles through the lens of a specified identity. While this method is often used to personalize outputs, its impact on how LLMs represent social groups remains underexplored. In this paper, we investigate whether persona-prompting leads to different levels of linguistic abstraction - an established marker of stereotyping - when generating short texts linking socio-demographic categories with stereotypical or non-stereotypical attributes. Drawing on the Linguistic Expectancy Bias framework, we analyze outputs from six open-weight LLMs under three prompting conditions, comparing 11 persona-driven responses to those of a generic AI assistant. To support this analysis, we introduce Self-Stereo, a new dataset of self-reported stereotypes from Reddit. We measure abstraction through three metrics: concreteness, specificity, and negation. 
Our results highlight the limits of persona-prompting in modulating abstraction in language, confirming criticisms about the ecology of personas as representative of socio-demographic groups and raising concerns about the risk of propagating stereotypes even when seemingly evoking the voice of a marginalized group.

[21] Too Helpful, Too Harmless, Too Honest or Just Right?

Gautam Siddharth Kashyap,Mark Dras,Usman Naseem

Main category: cs.CL

TL;DR: TrinityX is a new modular alignment framework that uses a calibrated, task-adaptive routing mechanism to improve the Helpfulness, Harmlessness, and Honesty of large language models.

Details Motivation: Large language models (LLMs) perform well across many NLP tasks, but aligning their outputs with the principles of Helpfulness, Harmlessness, and Honesty (HHH) remains a challenge. Existing methods usually optimize individual alignment dimensions in isolation, leading to trade-offs and inconsistent behavior. Mixture-of-Experts (MoE) architectures offer modularity, but their poorly calibrated routing limits their effectiveness in alignment tasks. Method: TrinityX uses a Mixture-of-Experts (MoE) architecture and integrates experts trained separately for each HHH dimension through a calibrated, task-adaptive routing mechanism, producing a unified, alignment-aware representation. Result: On three standard alignment benchmarks (Alpaca, BeaverTails, and TruthfulQA), TrinityX outperforms strong baselines, with relative improvements of 32.5% in win rate, 33.9% in safety score, and 28.4% in truthfulness. Memory usage and inference latency also drop by over 40% compared with prior MoE approaches. Conclusion: TrinityX is a modular alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE) within the Transformer architecture and effectively addresses the challenge of aligning LLMs with Helpfulness, Harmlessness, and Honesty (HHH). Abstract: Large Language Models (LLMs) exhibit strong performance across a wide range of NLP tasks, yet aligning their outputs with the principles of Helpfulness, Harmlessness, and Honesty (HHH) remains a persistent challenge. Existing methods often optimize for individual alignment dimensions in isolation, leading to trade-offs and inconsistent behavior. While Mixture-of-Experts (MoE) architectures offer modularity, they suffer from poorly calibrated routing, limiting their effectiveness in alignment tasks. We propose TrinityX, a modular alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE) within the Transformer architecture. TrinityX leverages separately trained experts for each HHH dimension, integrating their outputs through a calibrated, task-adaptive routing mechanism that combines expert signals into a unified, alignment-aware representation. Extensive experiments on three standard alignment benchmarks-Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty)-demonstrate that TrinityX outperforms strong baselines, achieving relative improvements of 32.5% in win rate, 33.9% in safety score, and 28.4% in truthfulness. In addition, TrinityX reduces memory usage and inference latency by over 40% compared to prior MoE-based approaches. 
Ablation studies highlight the importance of calibrated routing, and cross-model evaluations confirm TrinityX's generalization across diverse LLM backbones.

[22] CM-Align: Consistency-based Multilingual Alignment for Large Language Models

Xue Zhang,Yunlong Liang,Fandong Meng,Songming Zhang,Yufeng Chen,Jinan Xu,Jie Zhou

Main category: cs.CL

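TL;DR: This paper proposes CM-Align, which improves the alignment of multilingual large language models by raising the quality of multilingual preference data.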

Details Motivation: Existing methods introduce noise when constructing multilingual preference data, which limits alignment performance. The main problems are the uneven quality of English responses and the bias of heuristic construction methods. Method: The paper proposes a consistency-based data selection method with two parts: English reference selection and cross-lingual consistency-based multilingual preference data construction. Result: Experiments on three LLMs and three common tasks show that CM-Align is effective and outperforms existing methods. Conclusion: CM-Align improves multilingual alignment through consistency-based data selection, demonstrating the necessity of constructing high-quality preference data. Abstract: Current large language models (LLMs) generally show a significant performance gap in alignment between English and other languages. To bridge this gap, existing research typically leverages the model's responses in English as a reference to select the best/worst responses in other languages, which are then used for Direct Preference Optimization (DPO) training. However, we argue that there are two limitations in the current methods that result in noisy multilingual preference data and further limited alignment performance: 1) Not all English responses are of high quality, and using a response with low quality may mislead the alignment for other languages. 2) Current methods usually use biased or heuristic approaches to construct multilingual preference pairs. To address these limitations, we design a consistency-based data selection method to construct high-quality multilingual preference data for improving multilingual alignment (CM-Align). Specifically, our method includes two parts: consistency-guided English reference selection and cross-lingual consistency-based multilingual preference data construction. Experimental results on three LLMs and three common tasks demonstrate the effectiveness and superiority of our method, which further indicates the necessity of constructing high-quality preference data.

[23] LLM Ensemble for RAG: Role of Context Length in Zero-Shot Question Answering for BioASQ Challenge

Dima Galat,Diego Molla-Aliod

Main category: cs.CL

TL;DR: The paper demonstrates that combining multiple large language models through an ensemble approach with effective retrieval pipelines can achieve state-of-the-art results in biomedical question answering without domain-specific fine-tuning.

Details Motivation: Biomedical question answering requires precise interpretation of specialized knowledge from a vast and evolving corpus, posing significant challenges for traditional approaches. Method: The study uses an ensemble of large language models (LLMs) for information retrieval (IR) and aggregates their outputs to generate accurate and robust answers for a biomedical QA task. Result: The ensemble approach outperforms individual LLMs and rivals or surpasses domain-tuned systems without the need for costly fine-tuning or labeled data. Conclusion: Ensemble-based zero-shot approaches, when paired with effective RAG pipelines, offer a practical and scalable alternative to domain-tuned systems for biomedical question answering. Abstract: Biomedical question answering (QA) poses significant challenges due to the need for precise interpretation of specialized knowledge drawn from a vast, complex, and rapidly evolving corpus. In this work, we explore how large language models (LLMs) can be used for information retrieval (IR), and an ensemble of zero-shot models can accomplish state-of-the-art performance on a domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge tasks, we show that ensembles can outperform individual LLMs and in some cases rival or surpass domain-tuned systems - all while preserving generalizability and avoiding the need for costly fine-tuning or labeled data. Our method aggregates outputs from multiple LLM variants, including models from Anthropic and Google, to synthesize more accurate and robust answers. Moreover, our investigation highlights a relationship between context length and performance: while expanded contexts are meant to provide valuable evidence, they simultaneously risk information dilution and model disorientation. These findings emphasize IR as a critical foundation in Retrieval-Augmented Generation (RAG) approaches for biomedical QA systems. 
Precise, focused retrieval remains essential for ensuring LLMs operate within relevant information boundaries when generating answers from retrieved documents. Our results establish that ensemble-based zero-shot approaches, when paired with effective RAG pipelines, constitute a practical and scalable alternative to domain-tuned systems for biomedical question answering.
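The abstract does not spell out how the ensemble's outputs are aggregated, so the following is only a minimal sketch of one common choice for Yes/No questions: a majority vote over the individual model answers. The function name `majority_vote` and the tie-breaking default are assumptions made for illustration, not details from the paper.

```python
from collections import Counter

def majority_vote(answers, tie_break="no"):
    """Aggregate yes/no answers from several models by simple majority.

    `answers` holds one string per model. Anything other than "yes" or "no"
    (after trimming and lowercasing) is ignored. `tie_break` is returned
    when the yes and no counts are equal.
    """
    counts = Counter(a.strip().lower() for a in answers)
    yes, no = counts.get("yes", 0), counts.get("no", 0)
    if yes == no:
        return tie_break
    return "yes" if yes > no else "no"

# Hypothetical answers from three zero-shot models to the same question.
print(majority_vote(["Yes", "no", "yes"]))
```

A vote like this needs no labeled data or fine-tuning, which is consistent with the zero-shot setting the paper describes; weighted or confidence-based schemes would be natural alternatives.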

[24] Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications

Anran Li,Lingfei Qian,Mengmeng Du,Yu Yin,Yan Hu,Zihao Sun,Yihang Fu,Erica Stutz,Xuguang Ai,Qianqian Xie,Rui Zhu,Jimin Huang,Yifan Yang,Siru Liu,Yih-Chung Tham,Lucila Ohno-Machado,Hyunghoon Cho,Zhiyong Lu,Hua Xu,Qingyu Chen

Main category: cs.CL

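TL;DR: This study presents the first comprehensive evaluation of memorization in medical large language models, finds that memorization is significantly higher than in the general domain, and offers practical recommendations for promoting beneficial memorization and mitigating harmful memorization.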

Details Motivation: The study addresses a key question: to what extent do large language models memorize training data in medicine, including how prevalent memorization is, what is memorized, how much is memorized, and how it may affect medical applications. Method: The authors systematically analyze memorization under three common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 inpatient records from Yale New Haven Health System. Result: Memorization is prevalent across all adaptation scenarios and significantly higher than in the general domain. Memorized content includes clinical guidelines and biomedical references (beneficial memorization), repeated disclaimers or templated medical document language (uninformative memorization), and dataset-specific or sensitive clinical content (harmful memorization). Conclusion: Memorization in medical LLMs is prevalent and significantly higher than in the general domain. It falls into three categories (beneficial, uninformative, and harmful), and the study offers practical recommendations for promoting beneficial memorization, reducing uninformative memorization, and mitigating harmful memorization. Abstract: Large Language Models (LLMs) have demonstrated significant potential in medicine. To date, LLMs have been widely applied to tasks such as diagnostic assistance, medical question answering, and clinical information synthesis. However, a key open question remains: to what extent do LLMs memorize medical training data? In this study, we present the first comprehensive evaluation of memorization of LLMs in medicine, assessing its prevalence (how frequently it occurs), characteristics (what is memorized), volume (how much content is memorized), and potential downstream impacts (how memorization may affect medical applications). We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System. The results demonstrate that memorization is prevalent across all adaptation scenarios and significantly higher than reported in the general domain. Memorization affects both the development and adoption of LLMs in medicine and can be categorized into three types: beneficial (e.g., accurate recall of clinical guidelines and biomedical references), uninformative (e.g., repeated disclaimers or templated medical document language), and harmful (e.g., regeneration of dataset-specific or sensitive clinical content). 
Based on these findings, we offer practical recommendations to facilitate beneficial memorization that enhances domain-specific reasoning and factual accuracy, minimize uninformative memorization to promote deeper learning beyond surface-level patterns, and mitigate harmful memorization to prevent the leakage of sensitive or identifiable patient information.

[25] OTESGN: Optimal Transport Enhanced Syntactic-Semantic Graph Networks for Aspect-Based Sentiment Analysis

Xinfeng Liao,Xuanqi Chen,Lianxi Wang,Jiahuan Yang,Zhuowei Chen,Ziying Rong

Main category: cs.CL

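TL;DR: This paper proposes OTESGN, a new aspect-based sentiment analysis method that integrates syntactic and semantic information to improve the accuracy and noise robustness of sentiment identification.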

Details Motivation: Existing methods that rely on syntax trees and aspect-aware attention struggle to model complex semantic relationships and are easily disturbed by irrelevant words. Method: The paper proposes the Optimal Transport Enhanced Syntactic-Semantic Graph Network (OTESGN), which combines Syntactic Graph-Aware Attention with Semantic Optimal Transport Attention and uses an Adaptive Attention Fusion module together with contrastive regularization. Result: OTESGN outperforms the previous best models by +1.01% F1 on the Twitter benchmark and +1.30% F1 on Laptop14, and ablation studies and visual analyses confirm its effectiveness and robustness. Conclusion: By combining syntactic and semantic information, OTESGN identifies aspect sentiment effectively and achieves state-of-the-art performance on multiple benchmark datasets. Abstract: Aspect-based sentiment analysis (ABSA) aims to identify aspect terms and determine their sentiment polarity. While dependency trees combined with contextual semantics effectively identify aspect sentiment, existing methods relying on syntax trees and aspect-aware attention struggle to model complex semantic relationships. Their dependence on linear dot-product features fails to capture nonlinear associations, allowing noisy similarity from irrelevant words to obscure key opinion terms. Motivated by Differentiable Optimal Matching, we propose the Optimal Transport Enhanced Syntactic-Semantic Graph Network (OTESGN), which introduces a Syntactic-Semantic Collaborative Attention. It comprises a Syntactic Graph-Aware Attention for mining latent syntactic dependencies and modeling global syntactic topology, as well as a Semantic Optimal Transport Attention designed to uncover fine-grained semantic alignments amidst textual noise, thereby accurately capturing sentiment signals obscured by irrelevant tokens. An Adaptive Attention Fusion module integrates these heterogeneous features, and contrastive regularization further improves robustness. Experiments demonstrate that OTESGN achieves state-of-the-art results, outperforming previous best models by +1.01% F1 on Twitter and +1.30% F1 on Laptop14 benchmarks. Ablative studies and visual analyses corroborate its efficacy in precise localization of opinion words and noise resistance.
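As background on what an optimal-transport-based alignment looks like in practice, here is a minimal sketch of entropy-regularized optimal transport solved with Sinkhorn iterations. This is the standard textbook algorithm, not OTESGN's attention module, whose exact formulation the abstract does not give. The cosine cost, uniform marginals, `eps`, and the random toy embeddings are all assumptions for illustration.

```python
import numpy as np

def cosine_cost(X, Y):
    """Pairwise cost 1 - cosine similarity between rows of X and rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def sinkhorn(cost, eps=0.5, n_iters=500):
    """Entropy-regularized OT plan between two uniform distributions."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # source marginal (e.g. context tokens)
    b = np.full(m, 1.0 / m)          # target marginal (e.g. aspect tokens)
    K = np.exp(-cost / eps)          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):         # alternate scaling to match each marginal
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Hypothetical embeddings: 5 context tokens and 2 aspect tokens, 16 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))
aspect = rng.normal(size=(2, 16))
plan = sinkhorn(cosine_cost(tokens, aspect))
```

Each entry of `plan` says how much mass a context token sends to an aspect token. Unlike a row-wise softmax over dot products, the plan must respect both marginals at once, which is one intuition for why transport-based alignment can be less sensitive to a few noisy similarities.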

[26] X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Hyunjun Kim,Junwoo Ha,Sangyoon Yu,Haon Park

Main category: cs.CL

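TL;DR: X-Teaming Evolutionary M2S is an automated framework that discovers and optimizes M2S templates through language-model-guided evolution, producing stronger single-turn probes.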

Details Motivation: Prior Multi-turn-to-single-turn (M2S) methods relied on a handful of manually written templates and lacked automation and optimization. Method: The framework uses a language model as the judge, optimizes M2S templates with an evolutionary algorithm, and combines smart sampling from 12 sources with auditable logging. Result: The method achieves 44.8% success (103/230) on GPT-4.1 and discovers two new template families. Structural gains vary across target models, and results are poor on some of them. Conclusion: Structure-level search is an effective way to strengthen single-turn probes, and the work underscores the importance of threshold calibration and cross-model evaluation. Abstract: Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to $\theta = 0.70$, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at https://github.com/hyunjun1121/M2S-x-teaming.
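To make the reported figure concrete, the short sketch below shows how a thresholded success rate of this kind can be computed and checks that 103 successes out of 230 trials rounds to 44.8%. The function name `success_rate`, the example scores, and the choice of `>=` rather than `>` at the threshold are assumptions; the abstract only states the threshold value of 0.70.

```python
def success_rate(scores, theta=0.70):
    """Count judge scores at or above the threshold and return (hits, rate)."""
    hits = sum(score >= theta for score in scores)
    return hits, hits / len(scores)

# Hypothetical judge scores for four trials.
hits, rate = success_rate([0.9, 0.7, 0.5, 0.1])

# Arithmetic check of the figure quoted in the abstract.
reported = round(103 / 230 * 100, 1)
print(hits, rate, reported)
```

Because every trial is reduced to a pass or fail at `theta`, moving the threshold changes the headline rate directly, which is why the authors stress threshold calibration.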

[27] Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

Neil Zeghidour,Eugene Kharitonov,Manu Orsini,Václav Volhejn,Gabriel de Marmiesse,Edouard Grave,Patrick Pérez,Laurent Mazaré,Alexandre Défossez

Main category: cs.CL

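TL;DR: Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimodal sequence-to-sequence learning that moves alignment to a pre-processing step and introduces delays between streams, enabling efficient streaming inference with strong results on several tasks.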

Details Motivation: Sequence-to-sequence generation is usually performed offline. DSM is proposed to enable streaming inference, generating arbitrary output sequences from any input combination, which makes it applicable to many sequence-to-sequence problems. Method: DSM aligns the streams in a pre-processing step, introduces appropriate delays between them, and models the already time-aligned streams with a decoder-only language model. Result: DSM delivers state-of-the-art performance and latency on streaming automatic speech recognition and text-to-speech, supports arbitrarily long sequences, and is even competitive with offline baselines. Conclusion: DSM is a flexible framework for streaming, multimodal sequence-to-sequence learning that offers state-of-the-art performance and latency, supports arbitrarily long sequences, and performs well on streaming automatic speech recognition and text-to-speech. Abstract: We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence methods rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
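The idea of delaying one stream relative to another can be illustrated with a toy example. The sketch below shifts an output stream by a fixed number of steps and pairs it with the input stream step by step; it is a simplified illustration and not the DSM implementation. The token names, the `<pad>` symbol, and the helper names `delay_stream` and `align` are hypothetical.

```python
PAD = "<pad>"

def delay_stream(stream, delay, pad=PAD):
    """Shift a stream right by `delay` steps, filling the gap with padding."""
    return [pad] * delay + list(stream)

def align(input_stream, output_stream, delay, pad=PAD):
    """Pair the input stream with a delayed output stream, one pair per step."""
    delayed = delay_stream(output_stream, delay, pad)
    length = max(len(input_stream), len(delayed))
    padded_input = list(input_stream) + [pad] * (length - len(input_stream))
    padded_output = delayed + [pad] * (length - len(delayed))
    return list(zip(padded_input, padded_output))

# ASR-like setup: audio frames are the input, text tokens are delayed by 2 steps.
steps = align(["a1", "a2", "a3"], ["t1", "t2", "t3"], delay=2)
for step in steps:
    print(step)
```

With the text stream delayed, the model has already seen the corresponding audio by the time it must emit each text token, which matches the ASR case in the abstract; delaying the audio stream instead would correspond to the TTS direction.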

[28] Do All Autoregressive Transformers Remember Facts the Same Way? A Cross-Architecture Analysis of Recall Mechanisms

Minyeong Choe,Haehyun Cho,Changho Seo,Hyunil Kim

Main category: cs.CL

TL;DR: This study examines how different autoregressive Transformer models encode and retrieve factual information, revealing that Qwen-based models rely more on attention modules in early layers for factual recall, unlike GPT-style models.

Details Motivation: Understanding how Transformer-based language models store and retrieve factual associations is critical for improving interpretability and enabling targeted model editing. It is important to determine if prior findings about MLP modules in GPT-style models generalize across different autoregressive architectures. Method: The study conducted a comprehensive evaluation of factual recall across several autoregressive models, including GPT, LLaMA, Qwen, and DeepSeek, analyzing where and how factual information is encoded and accessed. Result: The analysis revealed that Qwen-based models differ from previous patterns observed in GPT-style models. Specifically, attention modules in the earliest layers contribute more to factual recall than MLP modules in these models. Conclusion: Architectural variations in autoregressive Transformer models lead to different mechanisms of factual recall, particularly noting that Qwen-based models rely more on attention modules in early layers compared to MLP modules. Abstract: Understanding how Transformer-based language models store and retrieve factual associations is critical for improving interpretability and enabling targeted model editing. Prior work, primarily on GPT-style models, has identified MLP modules in early layers as key contributors to factual recall. However, it remains unclear whether these findings generalize across different autoregressive architectures. To address this, we conduct a comprehensive evaluation of factual recall across several models -- including GPT, LLaMA, Qwen, and DeepSeek -- analyzing where and how factual information is encoded and accessed. Consequently, we find that Qwen-based models behave differently from previous patterns: attention modules in the earliest layers contribute more to factual recall than MLP modules. 
Our findings suggest that even within the autoregressive Transformer family, architectural variations can lead to fundamentally different mechanisms of factual recall.

[29] Evaluating LLMs Without Oracle Feedback: Agentic Annotation Evaluation Through Unsupervised Consistency Signals

Cheng Chen,Haiyan Yin,Ivor Tsang

Main category: cs.CL

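TL;DR: The paper proposes the CAI Ratio, a new unsupervised metric for annotation quality, in which a student model collaborates with an LLM to refine annotation quality in dynamic, unsupervised environments.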

Details Motivation: In dynamic, unsupervised environments that lack oracle feedback, existing methods for evaluating annotation quality have limited effectiveness, so a new kind of evaluation is needed. Method: The authors build a collaboration framework between a student model and an LLM, evaluate the consistency of LLM outputs through a preference-based voting strategy, and propose the CAI Ratio as a new evaluation metric. Result: The CAI Ratio shows a strong positive correlation with LLM accuracy, demonstrating its effectiveness on ten NLP datasets. Conclusion: The CAI Ratio is an effective unsupervised evaluation tool for assessing LLM annotation quality and selecting models in dynamic environments. Abstract: Large Language Models (LLMs), when paired with prompt-based tasks, have significantly reduced data annotation costs and reliance on human annotators. However, evaluating the quality of their annotations remains challenging in dynamic, unsupervised environments where oracle feedback is scarce and conventional methods fail. To address this challenge, we propose a novel agentic annotation paradigm, where a student model collaborates with a noisy teacher (the LLM) to assess and refine annotation quality without relying on oracle feedback. The student model, acting as an unsupervised feedback mechanism, employs a user preference-based majority voting strategy to evaluate the consistency of the LLM outputs. To systematically measure the reliability of LLM-generated annotations, we introduce the Consistent and Inconsistent (CAI) Ratio, a novel unsupervised evaluation metric. The CAI Ratio not only quantifies the annotation quality of the noisy teacher under limited user preferences but also plays a critical role in model selection, enabling the identification of robust LLMs in dynamic, unsupervised environments. Applied to ten open-domain NLP datasets across four LLMs, the CAI Ratio demonstrates a strong positive correlation with LLM accuracy, establishing it as an essential tool for unsupervised evaluation and model selection in real-world settings.

[30] MoVoC: Morphology-Aware Subword Construction for Geez Script Languages

Hailay Kidu Teklehaymanot,Dren Fazlija,Wolfgang Nejdl

Main category: cs.CL

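TL;DR: This paper proposes MoVoC-Tok, a tokenizer that incorporates morphological analysis to improve segmentation accuracy and linguistic fidelity for low-resource languages, and releases the accompanying datasets and tools.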

Details Motivation: Conventional subword tokenization methods struggle to preserve morphological boundaries in low-resource, morphologically complex languages, which hurts segmentation accuracy, so a tokenization method that incorporates morphological information is needed. Method: The paper proposes MoVoC (Morpheme-aware Subword Vocabulary Construction), which combines morpheme segmentation with Byte Pair Encoding (BPE), uses supervised morphological analysis to build a hybrid tokenization model, and runs experiments on four languages written in the Geez script. Result: Although the method brings no significant gain on translation tasks, it improves intrinsic metrics such as Boundary Precision and MorphoScore, indicating advantages in linguistic fidelity and token efficiency. Conclusion: The paper proposes MoVoC-Tok, a subword tokenization method that incorporates morphological analysis to improve segmentation for low-resource, morphologically complex languages such as those written in the Geez script. It does not significantly improve automatic translation quality, but it improves intrinsic linguistic metrics, and the authors release the related datasets and tools to support further research. Abstract: Subword-based tokenization methods often fail to preserve morphological boundaries, a limitation especially pronounced in low-resource, morphologically complex languages such as those written in the Geez script. To address this, we present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary. This hybrid segmentation approach combines morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. To tackle resource scarcity, we curate and release manually annotated morpheme data for four Geez script languages and a morpheme-aware vocabulary for two of them. While the proposed tokenization method does not lead to significant gains in automatic translation quality, we observe consistent improvements in intrinsic metrics, MorphoScore, and Boundary Precision, highlighting the value of morphology-aware segmentation in enhancing linguistic fidelity and token efficiency. Our morpheme-annotated datasets and tokenizer will be publicly available to support further research in low-resource, morphologically rich languages. Our code and data are available on GitHub: https://github.com/hailaykidu/MoVoC

[31] Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora

Thales Sales Almeida,Rodrigo Nogueira,Helio Pedrini

Main category: cs.CL

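TL;DR: This paper studies how to build LLM training corpora for non-English languages, using Portuguese as a case study, and shows how much language-specific filtering and data selection strategies matter for model performance.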

Details Motivation: Much research has focused on English, while the construction of training corpora for other languages remains underexplored. Method: The authors explore scalable methods for building web-based corpora and use a continual pretraining setup to study how data selection and preprocessing strategies affect model performance. Result: They build a 120B-token Portuguese corpus whose results are competitive with an industrial-grade corpus, and they demonstrate the importance of language-specific filtering and of adapting the model to the target language. Conclusion: High-quality, language-specific data is essential for developing multilingual large language models, and the proposed methods are applicable to other languages. Abstract: The performance of large language models (LLMs) is deeply influenced by the quality and composition of their training data. While much of the existing work has centered on English, there remains a gap in understanding how to construct effective training corpora for other languages. We explore scalable methods for building web-based corpora for LLMs. We apply them to build a new 120B token corpus in Portuguese that achieves competitive results to an industrial-grade corpus. Using a continual pretraining setup, we study how different data selection and preprocessing strategies affect LLM performance when transitioning a model originally trained in English to another language. Our findings demonstrate the value of language-specific filtering pipelines, including classifiers for education, science, technology, engineering, and mathematics (STEM), as well as toxic content. We show that adapting a model to the target language leads to performance improvements, reinforcing the importance of high-quality, language-specific data. While our case study focuses on Portuguese, our methods are applicable to other languages, offering insights for multilingual LLM development.

[32] Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

Joachim Baumann,Paul Röttger,Aleksandra Urman,Albert Wendsjö,Flor Miriam Plaza-del-Arco,Johannes B. Gruber,Dirk Hovy

Main category: cs.CL

TL;DR: Using large language models (LLMs) in social science research carries the risk of LLM hacking, and even state-of-the-art models cannot fully eliminate it.

Details Motivation: LLMs are rapidly transforming social science research, but their outputs can vary significantly with researchers' implementation choices; this variation can introduce systematic biases and random errors that affect downstream analyses. Method: The authors replicate 37 data annotation tasks from 21 published social science studies, analyze 13 million LLM labels produced by 18 different models, and test 2,361 realistic hypotheses to measure how researcher choices affect statistical conclusions. Result: For state-of-the-art models, roughly one in three hypotheses yields an incorrect conclusion when based on LLM-annotated data; for small language models the share rises to one half. The risk of LLM hacking decreases as effect sizes increase, which points to the need for more rigorous verification of findings near significance thresholds. Conclusion: Even state-of-the-art LLMs cannot fully eliminate the risk of LLM hacking, so findings near significance thresholds need more rigorous verification. Human annotations are crucial for reducing false positive findings and improving model selection. Abstract: Large language models (LLMs) are rapidly transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis. However, LLM outputs vary significantly depending on the implementation choices made by researchers (e.g., model selection, prompting strategy, or temperature settings). Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I, Type II, Type S, or Type M errors. We call this LLM hacking. We quantify the risk of LLM hacking by replicating 37 data annotation tasks from 21 published social science research studies with 18 different models. Analyzing 13 million LLM labels, we test 2,361 realistic hypotheses to measure how plausible researcher choices affect statistical conclusions. We find incorrect conclusions based on LLM-annotated data in approximately one in three hypotheses for state-of-the-art models, and in half the hypotheses for small language models. While our findings show that higher task performance and better general model capabilities reduce LLM hacking risk, even highly accurate models do not completely eliminate it. The risk of LLM hacking decreases as effect sizes increase, indicating the need for more rigorous verification of findings near significance thresholds. Our extensive analysis of LLM hacking mitigation techniques emphasizes the importance of human annotations in reducing false positive findings and improving model selection.
Surprisingly, common regression estimator correction techniques are largely ineffective in reducing LLM hacking risk, as they heavily trade off Type I vs. Type II errors. Beyond accidental errors, we find that intentional LLM hacking is unacceptably simple. With few LLMs and just a handful of prompt paraphrases, anything can be presented as statistically significant.

[33] A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang,Yuxin Zuo,Bingxiang He,Youbang Sun,Runze Liu,Che Jiang,Yuchen Fan,Kai Tian,Guoli Jia,Pengfei Li,Yu Fu,Xingtai Lv,Yuchen Zhang,Sihang Zeng,Shang Qu,Haozhan Li,Shijie Wang,Yuru Wang,Xinwei Long,Fangfu Liu,Xiang Xu,Jiaze Ma,Xuekai Zhu,Ermo Hua,Yihao Liu,Zonglin Li,Huayu Chen,Xiaoye Qu,Yafu Li,Weize Chen,Zhenzhao Yuan,Junqi Gao,Dong Li,Zhiyuan Ma,Ganqu Cui,Zhiyuan Liu,Biqing Qi,Ning Ding,Bowen Zhou

Main category: cs.CL

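TL;DR: This paper surveys progress in applying reinforcement learning to the reasoning abilities of large language models and discusses the challenges it faces and future directions.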

Details Motivation: RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly on complex logical tasks such as mathematics and coding, and has become a foundational methodology for transforming LLMs into LRMs. Further scaling of RL for LRMs, however, faces foundational challenges in computational resources, algorithm design, training data, and infrastructure, so it is timely to revisit the field's trajectory and explore strategies for better scalability. Method: The paper is a survey that reviews research applying RL to LLMs and LRMs for reasoning, covering foundational components, core problems, training resources, and downstream applications. Result: The paper analyzes research progress on RL for reasoning in LLMs and LRMs since the release of DeepSeek-R1 and identifies future opportunities and research directions. Conclusion: The paper summarizes recent advances in reinforcement learning (RL) for the reasoning abilities of large language models (LLMs) and discusses future directions and challenges for RL in large reasoning models (LRMs). Abstract: In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. GitHub: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs

cs.CV [Back]

[34] 3D and 4D World Modeling: A Survey

Lingdong Kong,Wesley Yang,Jianbiao Mei,Youquan Liu,Ao Liang,Dekai Zhu,Dongyue Lu,Wei Yin,Xiaotao Hu,Mingkai Jia,Junyuan Deng,Kaiwen Zhang,Yang Wu,Tianyi Yan,Shenyuan Gao,Song Wang,Linfeng Li,Liang Pan,Yong Liu,Jianke Zhu,Wei Tsang Ooi,Steven C. H. Hoi,Ziwei Liu

Main category: cs.CV

TL;DR: This survey paper introduces a structured taxonomy and comprehensive review of 3D and 4D world modeling methods, datasets, and applications, aiming to unify research in this emerging field.

Details Motivation: Prior research on world modeling mainly focuses on 2D generative methods, while the growing use of 3D and 4D representations like RGB-D, occupancy grids, and LiDAR remains underexplored. Additionally, the lack of standardized definitions has caused inconsistencies in the literature. Method: The authors systematically reviewed existing literature on 3D and 4D world modeling, proposed a structured taxonomy, and summarized relevant datasets and evaluation metrics. Result: The study presents a new taxonomy spanning VideoGen, OccGen, and LiDARGen approaches, along with a systematic summary of datasets, metrics, applications, and open challenges in 3D/4D world modeling. Conclusion: The paper provides a comprehensive survey on 3D and 4D world modeling, establishing definitions, taxonomy, datasets, and evaluation metrics while discussing applications and future research directions. Abstract: World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings.
We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey

[35] An Explainable Deep Neural Network with Frequency-Aware Channel and Spatial Refinement for Flood Prediction in Sustainable Cities

Shahid Shafi Dar,Bharat Kaurav,Arnav Jain,Chandravardhan Singh Raghaw,Mohammad Zia Ur Rehman,Nagendra Kumar

Main category: cs.CV

TL;DR: XFloodNet, a novel deep-learning framework for urban flood classification, achieves state-of-the-art results by integrating advanced attention mechanisms and adaptive multi-scale feature extraction techniques.

Details Motivation: Traditional flood detection methods are constrained by their reliance on unimodal data and static rule-based systems, which fail to capture the dynamic, non-linear relationships inherent in flood events. Method: XFloodNet integrates three components: (1) a Hierarchical Cross-Modal Gated Attention mechanism, (2) a Heterogeneous Convolutional Adaptive Multi-Scale Attention module, and (3) a Cascading Convolutional Transformer Feature Refinement technique. Result: XFloodNet achieves state-of-the-art F1-scores of 93.33%, 82.24%, and 88.60% on the Chennai Floods, Rhine18 Floods, and Harz17 Floods datasets, respectively. Conclusion: XFloodNet is a novel framework that redefines urban flood classification through advanced deep-learning techniques. It has three components that help in precise flood detection and surpasses existing methods by significant margins. Abstract: In an era of escalating climate change, urban flooding has emerged as a critical challenge for sustainable cities, threatening lives, infrastructure, and ecosystems. Traditional flood detection methods are constrained by their reliance on unimodal data and static rule-based systems, which fail to capture the dynamic, non-linear relationships inherent in flood events. Furthermore, existing attention mechanisms and ensemble learning approaches exhibit limitations in hierarchical refinement, cross-modal feature integration, and adaptability to noisy or unstructured environments, resulting in suboptimal flood classification performance. To address these challenges, we present XFloodNet, a novel framework that redefines urban flood classification through advanced deep-learning techniques. 
XFloodNet integrates three novel components: (1) a Hierarchical Cross-Modal Gated Attention mechanism that dynamically aligns visual and textual features, enabling precise multi-granularity interactions and resolving contextual ambiguities; (2) a Heterogeneous Convolutional Adaptive Multi-Scale Attention module, which leverages frequency-enhanced channel attention and frequency-modulated spatial attention to extract and prioritize discriminative flood-related features across spectral and spatial domains; and (3) a Cascading Convolutional Transformer Feature Refinement technique that harmonizes hierarchical features through adaptive scaling and cascading operations, ensuring robust and noise-resistant flood detection. We evaluate our proposed method on three benchmark datasets: Chennai Floods, Rhine18 Floods, and Harz17 Floods. XFloodNet achieves state-of-the-art F1-scores of 93.33%, 82.24%, and 88.60%, respectively, surpassing existing methods by significant margins.

[36] Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs

Hyungjin Chung,Hyelin Nam,Jiyeon Kim,Hyojun Go,Byeongjun Park,Junho Kim,Joonseok Lee,Seongsu Ha,Byung-Hoon Kim

Main category: cs.CV

TL;DR: This paper proposes VPS, an efficient inference method for VideoLLMs that improves temporal reasoning without increasing context length, achieving better performance with lower memory cost.

Details Motivation: VideoLLMs face a bottleneck where increasing input frames leads to high computational costs and performance degradation due to long context lengths. Method: VPS works by running multiple parallel inference streams, each processing a unique subset of video frames, and aggregates their outputs to integrate richer visual information. Result: Experiments showed that VPS consistently and significantly improves performance across various model architectures and scales, outperforming other parallel methods like Self-consistency. Conclusion: The proposed VPS method effectively improves the performance of VideoLLMs by expanding perceptual bandwidth without increasing context window, offering a memory-efficient and robust framework. Abstract: Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational costs and performance degradation from long context lengths. We introduce Video Parallel Scaling (VPS), an inference-time method that expands a model's perceptual bandwidth without increasing its context window. VPS operates by running multiple parallel inference streams, each processing a unique, disjoint subset of the video's frames. By aggregating the output probabilities from these complementary streams, VPS integrates a richer set of visual information than is possible with a single pass. We theoretically show that this approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, thereby improving performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion demonstrate that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g., Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.
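The aggregation step described in the VPS abstract can be illustrated with a small sketch. It is not the authors' implementation: the interleaved frame assignment, the plain averaging of probabilities, and the `toy_model_logits` stand-in for a VideoLLM forward pass are all assumptions made for illustration, since the abstract only states that streams see disjoint frame subsets and that their output probabilities are aggregated.

```python
import math


def split_frames_disjoint(num_frames, num_streams):
    """Assign frame indices to streams so that no frame appears in two streams.

    Interleaving (stream s gets frames s, s + k, s + 2k, ...) is an assumed
    choice; the abstract only requires the subsets to be unique and disjoint.
    """
    return [list(range(s, num_frames, num_streams)) for s in range(num_streams)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_streams(per_stream_logits):
    """Average next-token probabilities across streams (assumed aggregation rule)."""
    probs = [softmax(logits) for logits in per_stream_logits]
    vocab = len(probs[0])
    return [sum(p[v] for p in probs) / len(probs) for v in range(vocab)]


def toy_model_logits(frame_indices, vocab_size=4):
    """Hypothetical stand-in for one VideoLLM forward pass on a frame subset."""
    return [float(sum((i + 1) * (v + 1) for i in frame_indices) % 7)
            for v in range(vocab_size)]


subsets = split_frames_disjoint(num_frames=16, num_streams=4)
stream_logits = [toy_model_logits(s) for s in subsets]
agg = aggregate_streams(stream_logits)
next_token = max(range(len(agg)), key=agg.__getitem__)
```

Each stream keeps the same context length (here 4 frames), while the aggregated distribution draws on all 16 frames, which is the trade-off the abstract emphasizes.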

[37] Two Stage Context Learning with Large Language Models for Multimodal Stance Detection on Climate Change

Lata Pangtey,Omkar Kabde,Shahid Shafi Dar,Nagendra Kumar

Main category: cs.CV

TL;DR: This paper proposes a multimodal stance detection framework that integrates textual and visual information through a hierarchical fusion approach, achieving superior performance on the MultiClimate dataset.

Details Motivation: The motivation is driven by the increasing combination of text and visual elements in real-world social media content, which creates a need for advanced multimodal methods in stance detection. Method: The method uses a hierarchical fusion approach, combining textual and visual data through a Large Language Model for text summarization and a domain-aware image caption generator for visual interpretation. Modalities are jointly modeled using a specialized transformer module. Result: The framework achieved an accuracy of 76.2%, precision of 76.3%, recall of 76.2%, and F1-score of 76.2% on the MultiClimate dataset, demonstrating its effectiveness in multimodal stance detection. Conclusion: The proposed multimodal stance detection framework effectively integrates textual and visual information, outperforming existing state-of-the-art approaches on the MultiClimate dataset with an accuracy of 76.2%. Abstract: With the rapid proliferation of information across digital platforms, stance detection has emerged as a pivotal challenge in social media analysis. While most of the existing approaches focus solely on textual data, real-world social media content increasingly combines text with visual elements creating a need for advanced multimodal methods. To address this gap, we propose a multimodal stance detection framework that integrates textual and visual information through a hierarchical fusion approach. Our method first employs a Large Language Model to retrieve stance-relevant summaries from source text, while a domain-aware image caption generator interprets visual content in the context of the target topic. These modalities are then jointly modeled along with the reply text, through a specialized transformer module that captures interactions between the texts and images. The proposed modality fusion framework integrates diverse modalities to facilitate robust stance classification. 
We evaluate our approach on the MultiClimate dataset, a benchmark for climate change-related stance detection containing aligned video frames and transcripts. We achieve an accuracy of 76.2%, a precision of 76.3%, a recall of 76.2%, and an F1-score of 76.2%, outperforming existing state-of-the-art approaches.

[38] Two-Stage Swarm Intelligence Ensemble Deep Transfer Learning (SI-EDTL) for Vehicle Detection Using Unmanned Aerial Vehicles

Zeinab Ghasemi Darehnaei,Mohammad Shokouhifar,Hossein Yazdanjouei,S. M. J. Rastegar Fatemi

Main category: cs.CV

TL;DR: This paper proposes SI-EDTL, a novel two-stage ensemble deep transfer learning model using swarm intelligence, which achieves superior performance in detecting multiple vehicles from UAV images compared to existing methods.

Details Motivation: The motivation is to develop a more accurate and efficient method for detecting multiple vehicles in UAV images, addressing challenges such as varying vehicle sizes, occlusions, and complex backgrounds. Method: SI-EDTL combines three pre-trained Faster R-CNN feature extractor models (InceptionV3, ResNet50, GoogLeNet) with five transfer classifiers (KNN, SVM, MLP, C4.5, Naïve Bayes), resulting in 15 different base learners. These are aggregated via weighted averaging to classify regions as Car, Van, Truck, Bus, or background, with hyperparameters optimized using the whale optimization algorithm. Result: SI-EDTL outperforms existing methods on the AU-AIR UAV dataset, achieving better performance in terms of accuracy, precision, and recall. Conclusion: The paper concludes that SI-EDTL, a two-stage swarm intelligence ensemble deep transfer learning model, outperforms existing methods for detecting multiple vehicles in UAV images. Abstract: This paper introduces SI-EDTL, a two-stage swarm intelligence ensemble deep transfer learning model for detecting multiple vehicles in UAV images. It combines three pre-trained Faster R-CNN feature extractor models (InceptionV3, ResNet50, GoogLeNet) with five transfer classifiers (KNN, SVM, MLP, C4.5, Naïve Bayes), resulting in 15 different base learners. These are aggregated via weighted averaging to classify regions as Car, Van, Truck, Bus, or background. Hyperparameters are optimized with the whale optimization algorithm to balance accuracy, precision, and recall. Implemented in MATLAB R2020b with parallel processing, SI-EDTL outperforms existing methods on the AU-AIR UAV dataset.
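The ensemble structure in the SI-EDTL abstract (three feature extractors crossed with five classifiers, then weighted averaging) can be sketched as follows. The paper's implementation is in MATLAB; this Python version is only an illustration. The per-learner probabilities from `toy_probs` and the hard-coded `weights` are hypothetical placeholders, and the abstract does not say how the averaging weights are chosen.

```python
from itertools import product

EXTRACTORS = ["InceptionV3", "ResNet50", "GoogLeNet"]
CLASSIFIERS = ["KNN", "SVM", "MLP", "C4.5", "Naive Bayes"]
CLASSES = ["Car", "Van", "Truck", "Bus", "background"]

# Every (extractor, classifier) pair is one base learner: 3 x 5 = 15.
base_learners = list(product(EXTRACTORS, CLASSIFIERS))


def weighted_average(prob_rows, weights):
    """Combine per-learner class distributions with a normalized weighted mean."""
    total = sum(weights)
    num_classes = len(prob_rows[0])
    return [sum(w * row[c] for w, row in zip(weights, prob_rows)) / total
            for c in range(num_classes)]


def toy_probs(k):
    """Hypothetical class distribution for learner k on one candidate region."""
    raw = [((k + c) % 5) + 1 for c in range(len(CLASSES))]
    norm = sum(raw)
    return [r / norm for r in raw]


probs = [toy_probs(k) for k in range(len(base_learners))]
# Placeholder weights; in practice these would come from validation or tuning.
weights = [1.0 + 0.1 * k for k in range(len(base_learners))]
ensemble = weighted_average(probs, weights)
predicted = CLASSES[max(range(len(CLASSES)), key=ensemble.__getitem__)]
```

Because each input row is a probability distribution and the weights are normalized, the ensemble output is itself a distribution over the five region labels.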

[39] MCTED: A Machine-Learning-Ready Dataset for Digital Elevation Model Generation From Mars Imagery

Rafał Osadnik,Pablo Gómez,Eleni Bohacek,Rickbir Bahia

Main category: cs.CV

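TL;DR: MCTED is a new machine-learning-ready dataset for Martian digital elevation model prediction, containing 80,898 samples and released together with comprehensive processing tools and open-source code.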

Details Motivation: To address problems in existing DEM data and provide machine learning with a high-quality dataset for Martian digital elevation model prediction. Method: The dataset is generated with a comprehensive pipeline that processes pairs of high-resolution Mars orthoimages and DEMs, together with tools developed to address problems in the data. The samples are divided into training and validation splits, and experiments use a U-Net architecture. Result: The resulting dataset contains 80,898 samples, each consisting of an optical image patch, a DEM patch, and two mask patches. A U-Net trained on it outperforms monocular depth estimation models such as DepthAnythingV2. Conclusion: MCTED is a new machine learning dataset for Martian digital elevation model prediction; it has been comprehensively processed and statistically analyzed, and both the code and the dataset are open source. Abstract: This work presents MCTED, a new dataset for the Martian digital elevation model prediction task that is ready for machine learning applications. The dataset has been generated using a comprehensive pipeline designed to process high-resolution Mars orthoimage and DEM pairs from Day et al., yielding a dataset consisting of 80,898 data samples. The source images are data gathered by the Mars Reconnaissance Orbiter using the CTX instrument, providing a very diverse and comprehensive coverage of the Martian surface. Given the complexity of the processing pipelines used in large-scale DEMs, there are often artefacts and missing data points in the original data, for which we developed tools to solve or mitigate their impact. We divide the processed samples into training and validation splits, ensuring samples in both splits cover no mutual areas to avoid data leakage. Every sample in the dataset is represented by the optical image patch, DEM patch, and two mask patches, indicating values that were originally missing or were altered by us. This allows future users of the dataset to handle altered elevation regions as they please. We provide statistical insights of the generated dataset, including the spatial distribution of samples, the distributions of elevation values, slopes and more. Finally, we train a small U-Net architecture on the MCTED dataset and compare its performance to a monocular depth estimation foundation model, DepthAnythingV2, on the task of elevation prediction. We find that even a very small architecture trained specifically on this dataset beats the zero-shot performance of a depth estimation foundation model like DepthAnythingV2.
We make the dataset and code used for its generation completely open source in public repositories.

[40] APML: Adaptive Probabilistic Matching Loss for Robust 3D Point Cloud Reconstruction

Sasan Sharifipour,Constantino Álvarez Casado,Mohammad Sabokrou,Miguel Bordallo López

Main category: cs.CV

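TL;DR: This paper proposes APML, a new loss function for point cloud prediction tasks that addresses shortcomings of existing losses and offers better performance and efficiency.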

Details Motivation: Existing loss functions for point cloud prediction tasks, such as Chamfer Distance and EMD, suffer from point congestion, poor coverage, and high computational complexity, so a more efficient, differentiable loss is needed. Method: The paper proposes APML, a new loss function that applies Sinkhorn iterations to a temperature-scaled similarity matrix to achieve one-to-one point matching, and computes the temperature analytically to guarantee a minimum assignment probability. Result: On ShapeNet benchmarks and on a 3D human point cloud generation task, APML shows faster convergence, better spatial distribution, and improved quantitative performance. Conclusion: APML is a fully differentiable approximation of one-to-one matching with near-quadratic runtime. It avoids non-differentiable operations, works with state-of-the-art architectures, and yields faster convergence and better spatial distribution, especially in low-density regions, without additional hyperparameter search. Abstract: Training deep learning models for point cloud prediction tasks such as shape completion and generation depends critically on loss functions that measure discrepancies between predicted and ground-truth point sets. Commonly used functions such as Chamfer Distance (CD), HyperCD, and InfoCD rely on nearest-neighbor assignments, which often induce many-to-one correspondences, leading to point congestion in dense regions and poor coverage in sparse regions. These losses also involve non-differentiable operations due to index selection, which may affect gradient-based optimization. Earth Mover Distance (EMD) enforces one-to-one correspondences and captures structural similarity more effectively, but its cubic computational complexity limits its practical use. We propose the Adaptive Probabilistic Matching Loss (APML), a fully differentiable approximation of one-to-one matching that leverages Sinkhorn iterations on a temperature-scaled similarity matrix derived from pairwise distances. We analytically compute the temperature to guarantee a minimum assignment probability, eliminating manual tuning. APML achieves near-quadratic runtime, comparable to Chamfer-based losses, and avoids non-differentiable operations. When integrated into state-of-the-art architectures (PoinTr, PCN, FoldingNet) on ShapeNet benchmarks and on a spatiotemporal Transformer (CSI2PC) that generates 3D human point clouds from WiFi CSI measurements, APM loss yields faster convergence, superior spatial distribution, especially in low-density regions, and improved or on-par quantitative performance without additional hyperparameter search.
The code is available at: https://github.com/apm-loss/apml.

[41] Lightweight Deep Unfolding Networks with Enhanced Robustness for Infrared Small Target Detection

Jingjing Liu,Yinchao Han,Xianchao Xiu,Jianhua Zhang,Wanquan Liu

Main category: cs.CV

TL;DR: L-RPCANet is a lightweight and noise-robust framework for infrared small target detection that achieves superior performance through feature refinement, noise reduction, and channel attention.

Details Motivation: To address the challenges of parameter lightweightness and noise robustness in existing deep unfolding networks for infrared small target detection. Method: L-RPCANet uses a hierarchical bottleneck structure for channel-wise feature refinement, integrates a noise reduction module, and employs SENets for channel attention. Result: Experiments show that L-RPCANet outperforms state-of-the-art methods like RPCANet, DRPCANet, and RPCANet++ on ISTD datasets. Conclusion: The proposed L-RPCANet framework demonstrates superior performance in infrared small target detection with improved lightweightness and robustness against noise. Abstract: Infrared small target detection (ISTD) is one of the key techniques in image processing. Although deep unfolding networks (DUNs) have demonstrated promising performance in ISTD due to their model interpretability and data adaptability, existing methods still face significant challenges in parameter lightweightness and noise robustness. In this regard, we propose a highly lightweight framework based on robust principal component analysis (RPCA) called L-RPCANet. Technically, a hierarchical bottleneck structure is constructed to reduce and increase the channel dimension in the single-channel input infrared image to achieve channel-wise feature refinement, with bottleneck layers designed in each module to extract features. This reduces the number of channels in feature extraction and improves the lightweightness of network parameters. Furthermore, a noise reduction module is embedded to enhance the robustness against complex noise. In addition, squeeze-and-excitation networks (SENets) are leveraged as a channel attention mechanism to focus on the varying importance of different features across channels, thereby achieving excellent performance while maintaining both lightweightness and robustness. 
Extensive experiments on the ISTD datasets validate the superiority of our proposed method over state-of-the-art methods, including RPCANet, DRPCANet, and RPCANet++. The code will be available at https://github.com/xianchaoxiu/L-RPCANet.

[42] Sparse Transformer for Ultra-sparse Sampled Video Compressive Sensing

Miao Cao,Siming Zheng,Lishun Wang,Ziyang Chen,David Brady,Xin Yuan

Main category: cs.CV

TL;DR: The paper proposes an Ultra-Sparse Sampling (USS) method and a BSTFormer model for efficient high-speed, high-resolution video capture, outperforming existing techniques and offering practical advantages for future camera systems.

Details Motivation: Current high-resolution video processing models are unsustainable for future high-speed, high-resolution cameras, prompting the need for more power-efficient sampling and recovery methods. Method: The authors propose the Ultra-Sparse Sampling (USS) regime, develop BSTFormer, a sparse Transformer, to recover high-speed frames, and build a Digital Micro-mirror Device (DMD) encoding system to verify the strategy. Result: BSTFormer outperforms previous state-of-the-art algorithms on both simulated and real-world data. USS also provides a higher dynamic range compared to the Random Sampling strategy. Conclusion: USS strategy is a sustainable and effective method for high-speed, high-resolution video capture, offering advantages in power efficiency, dynamic range, and potential for on-chip implementation. Abstract: Digital cameras consume ~0.1 microjoule per pixel to capture and encode video, resulting in a power usage of ~20W for a 4K sensor operating at 30 fps. Imagining gigapixel cameras operating at 100-1000 fps, the current processing model is unsustainable. To address this, physical layer compressive measurement has been proposed to reduce power consumption per pixel by 10-100X. Video Snapshot Compressive Imaging (SCI) introduces high frequency modulation in the optical sensor layer to increase effective frame rate. A commonly used sampling strategy of video SCI is Random Sampling (RS) where each mask element value is randomly set to be 0 or 1. Similarly, image inpainting (I2P) has demonstrated that images can be recovered from a fraction of the image pixels. Inspired by I2P, we propose the Ultra-Sparse Sampling (USS) regime, where at each spatial location, only one sub-frame is set to 1 and all others are set to 0. We then build a Digital Micro-mirror Device (DMD) encoding system to verify the effectiveness of our USS strategy. Ideally, we can decompose the USS measurement into sub-measurements for which we can utilize I2P algorithms to recover high-speed frames.
However, due to the mismatch between the DMD and CCD, the USS measurement cannot be perfectly decomposed. To this end, we propose BSTFormer, a sparse TransFormer that utilizes local Block attention, global Sparse attention, and global Temporal attention to exploit the sparsity of the USS measurement. Extensive results on both simulated and real-world data show that our method significantly outperforms all previous state-of-the-art algorithms. Additionally, an essential advantage of the USS strategy is its higher dynamic range than that of the RS strategy. Finally, from the application perspective, the USS strategy is a good choice to implement a complete video SCI system on chip due to its fixed exposure time.
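The USS mask constraint (exactly one sub-frame set to 1 at each spatial location) and the resulting snapshot measurement can be sketched as below. This is an illustrative toy, not the authors' pipeline: choosing the active sub-frame uniformly at random with a seeded generator is an assumption, and the measurement is modeled as an ideal mask-weighted sum without the DMD/CCD mismatch the abstract mentions.

```python
import random


def uss_masks(num_frames, height, width, seed=0):
    """Build binary masks where each pixel is open in exactly one sub-frame.

    Picking the open sub-frame uniformly at random is an assumed choice.
    """
    rng = random.Random(seed)
    masks = [[[0] * width for _ in range(height)] for _ in range(num_frames)]
    for y in range(height):
        for x in range(width):
            masks[rng.randrange(num_frames)][y][x] = 1
    return masks


def snapshot(frames, masks):
    """Idealized SCI measurement: per-pixel sum of mask-modulated sub-frames."""
    num_frames = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(masks[t][y][x] * frames[t][y][x] for t in range(num_frames))
             for x in range(width)]
            for y in range(height)]


# Toy video: the value encodes (frame, row, column) so sampled pixels are traceable.
frames = [[[100 * t + 10 * y + x for x in range(4)] for y in range(4)]
          for t in range(8)]
masks = uss_masks(num_frames=8, height=4, width=4)
measurement = snapshot(frames, masks)
```

Because each measured pixel comes from a single sub-frame, the snapshot can in principle be split into per-frame sub-measurements with missing pixels, which is what makes inpainting-style recovery applicable in the ideal case.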

[43] GTA-Crime: A Synthetic Dataset and Generation Framework for Fatal Violence Detection with Adversarial Snippet-Level Domain Adaptation

Seongho Kim,Sejong Ryu,Hyoukjun You,Je Hyeong Hong

Main category: cs.CV

TL;DR: This paper presents GTA-Crime, a synthetic dataset and framework for detecting fatal incidents in videos, along with a domain adaptation method that improves real-world detection accuracy.

Details Motivation: Detecting fatal incidents like shootings and stabbings in surveillance videos is challenging due to their rarity and ethical issues in data collection. The authors aim to address this limitation by introducing a synthetic dataset and generation framework. Method: The authors introduce GTA-Crime, a dataset and generation framework based on GTA5, and propose a snippet-level domain adaptation strategy using Wasserstein adversarial training to align synthetic and real-world features. Result: Experimental results show that incorporating GTA-Crime with the proposed domain adaptation strategy consistently improves the accuracy of detecting fatal violence in real-world scenarios. Conclusion: GTA-Crime, along with the proposed domain adaptation strategy, enhances the detection of real-world fatal violence, and the dataset and framework are publicly available. Abstract: Recent advancements in video anomaly detection (VAD) have enabled identification of various criminal activities in surveillance videos, but detecting fatal incidents such as shootings and stabbings remains difficult due to their rarity and ethical issues in data collection. Recognizing this limitation, we introduce GTA-Crime, a fatal video anomaly dataset and generation framework using Grand Theft Auto 5 (GTA5). Our dataset contains fatal situations such as shootings and stabbings, captured from CCTV multiview perspectives under diverse conditions including action types, weather, time of day, and viewpoints. To address the rarity of such scenarios, we also release a framework for generating these types of videos. Additionally, we propose a snippet-level domain adaptation strategy using Wasserstein adversarial training to bridge the gap between synthetic GTA-Crime features and real-world features like UCF-Crime. 
Experimental results validate our GTA-Crime dataset and demonstrate that incorporating GTA-Crime with our domain adaptation strategy consistently enhances real-world fatal violence detection accuracy. Our dataset and the data generation framework are publicly available at https://github.com/ta-ho/GTA-Crime.

[44] RepViT-CXR: A Channel Replication Strategy for Vision Transformers in Chest X-ray Tuberculosis and Pneumonia Classification

Faisal Ahmed

Main category: cs.CV

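TL;DR: RepViT-CXR is a new channel replication strategy for detecting tuberculosis and pneumonia from chest X-rays; it outperforms existing methods and shows strong potential for real-world clinical screening systems.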

Details Motivation: Deep learning, and Vision Transformers (ViTs) in particular, has shown strong potential for automated medical image analysis, but most ViT architectures are pretrained on natural images and require three-channel inputs, whereas CXR scans are inherently grayscale. Method: The paper proposes RepViT-CXR, a channel replication strategy that converts single-channel CXR images into a ViT-compatible format without introducing additional information loss. Result: On the TB-CXR dataset the method reaches 99.9% accuracy and 99.9% AUC; on the Pediatric Pneumonia dataset it obtains 99.0% accuracy, 99.2% recall, 99.3% precision, and 99.0% AUC; on the Shenzhen TB dataset it reaches 91.1% accuracy and 91.2% AUC. Conclusion: RepViT-CXR is a new state-of-the-art method for TB and pneumonia detection from chest X-rays and shows strong potential for deployment in real-world clinical screening systems. Abstract: Chest X-ray (CXR) imaging remains one of the most widely used diagnostic tools for detecting pulmonary diseases such as tuberculosis (TB) and pneumonia. Recent advances in deep learning, particularly Vision Transformers (ViTs), have shown strong potential for automated medical image analysis. However, most ViT architectures are pretrained on natural images and require three-channel inputs, while CXR scans are inherently grayscale. To address this gap, we propose RepViT-CXR, a channel replication strategy that adapts single-channel CXR images into a ViT-compatible format without introducing additional information loss. We evaluate RepViT-CXR on three benchmark datasets. On the TB-CXR dataset, our method achieved an accuracy of 99.9% and an AUC of 99.9%, surpassing prior state-of-the-art methods such as Topo-CXR (99.3% accuracy, 99.8% AUC). For the Pediatric Pneumonia dataset, RepViT-CXR obtained 99.0% accuracy, with 99.2% recall, 99.3% precision, and an AUC of 99.0%, outperforming strong baselines including DCNN and VGG16. On the Shenzhen TB dataset, our approach achieved 91.1% accuracy and an AUC of 91.2%, marking a performance improvement over previously reported CNN-based methods. These results demonstrate that a simple yet effective channel replication strategy allows ViTs to fully leverage their representational power on grayscale medical imaging tasks. RepViT-CXR establishes a new state of the art for TB and pneumonia detection from chest X-rays, showing strong potential for deployment in real-world clinical screening systems.
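Channel replication itself is a very small operation, sketched below in pure Python. The channel-first layout and the function name `replicate_channels` are assumptions for illustration; the abstract does not describe the exact preprocessing (resizing, normalization) that accompanies it in RepViT-CXR.

```python
def replicate_channels(gray, channels=3):
    """Copy a single-channel image (list of rows) into `channels` identical planes.

    Returns a channel-first structure [channels][height][width]. Each plane is
    an independent copy, so later per-channel edits do not alias one another.
    """
    return [[row[:] for row in gray] for _ in range(channels)]


# Toy 2 x 3 grayscale "scan" with pixel intensities in [0, 255].
gray_cxr = [[0, 128, 255],
            [64, 32, 16]]
rgb_like = replicate_channels(gray_cxr)
```

No pixel value is altered or discarded, which matches the abstract's claim that the conversion introduces no additional information loss; the three planes simply satisfy the input shape expected by a ViT pretrained on RGB images.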

[45] Symmetry Interactive Transformer with CNN Framework for Diagnosis of Alzheimer's Disease Using Structural MRI

Zheng Yang,Yanteng Zhang,Xupeng Kou,Yang Liu,Chao Ren

Main category: cs.CV

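TL;DR: This paper proposes a new end-to-end network that combines a 3D CNN encoder with a Symmetry Interactive Transformer (SIT) to improve the accuracy of Alzheimer's disease diagnosis and to focus better on the asymmetric pathological features of brain atrophy regions.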

Details Motivation: Most existing studies on predicting and diagnosing Alzheimer's disease rely on pretraining or ignore the asymmetry caused by brain disorders. Method: An end-to-end network is proposed to detect the disease-related asymmetry induced by left and right brain atrophy; it consists of a 3D CNN encoder and a Symmetry Interactive Transformer (SIT). After the inter-equal grid block fetch operation, the corresponding left and right hemisphere features are aligned and then fed into the SIT for diagnostic analysis. Result: Evaluation on the ADNI dataset shows that the method's diagnostic accuracy (92.5%) is better than that of several CNN methods and of CNNs combined with a general transformer. Visualizations show that the network attends more to regions of brain atrophy, especially the asymmetric pathological features induced by AD. Conclusion: The proposed end-to-end network, which combines a 3D CNN encoder with a Symmetry Interactive Transformer (SIT), improves the accuracy of Alzheimer's disease diagnosis and attends to brain atrophy regions, particularly asymmetric pathological features, supporting the interpretability and effectiveness of the method. Abstract: Structural magnetic resonance imaging (sMRI) combined with deep learning has achieved remarkable progress in the prediction and diagnosis of Alzheimer's disease (AD). Existing studies have used CNNs and transformers to build well-performing networks, but most of them rely on pretraining or ignore the asymmetry caused by brain disorders. We propose an end-to-end network for detecting the disease-related asymmetry induced by left and right brain atrophy, which consists of a 3D CNN encoder and a Symmetry Interactive Transformer (SIT). Following the inter-equal grid block fetch operation, the corresponding left and right hemisphere features are aligned and subsequently fed into the SIT for diagnostic analysis. SIT can help the model focus more on the regions of asymmetry caused by structural changes, thus improving diagnostic performance. We evaluated our method on the ADNI dataset, and the results show that the method achieves better diagnostic accuracy (92.5%) than several CNN methods and CNNs combined with a general transformer. The visualization results show that our network pays more attention to regions of brain atrophy, especially the asymmetric pathological characteristics induced by AD, demonstrating the interpretability and effectiveness of the method.

[46] EVDI++: Event-based Video Deblurring and Interpolation via Self-Supervised Learning

Chi Zhang,Xiang Zhang,Chenxu Jiang,Gui-Song Xia,Lei Yu

Main category: cs.CV

TL;DR: EVDI++ is a self-supervised learning framework that uses the high temporal resolution of event cameras to handle video deblurring and interpolation in a unified way.

Details Motivation: Frame-based cameras with long exposure times produce visual blurring and information loss, which degrades video quality; methods are needed to address this. Method: EVDI++ designs a Learnable Double Integral (LDI) network to estimate the mapping relation between reference frames and sharp latent images. It also introduces a learning-based division reconstruction module and an adaptive parameter-free fusion strategy to refine the results and optimize training efficiency. A self-supervised learning framework enables network training with real-world blurry videos and events. Result: Experiments on synthetic and real-world datasets show that the proposed method achieves state-of-the-art performance in video deblurring and interpolation. A dataset of real-world blurry images and events captured with a DAVIS346c camera was also constructed, demonstrating the generalizability of EVDI++. Conclusion: EVDI++ is a unified self-supervised framework that effectively uses the high temporal resolution of event cameras to solve video deblurring and interpolation jointly, and it performs well in real-world scenarios. Abstract: Frame-based cameras with extended exposure times often produce perceptible visual blurring and information loss between frames, significantly degrading video quality. To address this challenge, we introduce EVDI++, a unified self-supervised framework for Event-based Video Deblurring and Interpolation that leverages the high temporal resolution of event cameras to mitigate motion blur and enable intermediate frame prediction. Specifically, the Learnable Double Integral (LDI) network is designed to estimate the mapping relation between reference frames and sharp latent images. Then, we refine the coarse results and optimize overall training efficiency by introducing a learning-based division reconstruction module, enabling images to be converted with varying exposure intervals. We devise an adaptive parameter-free fusion strategy to obtain the final results, utilizing the confidence embedded in the LDI outputs of concurrent events. A self-supervised learning framework is proposed to enable network training with real-world blurry videos and events by exploring the mutual constraints among blurry frames, latent images, and event streams. We further construct a dataset with real-world blurry images and events using a DAVIS346c camera, demonstrating the generalizability of the proposed EVDI++ in real-world scenarios.
Extensive experiments on both synthetic and real-world datasets show that our method achieves state-of-the-art performance in video deblurring and interpolation tasks.

[47] Hyperspectral Mamba for Hyperspectral Object Tracking

Long Gao,Yunhe Zhang,Yan Jiang,Weiying Xie,Yunsong Li

Main category: cs.CV

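TL;DR: HyMamba is a new hyperspectral object tracking network that unifies spectral, cross-depth, and temporal modeling through its Spectral State Integration (SSI) and Hyperspectral Mamba (HSM) modules. It addresses the failure of existing methods to capture intrinsic spectral information, temporal dependencies, and cross-depth interactions, and it achieves state-of-the-art performance on several benchmark datasets.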

Details Motivation: Existing hyperspectral trackers either convert hyperspectral data into false-color images or adopt modality fusion strategies, but they often fail to capture intrinsic spectral information, temporal dependencies, and cross-depth interactions. Method: A new hyperspectral object tracking network equipped with Mamba, called HyMamba, is proposed; it unifies spectral, cross-depth, and temporal modeling through state space modules (SSMs). Its core is the Spectral State Integration (SSI) module, in which an embedded Hyperspectral Mamba (HSM) module learns spatial and spectral information synchronously via three directional scanning SSMs. Result: Experiments show that HyMamba achieves state-of-the-art performance on seven benchmark datasets. For example, on the HOTC2020 dataset it reaches an AUC score of 73.0% and a DP@20 score of 96.3%. Conclusion: HyMamba achieves state-of-the-art performance through unified spectral, cross-depth, and temporal modeling, as shown by experiments on seven benchmark datasets, including an AUC score of 73.0% and a DP@20 score of 96.3% on HOTC2020. Abstract: Hyperspectral object tracking holds great promise due to the rich spectral information and fine-grained material distinctions in hyperspectral images, which are beneficial in challenging scenarios. While existing hyperspectral trackers have made progress by either transforming hyperspectral data into false-color images or incorporating modality fusion strategies, they often fail to capture the intrinsic spectral information, temporal dependencies, and cross-depth interactions. To address these limitations, a new hyperspectral object tracking network equipped with Mamba (HyMamba) is proposed. It unifies spectral, cross-depth, and temporal modeling through state space modules (SSMs). The core of HyMamba lies in the Spectral State Integration (SSI) module, which enables progressive refinement and propagation of spectral features with cross-depth and temporal spectral information. Embedded within each SSI, the Hyperspectral Mamba (HSM) module is introduced to learn spatial and spectral information synchronously via three directional scanning SSMs. Based on SSI and HSM, HyMamba constructs joint features from false-color and hyperspectral inputs, and enhances them through interaction with original spectral features extracted from raw hyperspectral images. Extensive experiments conducted on seven benchmark datasets demonstrate that HyMamba achieves state-of-the-art performance. For instance, it achieves an AUC score of 73.0% and a DP@20 score of 96.3% on the HOTC2020 dataset.
The code will be released at https://github.com/lgao001/HyMamba.

[48] Examining Vision Language Models through Multi-dimensional Experiments with Vision and Text Features

Saurav Sengupta,Nazanin Moradinasab,Jiebei Liu,Donald E. Brown

Main category: cs.CV

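TL;DR: This paper studies why vision language models (VLMs), owing to inherent biases from training, perform poorly on specific questions that require attending to particular regions of an image. Using a multi-dimensional examination framework and open-source VLMs, it investigates how input characteristics (such as image size, number of objects, background color, and prompt specificity) affect VLM performance. The results show that even minor input changes can lead to large changes in a VLM's answers and performance.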

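Details Motivation: Recent research on vision language models (VLMs) shows that they rely on inherent biases learned during training to answer questions about the visual properties of an image. These biases are exacerbated by highly specific questions that require focusing on particular areas of the image. For example, when asked to count the stars on a modified American flag (e.g., one with more than 50 stars), a VLM will often disregard the visual evidence and fail to answer accurately. This study builds on that work and aims to determine systematically which characteristics of the input data cause such performance differences. Method: A multi-dimensional examination framework is built to study systematically how characteristics of the input data, including both the image and the prompt, lead to performance differences. Using open-source VLMs, the study further examines how attention values fluctuate with input parameters such as image size, number of objects in the image, background color, and prompt specificity. Result: Minor changes in image characteristics and prompt specificity lead to large changes in how a VLM formulates its answer and, subsequently, in its overall performance. Attention values also fluctuate with the input parameters, which suggests that changes in VLM behavior can be characterized through changes in the input data. Conclusion: Minor changes in image characteristics and prompts lead to large changes in how a VLM answers and in its overall performance. This underlines the importance of studying changes in VLM behavior, and the study explores methods for characterizing them. Abstract: Recent research on Vision Language Models (VLMs) suggests that they rely on inherent biases learned during training to respond to questions about visual properties of an image. These biases are exacerbated when VLMs are asked highly specific questions that require focusing on specific areas of the image. For example, a VLM tasked with counting stars on a modified American flag (e.g., with more than 50 stars) will often disregard the visual evidence and fail to answer accurately. We build upon this research and develop a multi-dimensional examination framework to systematically determine which characteristics of the input data, including both the image and the accompanying prompt, lead to such differences in performance. Using open-source VLMs, we further examine how attention values fluctuate with varying input parameters (e.g., image size, number of objects in the image, background color, prompt specificity). This research aims to learn how the behavior of vision language models changes and to explore methods for characterizing such changes. Our results suggest, among other things, that even minor modifications in image characteristics and prompt specificity can lead to large changes in how a VLM formulates its answer and, subsequently, its overall performance.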

[49] Generalized Zero-Shot Learning for Point Cloud Segmentation with Evidence-Based Dynamic Calibration

Hyeonseok Kim,Byeongkeun Kang,Yeejin Lee

Main category: cs.CV

TL;DR: This paper proposes a new method, E3DPC-GZSL, to address prediction bias in generalized zero-shot semantic segmentation of 3D point clouds.

Details Motivation: Existing models for generalized zero-shot semantic segmentation tend to make overconfident predictions for the classes encountered during training, which is especially pronounced in 3D applications where training data is relatively scarce. Method: E3DPC-GZSL addresses overconfidence by integrating an evidence-based uncertainty estimator into the classifier and adjusting prediction probabilities with a dynamic calibrated stacking factor. It also introduces a new training strategy that refines the semantic space by merging learnable parameters with text-derived features. Result: Experiments show that E3DPC-GZSL achieves state-of-the-art performance on generalized zero-shot semantic segmentation datasets, including ScanNet v2 and S3DIS. Conclusion: E3DPC-GZSL effectively addresses prediction bias in generalized zero-shot semantic segmentation of 3D point clouds and improves performance on unseen data through better uncertainty estimation and model optimization. Abstract: Generalized zero-shot semantic segmentation of 3D point clouds aims to classify each point into both seen and unseen classes. A significant challenge with these models is their tendency to make biased predictions, often favoring the classes encountered during training. This problem is more pronounced in 3D applications, where the scale of the training data is typically smaller than in image-based tasks. To address this problem, we propose a novel method called E3DPC-GZSL, which reduces overconfident predictions towards seen classes without relying on separate classifiers for seen and unseen data. E3DPC-GZSL tackles the overconfidence problem by integrating an evidence-based uncertainty estimator into a classifier. This estimator is then used to adjust prediction probabilities using a dynamic calibrated stacking factor that accounts for pointwise prediction uncertainty. In addition, E3DPC-GZSL introduces a novel training strategy that improves uncertainty estimation by refining the semantic space. This is achieved by merging learnable parameters with text-derived features, thereby improving model optimization for unseen data. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on generalized zero-shot semantic segmentation datasets, including ScanNet v2 and S3DIS.
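To make the idea of an uncertainty-dependent calibrated stacking factor concrete, here is a rough sketch. Calibrated stacking in general subtracts a penalty from seen-class scores; the linear form `gamma * uncertainty` used here, along with the names `calibrate`, `predict`, and `gamma`, is an assumption for illustration and is not the formula from the paper.

```python
def calibrate(probs, seen, uncertainty, gamma=0.5):
    """Subtract an uncertainty-scaled penalty from seen-class scores.

    probs: per-class scores for one point.
    seen: set of class indices encountered during training.
    uncertainty: scalar in [0, 1] for this point (assumed to be given).
    gamma: hypothetical scaling constant for the penalty.
    """
    penalty = gamma * uncertainty
    return [p - penalty if c in seen else p for c, p in enumerate(probs)]


def predict(probs, seen, uncertainty, gamma=0.5):
    """Return the index of the highest calibrated score."""
    adjusted = calibrate(probs, seen, uncertainty, gamma)
    return max(range(len(adjusted)), key=adjusted.__getitem__)


probs = [0.5, 0.1, 0.4]   # classes 0 and 1 are seen, class 2 is unseen
seen = {0, 1}
confident = predict(probs, seen, uncertainty=0.0)  # no penalty applied
uncertain = predict(probs, seen, uncertainty=0.8)  # seen classes penalized
print(confident, uncertain)
```

With zero uncertainty the seen class keeps its lead; with high uncertainty the penalty lets the unseen class win, which is the bias correction the abstract describes at a high level.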

[50] Dual-Thresholding Heatmaps to Cluster Proposals for Weakly Supervised Object Detection

Yuelin Guo,Haoyu He,Zhiyuan Chen,Zitong Huang,Renhao Lu,Lu Shi,Zejun Wang,Weizhe Zhang

Main category: cs.CV

TL;DR: This paper presents an improved framework for weakly supervised object detection, addressing key limitations in pseudo ground truth generation, background class representation, and convergence speed, with favorable results on benchmark datasets.

Details Motivation: The motivation is to overcome the limitations of current weakly supervised object detection methods, such as the lack of accurate pseudo ground truth boxes, poor background class representation, and slow convergence. Method: The paper proposes a framework incorporating a heatmap-guided proposal selector (HGPS), a weakly supervised basic detection network (WSBDN), and a negative certainty supervision loss to improve detection performance. Result: The experiments show that the framework achieves mAP/mCorLoc scores of 58.5%/81.8% on VOC 2007 and 55.6%/80.5% on VOC 2012, performing favorably against state-of-the-art methods. Conclusion: The paper concludes that the proposed framework effectively addresses the limitations of existing weakly supervised object detection methods, achieving favorable results on PASCAL VOC datasets. Abstract: Weakly supervised object detection (WSOD) has attracted significant attention in recent years, as it does not require box-level annotations. State-of-the-art methods generally adopt a multi-module network, which employs WSDDN as the multiple instance detection network module and multiple instance refinement modules to refine performance. However, these approaches suffer from three key limitations. First, existing methods tend to generate pseudo GT boxes that either focus only on discriminative parts, failing to capture the whole object, or cover the entire object but fail to distinguish between adjacent intra-class instances. Second, the foundational WSDDN architecture lacks a crucial background class representation for each proposal and exhibits a large semantic gap between its branches. Third, prior methods discard ignored proposals during optimization, leading to slow convergence. 
To address these challenges, we first design a heatmap-guided proposal selector (HGPS) algorithm, which utilizes dual thresholds on heatmaps to pre-select proposals, enabling pseudo GT boxes to both capture the full object extent and distinguish between adjacent intra-class instances. We then present a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation and uses heatmaps for pre-supervision to bridge the semantic gap between matrices. At last, we introduce a negative certainty supervision loss on ignored proposals to accelerate convergence. Extensive experiments on the challenging PASCAL VOC 2007 and 2012 datasets demonstrate the effectiveness of our framework. We achieve mAP/mCorLoc scores of 58.5%/81.8% on VOC 2007 and 55.6%/80.5% on VOC 2012, performing favorably against the state-of-the-art WSOD methods. Our code is publicly available at https://github.com/gyl2565309278/DTH-CP.

[51] An Open Benchmark Dataset for GeoAI Foundation Models for Oil Palm Mapping in Indonesia

M. Warizmi Wafiq,Peter Cutter,Ate Poortinga,Daniel Marc G. dela Torre,Karis Tenneson,Vanna Teck,Enikoe Bihari,Chanarun Saisaward,Weraphong Suaruang,Andrea McMahon,Andi Vika Faradiba Muin,Karno B. Batiran,Chairil A,Nurul Qomar,Arya Arismaya Metananda,David Ganz,David Saah

Main category: cs.CV

TL;DR: This paper presents an open-access geospatial dataset for oil palm plantations in Indonesia, enhancing deforestation monitoring and supporting sustainability goals through detailed mapping and remote sensing model training.

Details Motivation: Oil palm cultivation is a leading cause of deforestation in Indonesia, and detailed, reliable mapping is needed to support sustainability efforts and regulatory frameworks. Method: The dataset was created using wall-to-wall digitization over large grids, based on expert labeling of high-resolution satellite imagery from 2020 to 2024, with quality assurance through multi-interpreter consensus and field validation. Result: An open-access geospatial dataset of oil palm plantations and related land cover types in Indonesia was produced, providing polygon-based, wall-to-wall annotations with a hierarchical typology and suitable for training and benchmarking remote sensing models. Conclusion: The dataset contributes to global deforestation reduction goals by supporting transparent monitoring of oil palm expansion and follows FAIR data principles. Abstract: Oil palm cultivation remains one of the leading causes of deforestation in Indonesia. To better track and address this challenge, detailed and reliable mapping is needed to support sustainability efforts and emerging regulatory frameworks. We present an open-access geospatial dataset of oil palm plantations and related land cover types in Indonesia, produced through expert labeling of high-resolution satellite imagery from 2020 to 2024. The dataset provides polygon-based, wall-to-wall annotations across a range of agro-ecological zones and includes a hierarchical typology that distinguishes oil palm planting stages as well as similar perennial crops. Quality was ensured through multi-interpreter consensus and field validation. The dataset was created using wall-to-wall digitization over large grids, making it suitable for training and benchmarking both conventional convolutional neural networks and newer geospatial foundation models. Released under a CC-BY license, it fills a key gap in training data for remote sensing and aims to improve the accuracy of land cover types mapping. 
By supporting transparent monitoring of oil palm expansion, the resource contributes to global deforestation reduction goals and follows FAIR data principles.

[52] SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training

Rongsheng Wang,Fenghe Tang,Qingsong Yao,Rui Yan,Xu Zhang,Zhen Huang,Haoran Lai,Zhiyang He,Xiaodong Tao,Zihang Jiang,Shaohua Kevin Zhou

Main category: cs.CV

TL;DR: SimCroP improves radiograph interpretation by aligning report sentences with corresponding patches and fusing multimodal information across different granularities.

Details Motivation: Medical vision-language pre-training has potential, but challenges such as spatial sparsity of lesions in CT scans and complex relationships between pathological descriptions and sub-regions in radiographs need to be addressed. Method: The Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework combines similarity-driven alignment and cross-granularity fusion to enhance radiograph interpretation. Result: SimCroP demonstrates improved performance on multi-scale downstream tasks like image classification and segmentation. Conclusion: SimCroP outperforms existing medical self-supervised learning and vision-language pre-training methods. Abstract: Medical vision-language pre-training shows great potential in learning representative features from massive paired radiographs and reports. However, in computed tomography (CT) scans, the distribution of lesions which contain intricate structures is characterized by spatial sparsity. Besides, the complex and implicit relationships between different pathological descriptions in each sentence of the report and their corresponding sub-regions in radiographs pose additional challenges. In this paper, we propose a Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework on chest CTs, which combines similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation. We first leverage multi-modal masked modeling to optimize the encoder for understanding precise low-level semantics from radiographs. Then, similarity-driven alignment is designed to pre-train the encoder to adaptively select and align the correct patches corresponding to each sentence in reports. The cross-granularity fusion module integrates multimodal information across instance level and word-patch level, which helps the model better capture key pathology structures in sparse radiographs, resulting in improved performance for multi-scale downstream tasks. 
SimCroP is pre-trained on a large-scale paired CT-reports dataset and validated on image classification and segmentation tasks across five public datasets. Experimental results demonstrate that SimCroP outperforms both cutting-edge medical self-supervised learning methods and medical vision-language pre-training methods. Codes and models are available at https://github.com/ToniChopp/SimCroP.

[53] Boosted Training of Lightweight Early Exits for Optimizing CNN Image Classification Inference

Yehudit Aperstein,Alexander Apartsin

Main category: cs.CV

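TL;DR: This paper proposes BTS-EE, a new training method for early exits, to balance efficiency and accuracy in real-time image classification on resource-constrained platforms.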

Details Motivation: On resource-constrained platforms, real-time image classification must balance accuracy against latency and power consumption, and conventional early-exit training suffers from a covariance shift problem. Method: The paper introduces BTS-EE (Boosted Training Scheme for Early Exits), a lightweight branch architecture based on 1D convolutions, and the CPM calibration method. Result: On the CINIC-10 dataset, BTS-EE outperforms non-boosted training across 64 configurations, reducing computation by up to 45% with only a 2% loss in accuracy. Conclusion: Through sequential training and a lightweight branch architecture, BTS-EE achieves greater efficiency and practicality for real-time image processing systems. Abstract: Real-time image classification on resource-constrained platforms demands inference methods that balance accuracy with strict latency and power budgets. Early-exit strategies address this need by attaching auxiliary classifiers to intermediate layers of convolutional neural networks (CNNs), allowing "easy" samples to terminate inference early. However, conventional training of early exits introduces a covariance shift: downstream branches are trained on full datasets, while at inference they process only the harder, non-exited samples. This mismatch limits efficiency-accuracy trade-offs in practice. We introduce the Boosted Training Scheme for Early Exits (BTS-EE), a sequential training approach that aligns branch training with inference-time data distributions. Each branch is trained and calibrated before the next, ensuring robustness under selective inference conditions. To further support embedded deployment, we propose a lightweight branch architecture based on 1D convolutions and a Class Precision Margin (CPM) calibration method that enables per-class threshold tuning for reliable exit decisions. Experiments on the CINIC-10 dataset with a ResNet18 backbone demonstrate that BTS-EE consistently outperforms non-boosted training across 64 configurations, achieving up to 45 percent reduction in computation with only 2 percent accuracy degradation. These results expand the design space for deploying CNNs in real-time image processing systems, offering practical efficiency gains for applications such as industrial inspection, embedded vision, and UAV-based monitoring.
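The early-exit decision with per-class thresholds can be sketched as follows. This is a simplified illustration, not the authors' implementation: branch outputs are precomputed here, whereas a real system would stop running the backbone once a branch exits, and the threshold values are made up rather than produced by CPM calibration. The function name `early_exit_predict` is hypothetical.

```python
def early_exit_predict(branch_probs, thresholds):
    """Return (predicted_class, exit_index) for one sample.

    branch_probs: list of per-exit probability lists, shallow to deep;
                  the last entry is the final classifier.
    thresholds:   one list of per-class confidence thresholds for each
                  early exit (so len(thresholds) == len(branch_probs) - 1).
    """
    for depth, probs in enumerate(branch_probs[:-1]):
        cls = max(range(len(probs)), key=probs.__getitem__)
        # Exit early only if the top class clears its own threshold.
        if probs[cls] >= thresholds[depth][cls]:
            return cls, depth
    final = branch_probs[-1]
    return max(range(len(final)), key=final.__getitem__), len(branch_probs) - 1


branch_probs = [[0.6, 0.4], [0.1, 0.9], [0.2, 0.8]]
thresholds = [[0.7, 0.7], [0.95, 0.85]]
result = early_exit_predict(branch_probs, thresholds)
print(result)
```

In this toy case the first exit is not confident enough for class 0, while the second exit clears the class-1 threshold, so inference stops at exit index 1.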

[54] Retrieval-Augmented VLMs for Multimodal Melanoma Diagnosis

Jihyun Moon,Charmgil Hong

Main category: cs.CV

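TL;DR: This study proposes a retrieval-augmented vision-language model framework for dermoscopic image analysis to improve the diagnostic accuracy of malignant melanoma.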

Details Motivation: Accurate and early diagnosis of malignant melanoma calls for a method that can incorporate clinical metadata and overcome the limitations of convolutional neural networks and conventional vision-language models. Method: A retrieval-augmented vision-language model framework is proposed that incorporates semantically similar patient cases into the diagnostic prompt. Result: The method enables informed predictions without fine-tuning and significantly outperforms conventional baselines in classification accuracy and error correction. Conclusion: Retrieval-augmented vision-language models provide a robust strategy for clinical decision support. Abstract: Accurate and early diagnosis of malignant melanoma is critical for improving patient outcomes. While convolutional neural networks (CNNs) have shown promise in dermoscopic image analysis, they often neglect clinical metadata and require extensive preprocessing. Vision-language models (VLMs) offer a multimodal alternative but struggle to capture clinical specificity when trained on general-domain data. To address this, we propose a retrieval-augmented VLM framework that incorporates semantically similar patient cases into the diagnostic prompt. Our method enables informed predictions without fine-tuning and significantly improves classification accuracy and error correction over conventional baselines. These results demonstrate that retrieval-augmented prompting provides a robust strategy for clinical decision support.
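A rough sketch of retrieval-augmented prompting follows: retrieve the most similar stored cases by cosine similarity, then place them in the prompt ahead of the current case. The embeddings, metadata strings, labels, and prompt wording are all made-up placeholders, and the helper names (`cosine`, `retrieve`, `build_prompt`) are hypothetical; the paper's retrieval and prompt design may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, cases, k=2):
    """Return the k stored cases most similar to the query embedding."""
    ranked = sorted(cases, key=lambda c: cosine(query_embedding, c["embedding"]),
                    reverse=True)
    return ranked[:k]


def build_prompt(query_metadata, neighbors):
    """Assemble a text prompt that lists retrieved cases before the query."""
    lines = ["Similar past cases:"]
    for case in neighbors:
        lines.append(f"- {case['metadata']} -> {case['label']}")
    lines.append(f"Current patient: {query_metadata}")
    lines.append("Is the lesion melanoma or benign?")
    return "\n".join(lines)


# Toy case store; values are illustrative only.
cases = [
    {"embedding": [1.0, 0.0], "metadata": "age 61, back", "label": "melanoma"},
    {"embedding": [0.9, 0.1], "metadata": "age 58, arm", "label": "benign"},
    {"embedding": [0.0, 1.0], "metadata": "age 23, leg", "label": "benign"},
]
neighbors = retrieve([1.0, 0.05], cases, k=2)
prompt = build_prompt("age 60, back", neighbors)
print(prompt)
```

Because the retrieved cases are supplied as context, the underlying model does not need to be fine-tuned, which matches the claim in the abstract.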

[55] InsFusion: Rethink Instance-level LiDAR-Camera Fusion for 3D Object Detection

Zhongyu Xia,Hansong Yang,Yongtao Wang

Main category: cs.CV

TL;DR: InsFusion improves 3D object detection by reducing error accumulation through proposal-based feature querying and attention mechanisms.

Details Motivation: Noise and error accumulate during feature extraction, perspective transformation, and fusion in 3D object detection, which needs to be addressed to improve detection accuracy. Method: InsFusion extracts proposals from both raw and fused features and uses these proposals to query the raw features, incorporating attention mechanisms to mitigate accumulated errors. Result: Experiments on the nuScenes dataset show that InsFusion mitigates error accumulation and improves performance in 3D object detection. Conclusion: InsFusion demonstrates compatibility with advanced baseline methods and achieves state-of-the-art performance for 3D object detection. Abstract: Three-dimensional Object Detection from multi-view cameras and LiDAR is a crucial component for autonomous driving and smart transportation. However, in the process of basic feature extraction, perspective transformation, and feature fusion, noise and error will gradually accumulate. To address this issue, we propose InsFusion, which can extract proposals from both raw and fused features and utilizes these proposals to query the raw features, thereby mitigating the impact of accumulated errors. Additionally, by incorporating attention mechanisms applied to the raw features, it thereby mitigates the impact of accumulated errors. Experiments on the nuScenes dataset demonstrate that InsFusion is compatible with various advanced baseline methods and delivers new state-of-the-art performance for 3D object detection.

[56] Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video

Xiao Li,Qi Chen,Xiulian Peng,Kai Yu,Xie Chen,Yan Lu

Main category: cs.CV

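TL;DR: This paper proposes a self-supervised video disentanglement framework based on a transformer and a diffusion model that separates video into dynamic motion and static content and performs well on several types of video data.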

Details Motivation: To reduce the reliance of earlier methods on prior assumptions and inductive biases, the study aims for a more general disentangled representation learning method that applies to different types of video data. Method: The method uses a transformer-based architecture to generate flexible implicit features for frame-wise motion and clip-wise static content, introduces low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space, and uses a denoising diffusion model for self-supervised representation learning. Result: The method is validated on real-world talking head videos with motion transfer and auto-regressive motion generation tasks, and it generalizes well to other types of video data, such as pixel sprites of 2D cartoon characters. Conclusion: The study proposes a novel and general framework that disentangles video data into dynamic motion and static content in a self-supervised way, offering a new perspective for video analysis and generation. Abstract: We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we also show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.
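The role of vector quantization as a bitrate-limited bottleneck can be illustrated with a small sketch: each motion feature is replaced by its nearest codebook entry, so the information carried per token is capped at log2 of the codebook size. The codebook, feature vector, and function names here are assumptions for illustration; the paper's quantizer and its training are not reproduced.

```python
import math


def quantize(vec, codebook):
    """Return (index, code) of the codebook entry nearest to vec."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: dist2(vec, codebook[i]))
    return idx, codebook[idx]


def bits_per_frame(codebook_size, tokens_per_frame=1):
    """Upper bound on bits needed to transmit the code indices of one frame."""
    return tokens_per_frame * math.log2(codebook_size)


# Toy codebook with 4 entries (2 bits per token).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
index, code = quantize([0.9, 0.2], codebook)
print(index, code, bits_per_frame(len(codebook)))
```

A smaller codebook or fewer tokens per frame lowers the bitrate, which restricts how much information the motion branch can carry; according to the abstract, this bottleneck is what promotes the separation of motion from content.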

[57] Semantic Causality-Aware Vision-Based 3D Occupancy Prediction

Dubing Chen,Huan Zheng,Yucheng Zhou,Xianfei Li,Wenlong Liao,Tao He,Pai Peng,Jianbing Shen

Main category: cs.CV

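TL;DR: This paper proposes an end-to-end 2D-to-3D semantic transformation method that uses a causal loss to improve the performance and robustness of 3D semantic occupancy prediction.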

Details Motivation: Existing methods rely on modular pipelines whose modules are typically optimized independently or use pre-configured inputs, leading to cascading errors. Method: A new causal loss is designed that makes the entire 2D-to-3D transformation pipeline differentiable; the transformation comprises three components: Channel-Grouped Lifting, Learnable Camera Offsets, and Normalized Convolution. Result: Experiments show that the method achieves the best performance on the Occ3D benchmark, with significant robustness to camera perturbations and improved 2D-to-3D semantic consistency. Conclusion: The paper proposes a 2D-to-3D transformation method based on semantic causality that enables end-to-end supervision of the whole pipeline and achieves state-of-the-art performance on the Occ3D benchmark. Abstract: Vision-based 3D semantic occupancy prediction is a critical task in 3D vision that integrates volumetric 3D reconstruction with semantic understanding. Existing methods, however, often rely on modular pipelines. These modules are typically optimized independently or use pre-configured inputs, leading to cascading errors. In this paper, we address this limitation by designing a novel causal loss that enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline. Grounded in the principle of 2D-to-3D semantic causality, this loss regulates the gradient flow from 3D voxel representations back to the 2D features. Consequently, it renders the entire pipeline differentiable, unifying the learning process and making previously non-trainable components fully learnable. Building on this principle, we propose the Semantic Causality-Aware 2D-to-3D Transformation, which comprises three components guided by our causal loss: Channel-Grouped Lifting for adaptive semantic mapping, Learnable Camera Offsets for enhanced robustness against camera perturbations, and Normalized Convolution for effective feature propagation. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the Occ3D benchmark, demonstrating significant robustness to camera perturbations and improved 2D-to-3D semantic consistency.

[58] VRAE: Vertical Residual Autoencoder for License Plate Denoising and Deblurring

Cuong Nguyen,Dung T. Tran,Hong Nguyen,Xuan-Vu Phan,Nam-Phong Nguyen

Main category: cs.CV

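TL;DR: This paper proposes a Vertical Residual Autoencoder (VRAE) architecture for image enhancement in traffic surveillance, which injects input-aware features at each encoding stage to improve on the information preservation of conventional autoencoders. Experiments show that it outperforms conventional methods on PSNR, NMSE, and SSIM with a minimal increase in parameters.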

Details Motivation: In real-world traffic surveillance, noisy and blurred images caused by adverse weather, poor lighting, or high-speed motion significantly reduce the accuracy of license plate recognition systems, so a fast, real-time image restoration method is needed to improve recognition performance. Method: A Vertical Residual Autoencoder (VRAE) architecture is proposed that adds an auxiliary block at each encoding stage to inject input-aware features, guiding the network's representation learning and improving information preservation. Result: Experiments show that VRAE outperforms Autoencoder (AE), Generative Adversarial Network (GAN), and Flow-Based (FB) approaches; compared with an AE of the same depth, it improves PSNR by about 20%, reduces NMSE by about 50%, and improves SSIM by 1%, with only about 1% more parameters. Conclusion: The proposed VRAE architecture performs well on traffic surveillance image enhancement, noticeably improving image quality with only a small increase in parameters, which makes it practical to use. Abstract: In real-world traffic surveillance, vehicle images captured under adverse weather, poor lighting, or high-speed motion often suffer from severe noise and blur. Such degradations significantly reduce the accuracy of license plate recognition systems, especially when the plate occupies only a small region within the full vehicle image. Restoring these degraded images in a fast, real-time manner is thus a crucial pre-processing step to enhance recognition performance. In this work, we propose a Vertical Residual Autoencoder (VRAE) architecture designed for the image enhancement task in traffic surveillance. The method incorporates an enhancement strategy that employs an auxiliary block, which injects input-aware features at each encoding stage to guide the representation learning process, enabling better general information preservation throughout the network compared to conventional autoencoders. Experiments on a vehicle image dataset with visible license plates demonstrate that our method consistently outperforms Autoencoder (AE), Generative Adversarial Network (GAN), and Flow-Based (FB) approaches. Compared with AE at the same depth, it improves PSNR by about 20%, reduces NMSE by around 50%, and enhances SSIM by 1%, while requiring only a marginal increase of roughly 1% in parameters.
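For readers unfamiliar with the reported metrics, the following sketch computes PSNR and NMSE under their common definitions, on flattened pixel lists with an assumed 8-bit peak value of 255. The paper does not state its exact normalization, so these formulas are an assumption, and the pixel values are made up.

```python
import math


def mse(ref, out):
    """Mean squared error between two equal-length pixel lists."""
    return sum((r - o) ** 2 for r, o in zip(ref, out)) / len(ref)


def psnr(ref, out, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(ref, out)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / err)


def nmse(ref, out):
    """Squared error normalized by the energy of the reference image."""
    num = sum((r - o) ** 2 for r, o in zip(ref, out))
    den = sum(r ** 2 for r in ref)
    return num / den


ref = [100.0, 200.0]        # clean pixels (illustrative)
restored = [110.0, 190.0]   # restored pixels (illustrative)
print(psnr(ref, restored), nmse(ref, restored))
```

Higher PSNR and lower NMSE both indicate a restoration closer to the reference, which is the direction of the improvements reported in the abstract.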

[59] Sparse BEV Fusion with Self-View Consistency for Multi-View Detection and Tracking

Keisuke Toida,Taigo Sakai,Naoki Kato,Kazutoyo Yokota,Takeshi Nakamura,Kazuhiro Hotta

Main category: cs.CV

TL;DR: SCFusion addresses the challenges of multi-view multi-object tracking by introducing a novel framework that enhances feature integration through sparse transformation, density-aware weighting, and multi-view consistency loss, achieving superior performance on two benchmark datasets.

Details Motivation: Maintaining consistent object identities across multiple cameras remains challenging due to viewpoint changes, lighting variations, and occlusions. Conventional BEV projection introduces feature distortion and non-uniform density, which degrade the quality of the fused representation. Method: SCFusion combines three techniques: sparse transformation, density-aware weighting, and multi-view consistency loss to improve multi-view feature integration. Result: SCFusion achieves state-of-the-art performance, reaching an IDF1 score of 95.9% on WildTrack and a MODP of 89.2% on MultiviewX, outperforming the baseline method TrackTacular. Conclusion: SCFusion effectively mitigates the limitations of conventional BEV projection and provides a robust and accurate solution for multi-view object detection and tracking. Abstract: Multi-View Multi-Object Tracking (MVMOT) is essential for applications such as surveillance, autonomous driving, and sports analytics. However, maintaining consistent object identities across multiple cameras remains challenging due to viewpoint changes, lighting variations, and occlusions, which often lead to tracking errors.Recent methods project features from multiple cameras into a unified Bird's-Eye-View (BEV) space to improve robustness against occlusion. However, this projection introduces feature distortion and non-uniform density caused by variations in object scale with distance. These issues degrade the quality of the fused representation and reduce detection and tracking accuracy.To address these problems, we propose SCFusion, a framework that combines three techniques to improve multi-view feature integration. First, it applies a sparse transformation to avoid unnatural interpolation during projection. Next, it performs density-aware weighting to adaptively fuse features based on spatial confidence and camera distance. 
Finally, it introduces a multi-view consistency loss that encourages each camera to learn discriminative features independently before fusion. Experiments show that SCFusion achieves state-of-the-art performance, reaching an IDF1 score of 95.9% on WildTrack and a MODP of 89.2% on MultiviewX, outperforming the baseline method TrackTacular. These results demonstrate that SCFusion effectively mitigates the limitations of conventional BEV projection and provides a robust and accurate solution for multi-view object detection and tracking.
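
For readers unfamiliar with the IDF1 metric reported above, the sketch below computes it from identity-level counts using the standard definition, IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The counts in the usage note are hypothetical and are not taken from the paper.

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1: harmonic mean of identity precision and identity recall.

    idtp, idfp, idfn are identity-level true positives, false positives,
    and false negatives after matching predicted tracks to ground truth.
    """
    denom = 2 * idtp + idfp + idfn
    # With no detections and no ground truth the score is undefined;
    # returning 0.0 here is a convention chosen for this sketch.
    return 2 * idtp / denom if denom else 0.0
```

As an illustration, hypothetical counts of 959 identity true positives with 41 false positives and 41 false negatives give `idf1(959, 41, 41)`, a score of 0.959.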

[60] LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations

Payal Varshney,Adriano Lucieri,Christoph Balada,Sheraz Ahmed,Andreas Dengel

Main category: cs.CV

TL;DR: The paper introduces LD-ViCE, a novel framework for explaining video-based AI models, which operates efficiently in latent space and produces realistic, interpretable counterfactuals, significantly improving performance over existing methods.

Details Motivation: Video-based AI systems are increasingly used in safety-critical domains, but their decisions are hard to interpret due to the spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques face challenges in temporal coherence, robustness, and actionable causal insights. Method: Latent Diffusion for Video Counterfactual Explanations (LD-ViCE) operates in latent space using a state-of-the-art diffusion model to reduce computational costs, with an additional refinement step to produce realistic and interpretable counterfactuals. Result: LD-ViCE outperforms recent state-of-the-art methods by achieving a 68% increase in R2 score while halving inference time. It generates semantically meaningful and temporally coherent explanations across diverse video datasets. Conclusion: LD-ViCE represents a significant advancement in explaining video-based AI models, offering semantically meaningful and temporally coherent explanations, and contributing to the trustworthy deployment of AI in safety-critical domains. Abstract: Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence, insufficient robustness, and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. 
Compared to previous approaches, LD-ViCE reduces the computational costs of generating explanations by operating in latent space using a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Our experiments demonstrate the effectiveness of LD-ViCE across three diverse video datasets, including EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition). LD-ViCE outperforms a recent state-of-the-art method, achieving an increase in R2 score of up to 68% while reducing inference time by half. Qualitative analysis confirms that LD-ViCE generates semantically meaningful and temporally coherent explanations, offering valuable insights into the target model behavior. LD-ViCE represents a valuable step toward the trustworthy deployment of AI in safety-critical domains.

[61] Beyond Distribution Shifts: Adaptive Hyperspectral Image Classification at Test Time

Xia Yue,Anfeng Liu,Ning Chen,Chenjia Huang,Hui Liu,Zhou Huang,Leyuan Fang

Main category: cs.CV

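TL;DR: HyperTTA is a unified framework that builds a multi-degradation dataset and a test-time adaptation strategy, markedly improving the robustness of HSI classification models under diverse degradation conditions.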

Details Motivation: HSI classification models are highly sensitive to distribution shifts caused by real-world degradations such as noise, blur, compression, and atmospheric effects, so a unified framework is needed to improve their robustness under these conditions. Method: The authors construct a multi-degradation hyperspectral dataset, design a spectral-spatial transformer classifier enhanced with a multi-level receptive field mechanism and label smoothing regularization, and add a lightweight test-time adaptation strategy based on a LayerNorm adapter. Result: Extensive experiments on two benchmark datasets show that HyperTTA outperforms existing baselines across a wide range of degradation scenarios, validating both its classification backbone and the proposed TTA scheme. Conclusion: HyperTTA is a unified framework that improves the robustness and generalization of HSI classification models under diverse degradations through a lightweight test-time adaptation strategy and a purpose-built spectral-spatial transformer classifier. Abstract: Hyperspectral image (HSI) classification models are highly sensitive to distribution shifts caused by various real-world degradations such as noise, blur, compression, and atmospheric effects. To address this challenge, we propose HyperTTA, a unified framework designed to enhance model robustness under diverse degradation conditions. Specifically, we first construct a multi-degradation hyperspectral dataset that systematically simulates nine representative types of degradations, providing a comprehensive benchmark for robust classification evaluation. Based on this, we design a spectral-spatial transformer classifier (SSTC) enhanced with a multi-level receptive field mechanism and label smoothing regularization to jointly capture multi-scale spatial context and improve generalization. Furthermore, HyperTTA incorporates a lightweight test-time adaptation (TTA) strategy, the confidence-aware entropy-minimized LayerNorm adapter (CELA), which updates only the affine parameters of LayerNorm layers by minimizing prediction entropy on high-confidence unlabeled target samples. This confidence-aware adaptation prevents unreliable updates from noisy predictions, enabling robust and dynamic adaptation without access to source data or target annotations. Extensive experiments on two benchmark datasets demonstrate that HyperTTA outperforms existing baselines across a wide range of degradation scenarios, validating the effectiveness of both its classification backbone and the proposed TTA scheme. Code will be made available publicly.
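
The objective CELA is described as minimizing, prediction entropy restricted to high-confidence samples, can be sketched as follows. This is a minimal illustration and not the authors' code: the selection rule (top softmax probability at or above a threshold), the threshold value, and the function names are all assumptions. In a real setup the gradient of this loss would update only the LayerNorm affine parameters.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confident_entropy_loss(batch_logits, conf_threshold=0.9):
    """Mean prediction entropy over samples that pass the confidence filter.

    Assumption for this sketch: a sample is "high-confidence" when its top
    softmax probability is at least conf_threshold.
    Returns (loss, number_of_selected_samples).
    """
    selected = []
    for logits in batch_logits:
        probs = softmax(logits)
        if max(probs) >= conf_threshold:
            selected.append(entropy(probs))
    if not selected:
        return 0.0, 0  # nothing reliable to adapt on in this batch
    return sum(selected) / len(selected), len(selected)
```

Skipping low-confidence samples is what keeps noisy predictions from driving the update; a batch with no confident samples contributes no loss at all.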

[62] Spherical Brownian Bridge Diffusion Models for Conditional Cortical Thickness Forecasting

Ivan Stoyanov,Fabian Bongratz,Christian Wachinger

Main category: cs.CV

TL;DR: The paper introduces the Spherical Brownian Bridge Diffusion Model (SBDM) to accurately forecast high-resolution cortical thickness trajectories, addressing challenges posed by the cerebral cortex's complex geometry and the need to integrate multi-modal data.

Details Motivation: Accurate forecasting of individualized, high-resolution cortical thickness trajectories is vital for detecting subtle cortical changes, offering insights into neurodegenerative processes and enabling earlier, more precise interventions. However, the complex non-Euclidean geometry of the cerebral cortex and the integration of multi-modal data present significant challenges. Method: The study introduces the Spherical Brownian Bridge Diffusion Model (SBDM), which uses a bidirectional conditional Brownian bridge diffusion process and a new denoising model called the conditional spherical U-Net (CoS-UNet). This combines spherical convolutions and dense cross-attention to integrate multi-modal data for subject-specific predictions. Result: Experiments on longitudinal datasets from ADNI and OASIS showed that SBDM significantly reduces prediction errors compared to previous approaches. Additionally, SBDM can generate individual factual and counterfactual cortical thickness trajectories. Conclusion: The SBDM model offers a novel and effective framework for forecasting cortical thickness trajectories, allowing for the exploration of both factual and counterfactual scenarios in cortical development. Abstract: Accurate forecasting of individualized, high-resolution cortical thickness (CTh) trajectories is essential for detecting subtle cortical changes, providing invaluable insights into neurodegenerative processes and facilitating earlier and more precise intervention strategies. However, CTh forecasting is a challenging task due to the intricate non-Euclidean geometry of the cerebral cortex and the need to integrate multi-modal data for subject-specific predictions. To address these challenges, we introduce the Spherical Brownian Bridge Diffusion Model (SBDM). Specifically, we propose a bidirectional conditional Brownian bridge diffusion process to forecast CTh trajectories at the vertex level of registered cortical surfaces. 
Our technical contribution includes a new denoising model, the conditional spherical U-Net (CoS-UNet), which combines spherical convolutions and dense cross-attention to integrate cortical surfaces and tabular conditions seamlessly. Compared to previous approaches, SBDM achieves significantly reduced prediction errors, as demonstrated by our experiments based on longitudinal datasets from the ADNI and OASIS. Additionally, we demonstrate SBDM's ability to generate individual factual and counterfactual CTh trajectories, offering a novel framework for exploring hypothetical scenarios of cortical development.

[63] First-order State Space Model for Lightweight Image Super-resolution

Yujie Zhu,Xinyi Zhang,Yekai Lu,Guang Yang,Faming Fang,Guixu Zhang

Main category: cs.CV

TL;DR: This paper introduces FSSM, an improved Mamba module that raises performance on lightweight super-resolution tasks.

Details Motivation: To explore the potential of SSMs, the authors modify the SSM computation to improve performance on lightweight super-resolution tasks. Method: A first-order hold condition is applied in SSMs, a new discretized form is derived, and the cumulative error is analyzed. Result: Experiments show that FSSM improves the performance of MambaIR on five benchmark datasets and surpasses current lightweight SR methods. Conclusion: FSSM improves the Mamba module and, without increasing the number of parameters, surpasses current lightweight SR methods and achieves state-of-the-art results. Abstract: State space models (SSMs), particularly Mamba, have shown promise in NLP tasks and are increasingly applied to vision tasks. However, most Mamba-based vision models focus on network architecture and scan paths, with little attention to the SSM module. In order to explore the potential of SSMs, we modified the calculation process of SSM without increasing the number of parameters to improve the performance on lightweight super-resolution tasks. In this paper, we introduce the First-order State Space Model (FSSM) to improve the original Mamba module, enhancing performance by incorporating token correlations. We apply a first-order hold condition in SSMs, derive the new discretized form, and analyze the cumulative error. Extensive experimental results demonstrate that FSSM improves the performance of MambaIR on five benchmark datasets without additionally increasing the number of parameters, and surpasses current lightweight SR methods, achieving state-of-the-art results.
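
To make the zero-order versus first-order hold distinction concrete, the sketch below discretizes a scalar system x' = a·x + b·u both ways. It is the textbook first-order hold result, not the discretized form derived in the paper, and it assumes a scalar state with a ≠ 0. The first-order hold step uses both the current and the next input, which is one way consecutive tokens become coupled.

```python
import math

def zoh_step(x, u_k, a, b, dt):
    """Zero-order hold: the input is held constant at u_k over the step."""
    ead = math.exp(a * dt)
    return ead * x + b * (ead - 1.0) / a * u_k

def foh_step(x, u_k, u_next, a, b, dt):
    """First-order hold: the input is interpolated linearly from u_k to u_next.

    Exact solution of x' = a*x + b*u(t) over one step of length dt (a != 0).
    """
    ead = math.exp(a * dt)
    g_total = (ead - 1.0) / a                      # integral of e^{a(dt - t)}
    g_next = (ead - 1.0 - a * dt) / (a * a * dt)   # share weighted by t/dt
    return ead * x + b * (g_total - g_next) * u_k + b * g_next * u_next
```

When `u_k == u_next` the two steps coincide, so first-order hold only changes the update where the input varies between neighbouring samples.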

[64] Maximally Useful and Minimally Redundant: The Key to Self Supervised Learning for Imbalanced Data

Yash Kumar Sharma,Vineet Nair,Wilson Naik

Main category: cs.CV

TL;DR: This paper proposes a new self-supervised learning method for imbalanced datasets using more than two views and mutual information, achieving state-of-the-art results.

Details Motivation: The motivation stems from the lack of robustness of contrastive self-supervised learning in imbalanced datasets and a suggestion by Yann LeCun to extend multiview frameworks to more than two views. Method: The method introduces a theoretical justification based on mutual information to handle more than two views, distinguishing intra and inter discriminatory characteristics and proposing a new loss function to filter extreme features. Result: Experimental evaluations showed significant improvements in accuracy on imbalanced datasets like Cifar10-LT, Cifar100-LT, and Imagenet-LT using Resnet architectures. Conclusion: The proposed method enhances the robustness of self-supervised learning on imbalanced datasets by leveraging more than two views and mutual information, achieving state-of-the-art results. Abstract: The robustness of contrastive self-supervised learning (CSSL) for imbalanced datasets is largely unexplored. CSSL usually makes use of multi-view assumptions to learn discriminatory features via similar and dissimilar data samples. CSSL works well on balanced datasets, but does not generalize well for imbalanced datasets. In a very recent paper, as part of future work, Yann LeCun pointed out that the self-supervised multiview framework can be extended to cases involving more than two views. Taking a cue from this insight we propose a theoretical justification based on the concept of mutual information to support the more than two views objective and apply it to the problem of dataset imbalance in self-supervised learning. The proposed method helps extract representative characteristics of the tail classes by segregating between intra and inter discriminatory characteristics. We introduce a loss function that helps us to learn better representations by filtering out extreme features. 
Experimental evaluation on a variety of self-supervised frameworks (both contrastive and non-contrastive) also shows that the more than two view objective works well for imbalanced datasets. We achieve a new state-of-the-art accuracy in self-supervised imbalanced dataset classification (2% improvement in Cifar10-LT using Resnet-18, 5% improvement in Cifar100-LT using Resnet-18, 3% improvement in Imagenet-LT (1k) using Resnet-50).

[65] Prompt-Driven Image Analysis with Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation

Kaleem Ahmad

Main category: cs.CV

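TL;DR: This paper presents a unified image analysis system that carries out multi-step image processing from a single natural-language instruction, with support for repeatable runs and debugging.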

Details Motivation: To make image analysis more practical and reliable, so that users can complete complex visual tasks with a single instruction. Method: Open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description are combined into an end-to-end system. Result: With single-word prompts, detection and segmentation produced usable masks in over 90% of cases with accuracy above 85%; inpainting accounts for 60 to 75% of total runtime. Conclusion: The paper proposes a unified image analysis pipeline that performs multi-step image processing from a single natural-language instruction and provides a repeatable, debuggable workflow. Abstract: Prompt-driven image analysis converts a single natural-language instruction into multiple steps: locate, segment, edit, and describe. We present a practical case study of a unified pipeline that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description into a single workflow. The system works end to end from a single prompt, retains intermediate artifacts for transparent debugging (such as detections, masks, overlays, edited images, and before and after composites), and provides the same functionality through an interactive UI and a scriptable CLI for consistent, repeatable runs. We highlight integration choices that reduce brittleness, including threshold adjustments, mask inspection with light morphology, and resource-aware defaults. In a small, single-word prompt segment, detection and segmentation produced usable masks in over 90% of cases with an accuracy above 85% based on our criteria. On a high-end GPU, inpainting makes up 60 to 75% of total runtime under typical guidance and sampling settings, which highlights the need for careful tuning. The study offers implementation-guided advice on thresholds, mask tightness, and diffusion parameters, and details version pinning, artifact logging, and seed control to support replay. Our contribution is a transparent, reliable pattern for assembling modern vision and multimodal models behind a single prompt, with clear guardrails and operational practices that improve reliability in object replacement, scene augmentation, and removal.

[66] A Structured Review of Underwater Object Detection Challenges and Solutions: From Traditional to Large Vision Language Models

Edwine Nabahirwa,Wei Song,Minghua Zhang,Yi Fang,Zhou Ni

Main category: cs.CV

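TL;DR: This paper reviews the challenges facing underwater object detection (UOD) and traces the progression from traditional techniques to modern methods. It explores the potential of large vision-language models (LVLMs) for UOD and illustrates their prospects through case studies.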

Details Motivation: Underwater object detection (UOD) is vital to a range of marine applications, including oceanographic research, underwater robotics, and marine conservation. However, UOD faces many challenges that degrade its performance. Method: The review groups UOD challenges into five key areas and analyzes the progression from traditional image processing and object detection techniques to modern approaches. It also explores the potential of large vision-language models (LVLMs) for UOD and presents case studies. Result: The review reports three key insights: (i) current UOD methods are insufficient to fully address the challenges of underwater environments; (ii) synthetic data generation with LVLMs shows potential for augmenting datasets but needs further refinement to ensure realism and applicability; (iii) LVLMs hold significant promise for UOD, but their real-time application requires further research. Conclusion: Current UOD methods cannot fully address underwater challenges such as image degradation and small object detection. LVLMs hold significant potential for UOD, but their real-time application still requires further research on optimization techniques. Abstract: Underwater object detection (UOD) is vital to diverse marine applications, including oceanographic research, underwater robotics, and marine conservation. However, UOD faces numerous challenges that compromise its performance. Over the years, various methods have been proposed to address these issues, but they often fail to fully capture the complexities of underwater environments. This review systematically categorizes UOD challenges into five key areas: Image quality degradation, target-related issues, data-related challenges, computational and processing constraints, and limitations in detection methodologies. To address these challenges, we analyze the progression from traditional image processing and object detection techniques to modern approaches. Additionally, we explore the potential of large vision-language models (LVLMs) in UOD, leveraging their multi-modal capabilities demonstrated in other domains. We also present case studies, including synthetic dataset generation using DALL-E 3 and fine-tuning Florence-2 LVLM for UOD. This review identifies three key insights: (i) Current UOD methods are insufficient to fully address challenges like image degradation and small object detection in dynamic underwater environments. (ii) Synthetic data generation using LVLMs shows potential for augmenting datasets but requires further refinement to ensure realism and applicability. (iii) LVLMs hold significant promise for UOD, but their real-time application remains under-explored, requiring further research on optimization techniques.

[67] Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening

Piyush Bagad,Andrew Zisserman

Main category: cs.CV

TL;DR: The paper proposes a self-supervised method to improve time-sensitive video representations for recognizing temporally opposite actions, achieving strong performance on multiple datasets.

Details Motivation: The motivation is to develop compact video representations sensitive to visual change over time, as many current video embeddings poorly represent such changes, especially in distinguishing temporally opposite actions. Method: The method involves a self-supervised adaptation recipe using an auto-encoder with inductive bias inspired by perceptual straightening to inject time-sensitivity into frozen image features. Result: The model achieves linear separability between chiral action pairs, outperforms larger video models on three datasets (Something-Something, EPIC-Kitchens, and Charade), and enhances classification performance when combined with existing models. Conclusion: The paper concludes that their proposed model effectively creates time-sensitive video representations that perform well on chiral action recognition tasks and improve classification performance when combined with existing models. Abstract: Our objective is to develop compact video representations that are sensitive to visual change over time. To measure such time-sensitivity, we introduce a new task: chiral action recognition, where one needs to distinguish between a pair of temporally opposite actions, such as "opening vs. closing a door", "approaching vs. moving away from something", "folding vs. unfolding paper", etc. Such actions (i) occur frequently in everyday life, (ii) require understanding of simple visual change over time (in object state, size, spatial position, count . . . ), and (iii) are known to be poorly represented by many video embeddings. Our goal is to build time aware video representations which offer linear separability between these chiral pairs. To that end, we propose a self-supervised adaptation recipe to inject time-sensitivity into a sequence of frozen image features. Our model is based on an auto-encoder with a latent space with inductive bias inspired by perceptual straightening. 
We show that this results in a compact but time-sensitive video representation for the proposed task across three datasets: Something-Something, EPIC-Kitchens, and Charade. Our method (i) outperforms much larger video models pre-trained on large-scale video datasets, and (ii) leads to an improvement in classification performance on standard benchmarks when combined with these existing models.

[68] HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Liyang Chen,Tianxiang Ma,Jiawei Liu,Bingchuan Li,Zhuowei Chen,Lijie Liu,Xu He,Gen Li,Qian He,Zhiyong Wu

Main category: cs.CV

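TL;DR: This paper proposes HuMo, a unified human-centric video generation framework that addresses multimodal coordination through a high-quality dataset and new training strategies, yielding better generation results.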

Details Motivation: Existing methods face two challenges when generating human videos from multimodal inputs: the lack of training data with paired triplet conditions, and the difficulty of coordinating the subject preservation and audio-visual sync tasks. Method: A two-stage progressive multimodal training paradigm is proposed, comprising a minimal-invasive image injection strategy, a focus-by-predicting strategy, and a time-adaptive classifier-free guidance strategy. Result: HuMo surpasses state-of-the-art methods on sub-tasks and establishes a unified video generation framework that handles multimodal inputs, validated by extensive experiments. Conclusion: HuMo is a unified human-centric video generation framework that effectively integrates text, image, and audio inputs, addresses the shortcomings of existing methods in coordinating these modalities, and achieves superior generation results through its training strategies and inference method. Abstract: Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of training data with paired triplet conditions and the difficulty of collaborating the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt following and visual generation abilities of the foundation model, we adopt the minimal-invasive image injection strategy. For the audio-visual sync task, besides the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllabilities across multimodal inputs, building on previously acquired capabilities, we progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. 
Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
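
The idea of adjusting guidance weights across denoising steps can be illustrated with standard classifier-free guidance and a step-dependent weight. This is a sketch under stated assumptions: the linear schedule, the default weights, and the single-condition setup are hypothetical, and HuMo's actual schedule over text, image, and audio conditions is not specified in the abstract.

```python
def guidance_weight(step, num_steps, w_early=7.5, w_late=3.0):
    """Hypothetical schedule: interpolate linearly from w_early to w_late."""
    if num_steps <= 1:
        return w_early
    t = step / (num_steps - 1)  # 0.0 at the first step, 1.0 at the last
    return (1.0 - t) * w_early + t * w_late

def cfg_combine(eps_uncond, eps_cond, weight):
    """Standard classifier-free guidance on per-element noise predictions.

    eps = eps_uncond + weight * (eps_cond - eps_uncond)
    """
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

At each denoising step the sampler would call `guidance_weight(step, num_steps)` and pass the result to `cfg_combine`; a weight of 1.0 recovers the purely conditional prediction.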

[69] MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models

Garry Yang,Zizhe Chen,Man Hon Wong,Haoyu Lei,Yongqiang Chen,Zhenguo Li,Kaiwen Zhou,James Cheng

Main category: cs.CV

TL;DR: The paper introduces MESH, a systematic benchmark for evaluating hallucinations in Large Video Models (LVMs), revealing their limitations in processing complex video content.

Details Motivation: Current benchmarks for video hallucination heavily rely on manual categorization, neglecting human-like perception of videos. Method: The MESH benchmark was developed using a Question-Answering framework with binary and multi-choice questions to evaluate hallucinations in LVMs. Result: MESH effectively identifies hallucinations in LVMs and highlights their challenges in interpreting fine details and multi-subject actions in longer videos. Conclusion: LVMs perform well in recognizing basic objects but are susceptible to hallucinations when dealing with complex video details and multiple actions. Abstract: Large Video Models (LVMs) build on the semantic capabilities of Large Language Models (LLMs) and vision modules by integrating temporal information to better understand dynamic video content. Despite their progress, LVMs are prone to hallucinations, that is, producing inaccurate or irrelevant descriptions. Current benchmarks for video hallucination depend heavily on manual categorization of video content, neglecting the perception-based processes through which humans naturally interpret videos. We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. It follows a bottom-up approach, evaluating basic objects, coarse-to-fine subject features, and subject-action pairs, aligning with human video understanding. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos. Our evaluations show that while LVMs excel at recognizing basic objects and features, their susceptibility to hallucinations increases markedly when handling fine details or aligning multiple actions involving various subjects in longer videos.

[70] ViewSparsifier: Killing Redundancy in Multi-View Plant Phenotyping

Robin-Nico Kampa,Fabian Deuser,Konrad Habel,Norbert Oswald

Main category: cs.CV

TL;DR: This paper addresses the limitations of single-view models in plant phenotyping by introducing a multi-view approach (ViewSparsifier) that learns view-invariant embeddings, achieving success in plant age prediction and leaf count estimation tasks.

Details Motivation: Single-view deep learning models in plant phenotyping often fail to capture sufficient information for accurate trait estimation. Multi-view datasets and view-invariant embeddings are explored to improve accuracy in plant health assessment and harvest readiness prediction. Method: The ViewSparsifier approach was used to learn view-invariant embeddings by incorporating 24 randomly selected views (selection vector). Additionally, selection matrices were experimented with, covering all 120 views across five height levels. Result: The ViewSparsifier approach won both tasks in the GroMo Grand Challenge, demonstrating effectiveness in handling multi-view data with overlapping information. Conclusion: The ViewSparsifier approach successfully won both tasks in the GroMo Grand Challenge, showing potential for further improvements through randomized view selection across all height levels. Abstract: Plant phenotyping involves analyzing observable characteristics of plants to better understand their growth, health, and development. In the context of deep learning, this analysis is often approached through single-view classification or regression models. However, these methods often fail to capture all information required for accurate estimation of target phenotypic traits, which can adversely affect plant health assessment and harvest readiness prediction. To address this, the Growth Modelling (GroMo) Grand Challenge at ACM Multimedia 2025 provides a multi-view dataset featuring multiple plants and two tasks: Plant Age Prediction and Leaf Count Estimation. Each plant is photographed from multiple heights and angles, leading to significant overlap and redundancy in the captured information. To learn view-invariant embeddings, we incorporate 24 views, referred to as the selection vector, in a random selection. Our ViewSparsifier approach won both tasks. 
For further improvement and as a direction for future research, we also experimented with randomized view selection across all five height levels (120 views total), referred to as selection matrices.

[71] Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

Eric Slyman,Mehrab Tanjim,Kushal Kafle,Stefan Lee

Main category: cs.CV

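TL;DR: This paper proposes a multimodal-aware method, Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB), that improves the evaluation of text-to-image generation systems, particularly in preference judgments and calibration.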

Details Motivation: Multimodal large language models show bias, overconfidence, and unstable performance when evaluating text-to-image generation systems, so a new method is needed to make evaluation more accurate and reliable. Method: A new method combines Bayesian prompt ensembling with image clustering, dynamically assigning prompt weights according to the visual characteristics of each sample. Result: MMB outperforms existing baselines on two text-to-image benchmarks, HPSv2 and MJBench, with clear gains in alignment with human annotations and in calibration. Conclusion: The results indicate that strategies designed specifically for multimodal tasks matter for the reliability and stability of judge models, and point to a new direction for automated evaluation of large-scale text-to-image generation systems. Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
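
A cluster-conditioned prompt ensemble of the kind described can be sketched in a few lines: assign the image to its nearest cluster, then average the per-prompt judge probabilities using that cluster's weights. Everything here is illustrative. The function names, the Euclidean nearest-centroid rule, and the weight table are assumptions, and how MMB actually learns its weights is not covered by this sketch.

```python
def nearest_cluster(embedding, centroids):
    """Index of the centroid closest to the image embedding (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sqdist(embedding, centroids[i]))

def ensemble_probability(prompt_probs, cluster_weights, cluster_id):
    """Weighted average of per-prompt preference probabilities.

    prompt_probs[j]       probability from judge prompt j that image A wins
    cluster_weights[c][j] hypothetical weight of prompt j for image cluster c
    """
    weights = cluster_weights[cluster_id]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, prompt_probs)) / total
```

With hypothetical centroids `[[0, 0], [10, 10]]` and weights `[[0.5, 0.5], [0.9, 0.1]]`, an image embedded at `[9, 9]` falls in cluster 1, so the first prompt dominates the ensemble's judgment for that image.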

[72] Vision-Language Semantic Aggregation Leveraging Foundation Model for Generalizable Medical Image Segmentation

Wenjun Yu,Yinchen Zhou,Jia-Xuan Jiang,Shubin Zeng,Yuee Li,Zhong Wang

Main category: cs.CV

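TL;DR: This paper proposes a new multimodal semantic aggregation method that addresses the semantic gap and feature dispersion problems of multimodal fusion in medical image segmentation.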

Details Motivation: Multimodal models perform well in natural image segmentation but underperform in the medical domain, mainly because of the semantic gap between textual prompts and fine-grained medical visual features and the resulting feature dispersion. Method: An Expectation-Maximization (EM) Aggregation mechanism and a Text-Guided Pixel Decoder are proposed, the former to reduce feature dispersion and the latter to bridge the semantic gap. Result: Experiments on public cardiac and fundus datasets show that the method consistently outperforms existing SOTA methods across multiple domain generalization benchmarks. Conclusion: Through semantic aggregation, the method addresses multimodal fusion in medical image segmentation and improves the model's generalization ability. Abstract: Multimodal models have achieved remarkable success in natural image segmentation, yet they often underperform when applied to the medical domain. Through extensive study, we attribute this performance gap to the challenges of multimodal fusion, primarily the significant semantic gap between abstract textual prompts and fine-grained medical visual features, as well as the resulting feature dispersion. To address these issues, we revisit the problem from the perspective of semantic aggregation. Specifically, we propose an Expectation-Maximization (EM) Aggregation mechanism and a Text-Guided Pixel Decoder. The former mitigates feature dispersion by dynamically clustering features into compact semantic centers to enhance cross-modal correspondence. The latter is designed to bridge the semantic gap by leveraging domain-invariant textual knowledge to effectively guide deep visual representations. The synergy between these two mechanisms significantly improves the model's generalization ability. Extensive experiments on public cardiac and fundus datasets demonstrate that our method consistently outperforms existing SOTA approaches across multiple domain generalization benchmarks.

[73] Improving Greenland Bed Topography Mapping with Uncertainty-Aware Graph Learning on Sparse Radar Data

Bayu Adhi Tama,Homayra Alam,Mostafa Cham,Omar Faruque,Jianwu Wang,Vandana Janeja

Main category: cs.CV

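TL;DR: This study proposes GraphTopoNet, a new graph-learning framework that improves the accuracy of Greenland subglacial bed maps, providing a better basis for sea-level projections and climate policy.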

Details Motivation: Accurate maps of Greenland's subglacial bed are essential for sea-level projections, but radar observations are sparse and uneven. Method: The authors propose GraphTopoNet, a graph-learning framework that explicitly models uncertainty via Monte Carlo dropout and handles data gaps with a hybrid loss combining confidence-weighted radar supervision and dynamically balanced regularization. Result: Applied to three Greenland subregions, GraphTopoNet outperforms interpolation, convolutional, and graph-based baselines, reducing error by up to 60% while preserving fine-scale glacial features. Conclusion: GraphTopoNet shows how sparse, uncertain geophysical observations can be converted into actionable knowledge at continental scale, improving the reliability of bed maps in support of climate forecasting and policy. Abstract: Accurate maps of Greenland's subglacial bed are essential for sea-level projections, but radar observations are sparse and uneven. We introduce GraphTopoNet, a graph-learning framework that fuses heterogeneous supervision and explicitly models uncertainty via Monte Carlo dropout. Spatial graphs built from surface observables (elevation, velocity, mass balance) are augmented with gradient features and polynomial trends to capture both local variability and broad structure. To handle data gaps, we employ a hybrid loss that combines confidence-weighted radar supervision with dynamically balanced regularization. Applied to three Greenland subregions, GraphTopoNet outperforms interpolation, convolutional, and graph-based baselines, reducing error by up to 60 percent while preserving fine-scale glacial features. The resulting bed maps improve reliability for operational modeling, supporting agencies engaged in climate forecasting and policy. More broadly, GraphTopoNet shows how graph machine learning can convert sparse, uncertain geophysical observations into actionable knowledge at continental scale.
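
Monte Carlo dropout, the uncertainty mechanism named above, keeps dropout active at inference and summarizes several stochastic forward passes by their mean and spread. The sketch below applies it to a toy linear model; the model, the dropout rate, and the sample count are assumptions for illustration and bear no relation to GraphTopoNet's architecture.

```python
import random
import statistics

def dropout_forward(features, weights, p_drop, rng):
    """One stochastic pass of a toy linear model with inverted dropout on inputs."""
    keep = 1.0 - p_drop
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            total += w * x / keep  # rescale so the expected output is unchanged
    return total

def mc_dropout_predict(features, weights, p_drop=0.2, n_samples=200, seed=0):
    """Return (mean, standard deviation) over n_samples stochastic passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(features, weights, p_drop, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.pstdev(samples)
```

The mean serves as the prediction and the standard deviation as a per-point uncertainty estimate; with `p_drop=0.0` the passes are identical and the spread collapses to zero.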

[74] Implicit Shape-Prior for Few-Shot Assisted 3D Segmentation

Mathilde Monvoisin,Louise Piecuch,Blanche Texier,Cédric Hémon,Anaïs Barateau,Jérémie Huet,Antoine Nordez,Anne-Sophie Boureau,Jean-Claude Nunes,Diana Mateus

Main category: cs.CV

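TL;DR: The paper proposes a method that reduces manual workload in medical 3D segmentation by introducing an implicit shape prior and automatically selecting the most informative slices, with good results on organ-at-risk segmentation for brain cancer patients and on muscle segmentation for sarcopenia.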

Details Motivation: Complex 3D segmentation tasks cannot yet be fully automated and demand extensive manual work from medical professionals, for example in radiotherapy planning and sarcopenia diagnosis. The paper aims to reduce this manual workload significantly. Method: Introduces an implicit shape prior to segment volumes from sparse manual slice annotations, and designs a framework that automatically selects the most informative slices to guide and minimize the next interactions. Result: Experimental validation shows the method is effective on two medical use cases: assisted segmentation of organs at risk for brain cancer patients, and faster creation of a new database of muscle shapes for patients with sarcopenia. Conclusion: The paper proposes an implicit shape prior that reduces the manual workload of medical professionals in complex 3D segmentation tasks and demonstrates its effectiveness on two medical use cases. Abstract: The objective of this paper is to significantly reduce the manual workload required from medical professionals in complex 3D segmentation tasks that cannot yet be fully automated. For instance, in radiotherapy planning, organs at risk must be accurately identified in computed tomography (CT) or magnetic resonance imaging (MRI) scans to ensure they are spared from harmful radiation. Similarly, diagnosing age-related degenerative diseases such as sarcopenia, which involve progressive loss of muscle volume and strength, is commonly based on muscular mass measurements often obtained from manual segmentation of medical volumes. To alleviate the manual-segmentation burden, this paper introduces an implicit shape prior to segment volumes from sparse slice manual annotations generalized to the multi-organ case, along with a simple framework for automatically selecting the most informative slices to guide and minimize the next interactions. The experimental validation shows the method's effectiveness on two medical use cases: assisted segmentation of organs at risk for brain cancer patients, and acceleration of the creation of a new database with unseen muscle shapes for patients with sarcopenia.

[75] EfficientIML: Efficient High-Resolution Image Manipulation Localization

Jinhan Li,Haoyang He,Lei Xie,Jiangning Zhang

Main category: cs.CV

TL;DR: This paper introduces a new dataset and EfficientIML model to improve detection of advanced image forgeries with high accuracy and low computational cost, suitable for real-time use.

Details Motivation: Current forgery detectors are ineffective against new diffusion-based forgery methods due to lack of exposure and high computational complexity, especially with increasing image resolution. Method: The authors proposed a novel dataset (SIF) containing 1200+ diffusion-generated manipulations and introduced the EfficientIML model, which uses an EfficientRWKV backbone with hybrid state-space and attention mechanisms, along with a multi-scale supervision strategy. Result: EfficientIML outperforms ViT-based and other state-of-the-art lightweight models in localization performance, computational efficiency (FLOPs), and inference speed on both the proposed dataset and standard benchmarks. Conclusion: The proposed EfficientIML model effectively detects diffusion-generated image forgeries with high accuracy and efficiency, making it suitable for real-time forensic applications. Abstract: With imaging devices delivering ever-higher resolutions and the emerging diffusion-based forgery methods, current detectors trained only on traditional datasets (with splicing, copy-moving and object removal forgeries) lack exposure to this new manipulation type. To address this, we propose a novel high-resolution SIF dataset of 1200+ diffusion-generated manipulations with semantically extracted masks. However, this also imposes a challenge on existing methods, as they face significant computational resource constraints due to their prohibitive computational complexities. Therefore, we propose a novel EfficientIML model with a lightweight, three-stage EfficientRWKV backbone. EfficientRWKV's hybrid state-space and attention network captures global context and local details in parallel, while a multi-scale supervision strategy enforces consistency across hierarchical predictions. 
Extensive evaluations on our dataset and standard benchmarks demonstrate that our approach outperforms ViT-based and other SOTA lightweight baselines in localization performance, FLOPs and inference speed, underscoring its suitability for real-time forensic applications.

[76] CLAPS: A CLIP-Unified Auto-Prompt Segmentation for Multi-Modal Retinal Imaging

Zhihao Zhao,Yinzheng Zhao,Junjie Yang,Xiangtong Yao,Quanmin Liang,Shahrooz Faghihroohi,Kai Huang,Nassir Navab,M. Ali Nasseri

Main category: cs.CV

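TL;DR: CLAPS is a new unified segmentation method that aims to overcome the limitations of current medical image segmentation methods in retinal imaging and achieves high-performance segmentation across multiple tasks and modalities.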

Details Motivation: Current medical image segmentation methods in retinal imaging face modality ambiguity, reliance on manual prompting, and the lack of a unified framework, so a unified and automated segmentation method is needed. Method: Pre-trains a CLIP-based image encoder on a large multi-modal retinal dataset, uses GroundingDINO to automatically generate spatial bounding-box prompts, and adds text prompts carrying a unique "modality signature" to unify tasks and resolve ambiguity; together these prompts guide SAM to produce precise segmentations. Result: Experiments on 12 diverse datasets show that CLAPS performs on par with specialized expert models and surpasses existing benchmarks on most metrics, demonstrating broad generalizability. Conclusion: CLAPS provides a fully automated, unified segmentation framework that addresses the limitations of current methods and performs well across tasks and modalities in retinal imaging. Abstract: Recent advancements in foundation models, such as the Segment Anything Model (SAM), have significantly impacted medical image segmentation, especially in retinal imaging, where precise segmentation is vital for diagnosis. Despite this progress, current methods face critical challenges: 1) modality ambiguity in textual disease descriptions, 2) a continued reliance on manual prompting for SAM-based workflows, and 3) a lack of a unified framework, with most methods being modality- and task-specific. To overcome these hurdles, we propose CLIP-unified Auto-Prompt Segmentation (CLAPS), a novel method for unified segmentation across diverse tasks and modalities in retinal imaging. Our approach begins by pre-training a CLIP-based image encoder on a large, multi-modal retinal dataset to handle data scarcity and distribution imbalance. We then leverage GroundingDINO to automatically generate spatial bounding box prompts by detecting local lesions. To unify tasks and resolve ambiguity, we use text prompts enhanced with a unique "modality signature" for each imaging modality. Ultimately, these automated textual and spatial prompts guide SAM to execute precise segmentation, creating a fully automated and unified pipeline. Extensive experiments on 12 diverse datasets across 11 critical segmentation categories show that CLAPS achieves performance on par with specialized expert models while surpassing existing benchmarks across most metrics, demonstrating its broad generalizability as a foundation model.

[77] AdsQA: Towards Advertisement Video Understanding

Xinwei Long,Kai Tian,Peng Xu,Guoli Jia,Jingxuan Li,Sa Yang,Yihua Shao,Kaiyan Zhang,Che Jiang,Hao Xu,Yang Liu,Jiaheng Ma,Bowen Zhou

Main category: cs.CV

TL;DR: The paper proposes AdsQA, a new benchmark for evaluating LLMs using advertisement videos, and introduces ReAd-R, a RL model that achieves state-of-the-art results on the benchmark.

Details Motivation: The motivation is to explore the potential of ad videos as a test-bed for LLMs, leveraging their rich and dense information traits such as marketing logic and persuasive strategies. Method: The paper proposes AdsQA, a benchmark derived from ad videos with multiple tasks, and introduces ReAd-R, a RL model that uses reward-driven optimization for answer generation. Result: The paper benchmarks 14 top-tier LLMs on AdsQA, and the proposed ReAd-R model achieves state-of-the-art results, outperforming strong competitors by a clear margin. Conclusion: The paper concludes that advertisement videos can serve as a challenging test-bed to evaluate and enhance the perception ability of LLMs beyond the objective physical content of common visual domains. Abstract: Large language models (LLMs) have taken a great step towards AGI. Meanwhile, an increasing number of domain-specific problems such as math and programming boost these general-purpose models to continuously evolve via learning deeper expertise. Now is thus the time further to extend the diversity of specialized applications for knowledgeable LLMs, though collecting high quality data with unexpected and informative tasks is challenging. In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs in perceiving beyond the objective physical content of common visual domain. Our motivation is to take full advantage of the clue-rich and information-dense ad videos' traits, e.g., marketing logic, persuasive strategies, and audience engagement. Our contribution is three-fold: (1) To our knowledge, this is the first attempt to use ad videos with well-designed tasks to evaluate LLMs. We contribute AdsQA, a challenging ad Video QA benchmark derived from 1,544 ad videos with 10,962 clips, totaling 22.7 hours, providing 5 challenging tasks. 
(2) We propose ReAd-R, a Deepseek-R1 styled RL model that reflects on questions and generates answers via reward-driven optimization. (3) We benchmark 14 top-tier LLMs on AdsQA, and our ReAd-R achieves state-of-the-art results, outperforming strong competitors equipped with long-chain reasoning capabilities by a clear margin.

[78] UOPSL: Unpaired OCT Predilection Sites Learning for Fundus Image Diagnosis Augmentation

Zhihao Zhao,Yinzheng Zhao,Junjie Yang,Xiangtong Yao,Quanmin Liang,Daniel Zapp,Kai Huang,Nassir Navab,M. Ali Nasseri

Main category: cs.CV

TL;DR: This study introduces UOPSL, an unpaired multimodal framework that uses OCT-derived spatial priors to enhance fundus image-based disease recognition, outperforming current benchmarks.

Details Motivation: The limited availability of paired OCT data and the inherent modality imbalance hinder progress in multimodal medical image diagnosis. Fundus photography is cost-effective, but capturing fine-grained spatial information remains a challenge. Method: A novel unpaired multimodal framework (UOPSL) was developed, employing contrastive learning on unpaired OCT and fundus images to learn predilection sites in the OCT latent space, which are then used to assist fundus image classification. Result: The framework outperformed existing benchmarks across 9 diverse datasets and 28 critical categories in extensive experiments. Conclusion: The proposed UOPSL framework effectively enhances fundus image-based disease recognition by utilizing OCT-derived spatial priors without requiring paired data. Abstract: Significant advancements in AI-driven multimodal medical image diagnosis have led to substantial improvements in ophthalmic disease identification in recent years. However, acquiring paired multimodal ophthalmic images remains prohibitively expensive. While fundus photography is simple and cost-effective, the limited availability of OCT data and inherent modality imbalance hinder further progress. Conventional approaches that rely solely on fundus or textual features often fail to capture fine-grained spatial information, as each imaging modality provides distinct cues about lesion predilection sites. In this study, we propose a novel unpaired multimodal framework, UOPSL, that utilizes extensive OCT-derived spatial priors to dynamically identify predilection sites, enhancing fundus image-based disease recognition. Our approach bridges unpaired fundus and OCT images via extended disease text descriptions. Initially, we employ contrastive learning on a large corpus of unpaired OCT and fundus images while simultaneously learning the predilection sites matrix in the OCT latent space. 
Through extensive optimization, this matrix captures lesion localization patterns within the OCT feature space. During the fine-tuning or inference phase of the downstream classification task based solely on fundus images, where paired OCT data is unavailable, we eliminate OCT input and utilize the predilection sites matrix to assist in fundus image classification learning. Extensive experiments conducted on 9 diverse datasets across 28 critical categories demonstrate that our framework outperforms existing benchmarks.

[79] LADB: Latent Aligned Diffusion Bridges for Semi-Supervised Domain Translation

Xuqin Wang,Tao Wu,Yanfeng Zhang,Lu Liu,Dong Wang,Mingwei Sun,Yongliang Wang,Niclas Zeller,Daniel Cremers

Main category: cs.CV

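TL;DR: LADB is a semi-supervised framework for sample-to-sample translation that bridges domain gaps using partially paired latent representations, enabling efficient domain translation in data-scarce domains.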

Details Motivation: Diffusion models excel at generating high-quality outputs but struggle in data-scarce domains, where exhaustive retraining or costly paired data are required. Method: LADB aligns source and target distributions within a shared latent space, seamlessly integrating a pretrained source-domain diffusion model with a target-domain Latent Aligned Diffusion Model (LADM). Result: Experiments show that LADB performs strongly in depth-to-image translation under partial supervision and extends to multi-source and multi-target translation tasks. Conclusion: LADB offers a scalable and versatile solution for domain translation in real-world scenarios where data annotation is costly or incomplete. Abstract: Diffusion models excel at generating high-quality outputs but face challenges in data-scarce domains, where exhaustive retraining or costly paired data are often required. To address these limitations, we propose Latent Aligned Diffusion Bridges (LADB), a semi-supervised framework for sample-to-sample translation that effectively bridges domain gaps using partially paired data. By aligning source and target distributions within a shared latent space, LADB seamlessly integrates pretrained source-domain diffusion models with a target-domain Latent Aligned Diffusion Model (LADM), trained on partially paired latent representations. This approach enables deterministic domain mapping without the need for full supervision. Compared to unpaired methods, which often lack controllability, and fully paired approaches that require large, domain-specific datasets, LADB strikes a balance between fidelity and diversity by leveraging a mixture of paired and unpaired latent-target couplings. Our experimental results demonstrate superior performance in depth-to-image translation under partial supervision. Furthermore, we extend LADB to handle multi-source translation (from depth maps and segmentation masks) and multi-target translation in a class-conditioned style transfer task, showcasing its versatility in handling diverse and heterogeneous use cases. Ultimately, we present LADB as a scalable and versatile solution for real-world domain translation, particularly in scenarios where data annotation is costly or incomplete.

[80] Skeleton-based sign language recognition using a dual-stream spatio-temporal dynamic graph convolutional network

Liangjin Liu,Haoyang Zheng,Pei Zhou

Main category: cs.CV

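TL;DR: This paper proposes DSLNet, a new isolated sign language recognition method that uses dual reference frames and a dual-stream architecture to improve recognition accuracy, achieving state-of-the-art results on multiple datasets.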

Details Motivation: Isolated sign language recognition is challenged by gestures that are morphologically similar yet semantically distinct. Traditional methods rely on a single reference frame and struggle to resolve this geometric ambiguity, so a more effective approach is needed. Method: Proposes DSLNet, which uses wrist-centric and facial-centric reference frames to model gesture morphology and trajectory respectively, processes the two streams with a topology-aware graph convolution and a Finsler geometry-based encoder, and fuses the results via a geometry-driven optimal transport mechanism. Result: DSLNet achieves 93.70%, 89.97% and 99.79% accuracy on the WLASL-100, WLASL-300 and LSA64 datasets respectively, significantly outperforming existing methods. Conclusion: Through its dual-reference, dual-stream architecture, DSLNet resolves the geometric ambiguity between gesture morphology and trajectory and reaches state-of-the-art accuracy on multiple datasets with fewer parameters. Abstract: Isolated Sign Language Recognition (ISLR) is challenged by gestures that are morphologically similar yet semantically distinct, a problem rooted in the complex interplay between hand shape and motion trajectory. Existing methods, often relying on a single reference frame, struggle to resolve this geometric ambiguity. This paper introduces Dual-SignLanguageNet (DSLNet), a dual-reference, dual-stream architecture that decouples and models gesture morphology and trajectory in separate, complementary coordinate systems. Our approach utilizes a wrist-centric frame for view-invariant shape analysis and a facial-centric frame for context-aware trajectory modeling. These streams are processed by specialized networks (a topology-aware graph convolution for shape and a Finsler geometry-based encoder for trajectory) and are integrated via a geometry-driven optimal transport fusion mechanism. DSLNet sets a new state-of-the-art, achieving 93.70%, 89.97% and 99.79% accuracy on the challenging WLASL-100, WLASL-300 and LSA64 datasets, respectively, with significantly fewer parameters than competing models.
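
The dual-reference idea in the abstract (a wrist-centric frame for shape, a facial-centric frame for trajectory) can be illustrated with a simple coordinate change. This is a minimal sketch under stated assumptions, not the DSLNet preprocessing: the function `to_reference_frame`, the 21-joint hand layout with the wrist at index 0 and a finger base at index 9, and the `nose` and `chin` keypoints are all hypothetical.

```python
import numpy as np

def to_reference_frame(joints, origin, scale_ref):
    """Express joints (T, J, 2) relative to a per-frame origin (T, 2),
    scaled by the per-frame distance between origin and scale_ref (T, 2)."""
    scale = np.linalg.norm(scale_ref - origin, axis=-1, keepdims=True)  # (T, 1)
    scale = np.maximum(scale, 1e-8)                                     # avoid division by zero
    return (joints - origin[:, None, :]) / scale[:, None, :]

# Hypothetical skeleton data: 4 frames, 21 hand joints, 2D image coordinates.
rng = np.random.default_rng(0)
hand = rng.normal(size=(4, 21, 2))
nose = rng.normal(size=(4, 2))
chin = nose + np.array([0.0, 1.0])    # assumed face keypoint one unit below the nose

# Shape stream: hand joints relative to the wrist (index 0 assumed),
# scaled by the wrist-to-finger-base distance (index 9 assumed).
shape_stream = to_reference_frame(hand, hand[:, 0], hand[:, 9])

# Trajectory stream: wrist position relative to the face, scaled by face size.
trajectory_stream = to_reference_frame(hand[:, :1], nose, chin)
```

The shape stream is unchanged when the whole hand is translated in the image, while the trajectory stream keeps where the hand is relative to the face, which is the separation the abstract describes.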

[81] FractalPINN-Flow: A Fractal-Inspired Network for Unsupervised Optical Flow Estimation with Total Variation Regularization

Sara Behnamian,Rasoul Khaksarinezhad,Andreas Langer

Main category: cs.CV

TL;DR: FractalPINN-Flow is an unsupervised deep learning framework for dense optical flow estimation that uses a Fractal Deformation Network and total variation regularization to produce accurate and smooth optical flow fields without ground truth.

Details Motivation: The motivation is to estimate dense optical flow directly from consecutive grayscale frames without requiring ground truth, and to capture both fine-grained details and long-range motion patterns. Method: FractalPINN-Flow uses an unsupervised deep learning framework with a Fractal Deformation Network (FDN) that is a recursive encoder-decoder inspired by fractal geometry and self-similarity. It uses repeated encoder-decoder nesting with skip connections to capture both fine-grained details and long-range motion patterns. The training objective is based on a classical variational formulation using total variation (TV) regularization. Result: Experiments on synthetic and benchmark datasets show that FractalPINN-Flow produces accurate, smooth, and edge-preserving optical flow fields. Conclusion: FractalPINN-Flow is especially effective for high-resolution data and scenarios with limited annotations. Abstract: We present FractalPINN-Flow, an unsupervised deep learning framework for dense optical flow estimation that learns directly from consecutive grayscale frames without requiring ground truth. The architecture centers on the Fractal Deformation Network (FDN) - a recursive encoder-decoder inspired by fractal geometry and self-similarity. Unlike traditional CNNs with sequential downsampling, FDN uses repeated encoder-decoder nesting with skip connections to capture both fine-grained details and long-range motion patterns. The training objective is based on a classical variational formulation using total variation (TV) regularization. Specifically, we minimize an energy functional that combines $L^1$ and $L^2$ data fidelity terms to enforce brightness constancy, along with a TV term that promotes spatial smoothness and coherent flow fields. Experiments on synthetic and benchmark datasets show that FractalPINN-Flow produces accurate, smooth, and edge-preserving optical flow fields. 
The model is especially effective for high-resolution data and scenarios with limited annotations.
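
The energy functional described in the abstract (L1 and L2 data terms enforcing brightness constancy plus a TV regularizer) can be written down in discrete form. The sketch below is an illustration under assumptions, not the paper's loss: it uses the linearized brightness-constancy residual, and the weights `alpha1`, `alpha2`, `lam` and the function name `flow_energy` are hypothetical.

```python
import numpy as np

def flow_energy(frame0, frame1, u, v, alpha1=1.0, alpha2=1.0, lam=0.1):
    """Discrete energy: L1 + L2 data terms on the linearized brightness-constancy
    residual, plus isotropic total variation of the flow components (u, v)."""
    iy, ix = np.gradient(frame0)          # spatial derivatives (axis 0 = y, axis 1 = x)
    it = frame1 - frame0                  # temporal derivative
    residual = ix * u + iy * v + it       # zero where brightness constancy holds
    data_l1 = np.abs(residual).sum()
    data_l2 = 0.5 * (residual ** 2).sum()

    def tv(f):
        # Forward differences; the last row/column is repeated so the border term is zero.
        dy = np.diff(f, axis=0, append=f[-1:, :])
        dx = np.diff(f, axis=1, append=f[:, -1:])
        return np.sqrt(dx ** 2 + dy ** 2).sum()

    return alpha1 * data_l1 + alpha2 * data_l2 + lam * (tv(u) + tv(v))

# Toy example: a horizontal intensity ramp that moves one pixel to the right.
frame0 = np.tile(np.arange(8.0), (8, 1))
frame1 = frame0 - 1.0
zeros = np.zeros((8, 8))
energy_true_flow = flow_energy(frame0, frame1, np.ones((8, 8)), zeros)
energy_zero_flow = flow_energy(frame0, frame1, zeros, zeros)
```

For this toy pair the true flow (u = 1, v = 0) gives zero energy, while zero flow is penalized by both data terms, so minimizing the energy without ground-truth labels favors the correct motion.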

[82] Multi-Modal Robust Enhancement for Coastal Water Segmentation: A Systematic HSV-Guided Framework

Zhen Tian,Christos Anagnostopoulos,Qiyuan Wang,Zhiwei Gao

Main category: cs.CV

TL;DR: This paper proposes a Robust U-Net framework for coastal water segmentation using HSV color space and multi-modal constraints, significantly improving segmentation quality and training stability.

Details Motivation: The motivation stems from the challenges faced by traditional RGB-based approaches in coastal water segmentation, such as training instability and poor generalization in diverse maritime environments. Method: The method involves a Robust U-Net framework that integrates five components: HSV-guided color supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control. Result: The results show that the Robust U-Net achieves an 84% variance reduction in training stability and demonstrates consistent improvements across multiple evaluation metrics while maintaining computational efficiency. Conclusion: The paper concludes that the proposed Robust U-Net framework significantly improves coastal water segmentation from satellite imagery by leveraging HSV color space supervision and multi-modal constraints, achieving superior training stability and segmentation quality. Abstract: Coastal water segmentation from satellite imagery presents unique challenges due to complex spectral characteristics and irregular boundary patterns. Traditional RGB-based approaches often suffer from training instability and poor generalization in diverse maritime environments. This paper introduces a systematic robust enhancement framework, referred to as Robust U-Net, that leverages HSV color space supervision and multi-modal constraints for improved coastal water segmentation. Our approach integrates five synergistic components: HSV-guided color supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control. Through comprehensive ablation studies, we demonstrate that HSV supervision provides the highest impact (0.85 influence score), while the complete framework achieves superior training stability (84% variance reduction) and enhanced segmentation quality. 
Our method shows consistent improvements across multiple evaluation metrics while maintaining computational efficiency. For reproducibility, our training configurations and code are available here: https://github.com/UofgCoastline/ICASSP-2026-Robust-Unet.

[83] Computational Imaging for Enhanced Computer Vision

Humera Shaikh,Kaur Jashanpreet

Main category: cs.CV

TL;DR: This survey explores computational imaging techniques and their transformative impact on computer vision applications, highlighting the potential for task-specific, adaptive imaging pipelines to improve robustness, accuracy, and efficiency in real-world scenarios.

Details Motivation: Conventional imaging methods often fail to deliver high-fidelity visual data in challenging conditions, such as low light, motion blur, or high dynamic range scenes, thereby limiting the performance of state-of-the-art computer vision systems. Method: This paper presents a comprehensive survey of computational imaging techniques and their impact on computer vision applications. It systematically explores the synergies between CI techniques and core CV tasks. Result: Computational imaging techniques, including light field imaging, high dynamic range imaging, deblurring, high-speed imaging, and glare mitigation, address these limitations by enhancing image acquisition and reconstruction processes. The survey explores the relationships between CI methods and their practical contributions to computer vision applications. Conclusion: This survey highlights the importance of computational imaging techniques in overcoming the limitations of conventional imaging methods and improving the performance of computer vision systems. It emphasizes the potential for task-specific, adaptive imaging pipelines to enhance robustness, accuracy, and efficiency in real-world applications. Abstract: This paper presents a comprehensive survey of computational imaging (CI) techniques and their transformative impact on computer vision (CV) applications. Conventional imaging methods often fail to deliver high-fidelity visual data in challenging conditions, such as low light, motion blur, or high dynamic range scenes, thereby limiting the performance of state-of-the-art CV systems. Computational imaging techniques, including light field imaging, high dynamic range (HDR) imaging, deblurring, high-speed imaging, and glare mitigation, address these limitations by enhancing image acquisition and reconstruction processes. 
This survey systematically explores the synergies between CI techniques and core CV tasks, including object detection, depth estimation, optical flow, face recognition, and keypoint detection. By analyzing the relationships between CI methods and their practical contributions to CV applications, this work highlights emerging opportunities, challenges, and future research directions. We emphasize the potential for task-specific, adaptive imaging pipelines that improve robustness, accuracy, and efficiency in real-world scenarios, such as autonomous navigation, surveillance, augmented reality, and robotics.

[84] BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion

Sike Xiang,Shuang Chen,Amir Atapour-Abarghouei

Main category: cs.CV

TL;DR: This paper proposes a lightweight MLLM framework named BcQLM, which balances accuracy and efficiency for deployment in resource-constrained environments.

Details Motivation: As multimodal large language models (MLLMs) advance, their large-scale architectures pose challenges for deployment in resource-constrained environments. In the age of large models, where energy efficiency, computational scalability and environmental sustainability are paramount, the development of lightweight and high-performance models is critical for real-world applications. Method: We proposed BcQLM, a lightweight vision-language framework based on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding. Result: With only 1.2 billion parameters overall, our model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Experiments conducted on multiple datasets further validate its effectiveness in balancing accuracy and efficiency. Conclusion: BcQLM offers a promising path toward deployable MLLMs under practical hardware constraints. Abstract: As multimodal large language models (MLLMs) advance, their large-scale architectures pose challenges for deployment in resource-constrained environments. In the age of large models, where energy efficiency, computational scalability and environmental sustainability are paramount, the development of lightweight and high-performance models is critical for real-world applications. As such, we propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding. With only 1.2 billion parameters overall, our model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Experiments conducted on multiple datasets further validate its effectiveness in balancing accuracy and efficiency. The modular and extensible design enables generalisation to broader multimodal tasks. 
The proposed lightweight vision-language framework is denoted as BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model). It offers a promising path toward deployable MLLMs under practical hardware constraints. The source code is available at https://github.com/thico0224/BcQLM.

[85] CrowdQuery: Density-Guided Query Module for Enhanced 2D and 3D Detection in Crowded Scenes

Marius Dähling,Sebastian Krebs,J. Marius Zöllner

Main category: cs.CV

TL;DR: This paper proposes CrowdQuery (CQ), a novel method that integrates object density information into transformer-based detectors to improve crowd detection in both 2D and 3D environments.

Details Motivation: Existing crowd detection methods often rely on head positions or spatial statistics for density maps, which may not capture the complexity of crowded scenes. The authors aim to bridge the gap between 2D and 3D detection while improving performance in crowded environments by leveraging density-guided object queries. Method: The authors propose a novel method called CrowdQuery (CQ) that introduces a module to predict and embed an object density map, incorporating it into the decoder of transformer-based detectors. This method extends existing density map definitions to include bounding box dimensions and is applied to both 2D and 3D detection frameworks (CQ2D and CQ3D). Result: Experiments on the STCrowd dataset show that the proposed method significantly improves performance in both 2D and 3D detection compared to base models and outperforms most state-of-the-art methods. Further tests on the CrowdHuman dataset demonstrate the generalizability of the approach. Conclusion: The paper concludes that integrating object density information into transformer-based detectors using the CQ module significantly enhances crowd detection performance in both 2D and 3D settings. Additionally, the method demonstrates generalizability across datasets and applicability to various transformer models. Abstract: This paper introduces a novel method for end-to-end crowd detection that leverages object density information to enhance existing transformer-based detectors. We present CrowdQuery (CQ), whose core component is our CQ module that predicts and subsequently embeds an object density map. The embedded density information is then systematically integrated into the decoder. Existing density map definitions typically depend on head positions or object-based spatial statistics. Our method extends these definitions to include individual bounding box dimensions. 
By incorporating density information into object queries, our method utilizes density-guided queries to improve detection in crowded scenes. CQ is universally applicable to both 2D and 3D detection without requiring additional data. Consequently, we are the first to design a method that effectively bridges 2D and 3D detection in crowded environments. We demonstrate the integration of CQ into both a general 2D and 3D transformer-based object detector, introducing the architectures CQ2D and CQ3D. CQ is not limited to the specific transformer models we selected. Experiments on the STCrowd dataset for both 2D and 3D domains show significant performance improvements compared to the base models, outperforming most state-of-the-art methods. When integrated into a state-of-the-art crowd detector, CQ can further improve performance on the challenging CrowdHuman dataset, demonstrating its generalizability. The code is released at https://github.com/mdaehl/CrowdQuery.
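
The abstract states that CrowdQuery extends density map definitions to include individual bounding box dimensions. One possible way to build such a map is sketched below, purely as an assumption: each box contributes a unit-mass Gaussian whose spread follows the box width and height. The function `box_density_map` and the `sigma_scale` parameter are hypothetical and are not taken from the paper or its code release.

```python
import numpy as np

def box_density_map(boxes, height, width, sigma_scale=0.25):
    """Sum one normalized Gaussian per box (x1, y1, x2, y2); the Gaussian's
    standard deviations scale with the box width and height."""
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width))
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        sx = max((x2 - x1) * sigma_scale, 1e-6)
        sy = max((y2 - y1) * sigma_scale, 1e-6)
        g = np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))
        total = g.sum()
        if total > 0:                 # skip boxes that fall entirely off the grid
            density += g / total      # each object contributes unit mass
    return density

# Two hypothetical pedestrians: a small box and a taller one.
boxes = [(2, 2, 6, 6), (10, 4, 14, 12)]
density = box_density_map(boxes, height=16, width=16)
```

Because each object adds unit mass, the map sums to the object count, and smaller boxes yield sharper peaks than larger ones, which is the extra size information a head-position-only density map would lack.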

[86] ArgoTweak: Towards Self-Updating HD Maps through Structured Priors

Lena Wild,Rafael Valencia,Patric Jensfelt

Main category: cs.CV

TL;DR: ArgoTweak is a new dataset with realistic map priors that narrows the sim2real gap through fine-grained change annotations and an explainable framework.

Details Motivation: Existing methods rely on synthetic priors, which leads to a significant sim2real gap, so a dataset providing the triplet of realistic prior maps, current maps, and sensor data is needed. Method: ArgoTweak employs a bijective mapping framework that breaks large-scale modifications down into fine-grained atomic changes at the map element level, ensuring interpretability. Result: Experiments show that training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors, and ablations further highlight the impact of structured priors and detailed change annotations. Conclusion: ArgoTweak is a new dataset that fills the gap left by existing datasets lacking realistic map priors. By introducing structured priors with fine-grained change annotations, it narrows the sim2real gap and advances explainable, prior-aided HD mapping. Abstract: Reliable integration of prior information is crucial for self-verifying and self-updating HD maps. However, no public dataset includes the required triplet of prior maps, current maps, and sensor data. As a result, existing methods must rely on synthetic priors, which create inconsistencies and lead to a significant sim2real gap. To address this, we introduce ArgoTweak, the first dataset to complete the triplet with realistic map priors. At its core, ArgoTweak employs a bijective mapping framework, breaking down large-scale modifications into fine-grained atomic changes at the map element level, thus ensuring interpretability. This paradigm shift enables accurate change detection and integration while preserving unchanged elements with high fidelity. Experiments show that training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors. Extensive ablations further highlight the impact of structured priors and detailed change annotations. By establishing a benchmark for explainable, prior-aided HD mapping, ArgoTweak advances scalable, self-improving mapping solutions. The dataset, baselines, map modification toolbox, and further resources are available at https://kth-rpl.github.io/ArgoTweak/.

[87] An End-to-End Deep Learning Framework for Arsenicosis Diagnosis Using Mobile-Captured Skin Images

Asif Newaz,Asif Ur Rahman Adib,Rajit Sahil,Mashfique Mehzad

Main category: cs.CV

TL;DR: This paper presents a deep learning framework for diagnosing arsenicosis using mobile phone-captured skin images, showing that Transformer-based models can effectively support early detection in resource-limited settings.

Details Motivation: Arsenicosis is a significant public health issue in South and Southeast Asia, primarily due to long-term consumption of arsenic-contaminated water. Early diagnosis is crucial, but underdiagnosis is common in rural areas with limited access to dermatologists. Method: An end-to-end framework for arsenicosis diagnosis using mobile phone-captured skin images was proposed. Multiple deep learning architectures, including CNNs and Transformer-based models, were benchmarked for detection. Result: Transformer-based models significantly outperformed CNNs, with the Swin Transformer achieving the best results (86% accuracy). Conclusion: The proposed framework demonstrates the potential of deep learning for non-invasive, accessible, and explainable diagnosis of arsenicosis from mobile-acquired images. Abstract: Background: Arsenicosis is a serious public health concern in South and Southeast Asia, primarily caused by long-term consumption of arsenic-contaminated water. Its early cutaneous manifestations are clinically significant but often underdiagnosed, particularly in rural areas with limited access to dermatologists. Automated, image-based diagnostic solutions can support early detection and timely interventions. Methods: In this study, we propose an end-to-end framework for arsenicosis diagnosis using mobile phone-captured skin images. A dataset comprising 20 classes and over 11000 images of arsenic-induced and other dermatological conditions was curated. Multiple deep learning architectures, including convolutional neural networks (CNNs) and Transformer-based models, were benchmarked for arsenicosis detection. Model interpretability was integrated via LIME and Grad-CAM, while deployment feasibility was demonstrated through a web-based diagnostic tool. Results: Transformer-based models significantly outperformed CNNs, with the Swin Transformer achieving the best results (86% accuracy). 
LIME and Grad-CAM visualizations confirmed that the models attended to lesion-relevant regions, increasing clinical transparency and aiding in error analysis. The framework also demonstrated strong performance on external validation samples, confirming its ability to generalize beyond the curated dataset. Conclusion: The proposed framework demonstrates the potential of deep learning for non-invasive, accessible, and explainable diagnosis of arsenicosis from mobile-acquired images. By enabling reliable image-based screening, it can serve as a practical diagnostic aid in rural and resource-limited communities, where access to dermatologists is scarce, thereby supporting early detection and timely intervention.

[88] Quantifying Accuracy of an Event-Based Star Tracker via Earth's Rotation

Dennis Melamed,Connor Hashemi,Scott McCloskey

Main category: cs.CV

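TL;DR: Event cameras show good accuracy and several practical advantages for star-tracking attitude determination, suggesting their utility in low-cost, low-latency applications.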

Details Motivation: Event cameras are an emerging technology whose potential for star-tracking-based attitude determination has not been fully validated, in particular because accurate ground truth for real data is hard to obtain. Method: The Earth's rotation is used as ground truth: orientation is estimated from the event stream produced by the event camera system and compared with the orientation measured by the IERS. Result: The event camera system achieves a root-mean-squared across error of 18.47 arcseconds and an about error of 78.84 arcseconds, while offering reduced computation, higher dynamic range, lower energy consumption, and faster update rates. Conclusion: Event cameras are practical for low-cost, low-latency star tracking and reach an acceptable level of accuracy. Abstract: Event-based cameras (EBCs) are a promising new technology for star tracking-based attitude determination, but prior studies have struggled to determine accurate ground truth for real data. We analyze the accuracy of an EBC star tracking system utilizing the Earth's motion as the ground truth for comparison. The Earth rotates in a regular way with very small irregularities which are measured to the level of milli-arcseconds. By keeping an event camera static and pointing it through a ground-based telescope at the night sky, we create a system where the only camera motion in the celestial reference frame is that induced by the Earth's rotation. The resulting event stream is processed to generate estimates of orientation which we compare to the International Earth Rotation and Reference Systems Service (IERS) measured orientation of the Earth. The event camera system is able to achieve a root mean squared across error of 18.47 arcseconds and an about error of 78.84 arcseconds. Combined with the other benefits of event cameras over framing sensors (reduced computation due to sparser data streams, higher dynamic range, lower energy consumption, faster update rates), this level of accuracy suggests the utility of event cameras for low-cost and low-latency star tracking. We provide all code and data used to generate our results: https://gitlab.kitware.com/nest-public/telescope_accuracy_quantification.
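
As a rough illustration of how a root-mean-squared error in arcseconds can be computed from paired orientation estimates, here is a minimal sketch. It simplifies each orientation to a single scalar angle in degrees, which is an assumption for illustration only; the function name `rms_error_arcsec` and the sample values are hypothetical and are not taken from the paper or its code repository.

```python
import math

ARCSEC_PER_DEG = 3600.0  # 1 degree = 3600 arcseconds


def rms_error_arcsec(estimated_deg, reference_deg):
    """Root-mean-squared difference between two angle series, in arcseconds.

    Both inputs are equal-length sequences of angles in degrees
    (a scalar simplification of a full 3D orientation).
    """
    if len(estimated_deg) != len(reference_deg) or not estimated_deg:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [
        ((est - ref) * ARCSEC_PER_DEG) ** 2
        for est, ref in zip(estimated_deg, reference_deg)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical sample: estimates that are off by +1 and -1 arcsecond.
reference = [10.0, 10.0]
estimated = [10.0 + 1.0 / 3600.0, 10.0 - 1.0 / 3600.0]
error = rms_error_arcsec(estimated, reference)
```

In a setup like the one described, the reference series would come from the measured Earth orientation and the estimated series from the processed event stream, with separate error figures per axis.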

[89] Handling Multiple Hypotheses in Coarse-to-Fine Dense Image Matching

Matthieu Vilain,Rémi Giraud,Yannick Berthoumieu,Guillaume Bourmaud

Main category: cs.CV

TL;DR: This paper introduces BEAMER, a novel dense matching architecture that improves robustness by predicting and propagating multiple correspondent hypotheses across scales, particularly useful in challenging cases like depth discontinuities or strong zoom-ins.

Details Motivation: In challenging cases, such as depth discontinuities or strong zoom-ins, current methods that produce a single correspondent hypothesis per source location often lead to erroneous matches. Method: Investigated the idea of predicting multiple correspondent hypotheses per source location at each scale using a beam search strategy and integrated these hypotheses into cross-attention layers, creating the BEAMER architecture. Result: The BEAMER architecture learns to preserve and propagate multiple hypotheses across scales. Conclusion: BEAMER is significantly more robust than state-of-the-art methods, especially at depth discontinuities or when the target image is a strong zoom-in of the source image. Abstract: Dense image matching aims to find a correspondent for every pixel of a source image in a partially overlapping target image. State-of-the-art methods typically rely on a coarse-to-fine mechanism where a single correspondent hypothesis is produced per source location at each scale. In challenging cases -- such as at depth discontinuities or when the target image is a strong zoom-in of the source image -- the correspondents of neighboring source locations are often widely spread and predicting a single correspondent hypothesis per source location at each scale may lead to erroneous matches. In this paper, we investigate the idea of predicting multiple correspondent hypotheses per source location at each scale instead. We consider a beam search strategy to propagate multiple hypotheses at each scale and propose integrating these multiple hypotheses into cross-attention layers, resulting in a novel dense matching architecture called BEAMER. BEAMER learns to preserve and propagate multiple hypotheses across scales, making it significantly more robust than state-of-the-art methods, especially at depth discontinuities or when the target image is a strong zoom-in of the source image.
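
The idea of keeping several hypotheses alive from one scale to the next can be sketched with a generic beam search step. This is not the BEAMER architecture, which integrates the hypotheses into cross-attention layers; it only shows the top-k pruning that beam search performs. The names `beam_step` and `refine`, the 1D coordinates, and the scores are all hypothetical.

```python
import heapq


def beam_step(beams, expand, k):
    """One beam search step.

    beams:  list of (score, hypothesis) pairs kept from the previous scale.
    expand: function mapping a hypothesis to (score_delta, child) pairs.
    k:      number of hypotheses to keep.
    Returns the k highest-scoring (score, child) pairs.
    """
    candidates = [
        (score + delta, child)
        for score, hyp in beams
        for delta, child in expand(hyp)
    ]
    return heapq.nlargest(k, candidates, key=lambda c: c[0])


def refine(coord):
    """Toy coarse-to-fine refinement: a coarse 1D coordinate maps to two
    finer candidates, each with a hypothetical log-score penalty."""
    return [(-0.1, 2 * coord), (-0.7, 2 * coord + 1)]


beams = [(0.0, 0)]          # start with a single hypothesis at the coarsest scale
for _ in range(2):          # two refinement scales
    beams = beam_step(beams, refine, k=2)
```

With `k=1` this reduces to the single-hypothesis coarse-to-fine scheme; larger `k` lets a lower-scoring candidate survive in case it turns out to be the right match at a finer scale.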

[90] GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts

Jenna Kang,Maria Silva,Patsorn Sangkloy,Kenneth Chen,Niall Williams,Qi Sun

Main category: cs.CV

TL;DR: This paper introduces GeneVA, a benchmark dataset designed to address spatio-temporal artifacts in text-driven video generation, enabling performance evaluation and quality improvements.

Details Motivation: Addressing unpredictable artifacts in text-driven video generation, such as impossible physics and temporal inconsistencies, requires systematic benchmarks which current datasets lack. Method: The paper introduces GeneVA, a large-scale dataset with rich human annotations focused on spatio-temporal artifacts in text-generated videos. Result: GeneVA aims to bridge the gap in evaluating text-driven video generation by focusing on spatio-temporal complexities and enabling critical applications like performance benchmarking and quality improvement. Conclusion: GeneVA is expected to enhance the quality of generative video models by providing a systematic benchmark for evaluating and improving model performance. Abstract: Recent advances in probabilistic generative models have extended capabilities from static image synthesis to text-driven video generation. However, the inherent randomness of their generation process can lead to unpredictable artifacts, such as impossible physics and temporal inconsistency. Progress in addressing these challenges requires systematic benchmarks, yet existing datasets primarily focus on generative images due to the unique spatio-temporal complexities of videos. To bridge this gap, we introduce GeneVA, a large-scale artifact dataset with rich human annotations that focuses on spatio-temporal artifacts in videos generated from natural text prompts. We hope GeneVA can enable and assist critical applications, such as benchmarking model performance and improving generative video quality.

[91] RewardDance: Reward Scaling in Visual Generation

Jie Wu,Yu Gao,Zilyu Ye,Ming Li,Liang Li,Hanzhong Guo,Jie Liu,Zeyue Xue,Xiaoxia Hou,Wei Liu,Yan Zeng,Weilin Huang

Main category: cs.CV

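TL;DR: RewardDance is a new scalable reward modeling framework that addresses the reward hacking problem and significantly improves the performance and output quality of visual generation models.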

Details Motivation: Existing CLIP-based reward models and Bradley-Terry losses have architectural and mechanistic limitations, and the RLHF optimization process suffers from reward hacking, so a new scalable reward modeling framework is needed. Method: The reward score is reformulated as the model's probability of predicting a "yes" token, yielding a new generative reward paradigm that supports scaling along both the model and the context dimensions. Result: RewardDance significantly outperforms existing methods on text-to-image, text-to-video, and image-to-video generation, and mitigates the mode collapse problem that affects smaller models. Conclusion: RewardDance effectively addresses reward hacking and significantly improves the quality and diversity of generated outputs. Abstract: Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by the Reward Hacking issue, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model's probability of predicting a "yes" token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: Systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: Integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation.
Crucially, we resolve the persistent challenge of "reward hacking": Our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and ability to produce diverse, high-quality outputs. It greatly relieves the mode collapse problem that plagues smaller models.
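
The generative reward described in this entry, a score equal to the model's probability of predicting a "yes" token, can be illustrated with a minimal sketch. It assumes the probability is a softmax over next-token logits; the function name `yes_token_reward`, the three-word vocabulary, and the logit values are hypothetical and are not taken from the paper.

```python
import math


def yes_token_reward(logits, yes_id):
    """Softmax probability assigned to the "yes" token.

    logits: next-token logits, one per vocabulary entry.
    yes_id: index of the "yes" token in that vocabulary.
    """
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[yes_id] / sum(exps)


# Hypothetical toy vocabulary and logits for the comparison prompt
# "does the generated image outperform the reference image?"
vocab = {"yes": 0, "no": 1, "maybe": 2}
logits = [2.0, 0.5, -1.0]
reward = yes_token_reward(logits, vocab["yes"])
```

Because the score is read directly from next-token prediction, it fits a VLM without an extra regression head, which is the alignment the abstract contrasts with Bradley-Terry losses.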

[92] SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video

David Stotko,Reinhard Klein

Main category: cs.CV

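TL;DR: This work proposes a new method for 3D fabric reconstruction and rendering that uses physical simulation and differentiable rendering to significantly improve reconstruction accuracy and detail recovery from monocular video input.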

Details Motivation: To resolve the depth ambiguity of monocular video in 3D dynamic scene reconstruction while improving reconstruction accuracy and rendering quality. Method: A physical simulation of the cloth geometry is combined with differentiable rendering, and two new regularization terms are introduced to address depth ambiguity in monocular video. Result: Compared with the most recent methods, the 3D reconstruction error is reduced by a factor of 2.64 at a runtime of 30 minutes per scene, and the appearance of the deforming object is estimated with high quality. Conclusion: The paper presents a novel approach that combines 3D geometry reconstruction with appearance estimation for physically based rendering, achieving high-quality 3D reconstruction and rendering of fabrics from a single monocular RGB video sequence. Abstract: The reconstruction of three-dimensional dynamic scenes is a well-established yet challenging task within the domain of computer vision. In this paper, we propose a novel approach that combines the domains of 3D geometry reconstruction and appearance estimation for physically based rendering and present a system that is able to perform both tasks for fabrics, utilizing only a single monocular RGB video sequence as input. In order to obtain realistic and high-quality deformations and renderings, a physical simulation of the cloth geometry and differentiable rendering are employed. In this paper, we introduce two novel regularization terms for the 3D reconstruction task that improve the plausibility of the reconstruction by addressing the depth ambiguity problem in monocular video. In comparison with the most recent methods in the field, we have reduced the error in the 3D reconstruction by a factor of 2.64 while requiring a medium runtime of 30 min per scene. Furthermore, the optimized motion achieves sufficient quality to perform an appearance estimation of the deforming object, recovering sharp details from this single monocular RGB video.