cs.CL
[1] Structured Information Matters: Explainable ICD Coding with Patient-Level Knowledge Graphs
Mingyang Li,Viktor Schlegel,Tingting Mu,Warren Del-Pinto,Goran Nenadic
Main category: cs.CL
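TL;DR: The study uses document-level knowledge graphs to improve automated ICD coding, improving the efficiency and explainability of clinical data processing.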
Details
Motivation: Manually coding clinical documents is difficult and time-consuming; automated coding can improve the availability and accuracy of structured clinical data. Method: Document-level knowledge graphs are constructed and integrated into the state-of-the-art ICD coding architecture PLM-ICD for evaluation. Result: Experiments show Macro-F1 improvements of up to 3.20%, along with improved training efficiency. Conclusion: Document-level knowledge graphs effectively represent patient-centred input documents and improve automated ICD coding. Abstract: Mapping clinical documents to standardised clinical vocabularies is an important task, as it provides structured data for information retrieval and analysis, which is essential to clinical research, hospital administration and improving patient care. However, manual coding is both difficult and time-consuming, making it impractical at scale. Automated coding can potentially alleviate this burden, improving the availability and accuracy of structured clinical data. The task is difficult to automate, as it requires mapping to high-dimensional and long-tailed target spaces, such as the International Classification of Diseases (ICD). While external knowledge sources have been readily utilised to enhance output code representation, the use of external resources for representing the input documents has been underexplored. In this work, we compute a structured representation of the input documents, making use of document-level knowledge graphs (KGs) that provide a comprehensive structured view of a patient's condition. The resulting knowledge graph efficiently represents the patient-centred input documents with 23% of the original text while retaining 90% of the information. We assess the effectiveness of this graph for automated ICD-9 coding by integrating it into the state-of-the-art ICD coding architecture PLM-ICD. Our experiments yield improved Macro-F1 scores by up to 3.20% on popular benchmarks, while improving training efficiency. We attribute this improvement to different types of entities and relationships in the KG, and demonstrate the improved explainability potential of the approach over the text-only baseline.
[2] Cross-Layer Attention Probing for Fine-Grained Hallucination Detection
Malavika Suresh,Rahaf Aljundi,Ikechukwu Nkisi-Orji,Nirmalie Wiratunga
Main category: cs.CL
TL;DR: This paper proposes Cross-Layer Attention Probing (CLAP), a new technique for detecting hallucinations in large language models that improves model reliability.
Details
Motivation: With the large-scale adoption of large language models across applications, their tendency to generate inaccurate text (hallucinations) is a growing reliability concern. Method: Cross-Layer Attention Probing (CLAP) processes the LLM activations across the entire residual stream as a joint sequence. Result: Empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection over baselines on both greedy-decoded responses and responses sampled at higher temperatures, enabling fine-grained detection. Conclusion: CLAP is a new activation probing technique for hallucination detection that can improve the reliability of large language models. Abstract: With the large-scale adoption of Large Language Models (LLMs) in various applications, there is a growing reliability concern due to their tendency to generate inaccurate text, i.e. hallucinations. In this work, we propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection, which processes the LLM activations across the entire residual stream as a joint sequence. Our empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection compared to baselines on both greedy decoded responses as well as responses sampled at higher temperatures, thus enabling fine-grained detection, i.e. the ability to disambiguate hallucinations and non-hallucinations among different sampled responses to a given prompt. This allows us to propose a detect-then-mitigate strategy using CLAP to reduce hallucinations and improve LLM reliability compared to direct mitigation approaches. Finally, we show that CLAP maintains high reliability even when applied out-of-distribution.
[3] Optimal Multi-Task Learning at Regularization Horizon for Speech Translation Task
JungHo Jung,Junhyun Lee
Main category: cs.CL
TL;DR: The paper explores regularization techniques in Multi-Task Learning for end-to-end speech-to-text translation, introducing the concept of a regularization horizon for optimal performance.
Details
Motivation: The motivation stems from the scarcity of paired speech-text data in end-to-end speech-to-text translation, which leads researchers to explore Multi-Task Learning using bitext data from Machine Translation to overcome this limitation. Method: The paper formulates Multi-Task Learning from a regularization perspective and explores the regularization of sequences within and across modalities, using consistency regularization and R-drop as methods, and examining the coefficient of MT loss as another regularization source. Result: The result of the research is the introduction of the optimal regularization contour in high-dimensional space, termed the regularization horizon, which allows for achieving near state-of-the-art performance when hyperparameters are tuned within this contour. Conclusion: The study concludes that by using three sources of regularization, near state-of-the-art performance can be achieved on the MuST-C dataset in the context of end-to-end speech-to-text translation. Abstract: End-to-end speech-to-text translation typically suffers from the scarcity of paired speech-text data. One way to overcome this shortcoming is to utilize the bitext data from the Machine Translation (MT) task and perform Multi-Task Learning (MTL). In this paper, we formulate MTL from a regularization perspective and explore how sequences can be regularized within and across modalities. By thoroughly investigating the effect of consistency regularization (different modality) and R-drop (same modality), we show how they respectively contribute to the total regularization. We also demonstrate that the coefficient of MT loss serves as another source of regularization in the MTL setting. With these three sources of regularization, we introduce the optimal regularization contour in the high-dimensional space, called the regularization horizon. 
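The three regularization sources named in the abstract can be pictured with a small numerical sketch. It assumes, purely for illustration, that consistency regularization and R-drop are both measured as a symmetric KL divergence between two predicted distributions and that the terms are combined as a weighted sum. The function names and the coefficients `mt_coef`, `cons_coef`, and `rdrop_coef` are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl(p, q, eps=1e-12):
    """Mean KL(p || q) over all leading axes."""
    per_item = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(per_item))


def symmetric_kl(logits_a, logits_b):
    """Bidirectional KL between two sets of predictions.

    R-drop style: the two inputs come from two dropout passes of the
    same modality. Consistency style: they come from the speech and
    text branches for the same sentence.
    """
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))


def mtl_loss(st_loss, mt_loss, consistency, rdrop,
             mt_coef, cons_coef, rdrop_coef):
    """Hypothetical weighted sum of the speech-translation loss and the
    three regularization sources (MT loss, consistency, R-drop)."""
    return (st_loss + mt_coef * mt_loss
            + cons_coef * consistency + rdrop_coef * rdrop)
```

Under this reading, the three coefficients span the hyperparameter space in which the abstract locates its regularization horizon; the actual objective used by the authors may differ.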
Experiments show that tuning the hyperparameters within the regularization horizon achieves near state-of-the-art performance on the MuST-C dataset.
[4] Creativity Benchmark: A benchmark for marketing creativity for LLM models
Ninad Bhat,Kieran Browne,Pip Bingemann
Main category: cs.CL
TL;DR: The Creativity Benchmark evaluates large language models (LLMs) in marketing creativity and finds no dominant model, emphasizing the necessity of human evaluation and diversity-aware approaches.
Details
Motivation: The motivation behind this study is to evaluate the performance of large language models (LLMs) in the context of marketing creativity and to determine whether automated evaluations can replace human judgment. Method: The researchers introduced the Creativity Benchmark, which includes 100 brands across 12 categories and three prompt types. They collected human pairwise preferences from 678 practicing creatives through 11,012 anonymized comparisons, analyzed using Bradley-Terry models. Model diversity was assessed using cosine distances, and the effectiveness of LLM-as-judge setups was compared with human rankings. Result: Results show tightly clustered performance among LLMs, with a top-bottom spread of Δθ ≈ 0.45 and a head-to-head win probability of 0.61. The highest-rated model beats the lowest only about 61% of the time. Model diversity analysis captures intra- and inter-model variation, and comparisons with human rankings reveal weak, inconsistent correlations and judge-specific biases. Conclusion: The study concludes that no single large language model (LLM) dominates in marketing creativity across different brands and prompt types, highlighting the importance of expert human evaluation and diversity-aware workflows. Abstract: We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, which implies a head-to-head win probability of $0.61$; the highest-rated model beats the lowest only about $61\%$ of the time. 
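The step from the spread of 0.45 to a win probability of 0.61 can be checked with a short calculation. This sketch assumes the standard logistic form of the Bradley-Terry model, in which strengths are on a natural-log scale and the win probability is the logistic function of the strength difference; the abstract does not spell out the parameterization, and the function name is illustrative.

```python
import math


def bt_win_probability(theta_a, theta_b):
    """P(a beats b) under a logistic Bradley-Terry model (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(theta_a - theta_b)))


# Top-bottom spread reported in the abstract: delta theta of about 0.45.
p_top_beats_bottom = bt_win_probability(0.45, 0.0)
print(round(p_top_beats_bottom, 2))  # 0.61
```

A spread of zero gives exactly 0.5, so 0.61 means the best model is preferred only modestly more often than chance.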
We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
[5] CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor
Zhenhua Xu,Xixiang Zhao,Xubin Yue,Shengwei Tian,Changting Lin,Meng Han
Main category: cs.CL
TL;DR: CTCC is an effective and innovative fingerprinting framework for securing intellectual property in large language models, offering improved stealth, robustness, and practicality.
Details
Motivation: Concerns around intellectual property protection have intensified with the widespread deployment of large language models, necessitating improved model fingerprinting techniques. Method: Introduce CTCC, a novel rule-driven fingerprinting framework that encodes contextual correlations across multiple dialogue turns. Result: CTCC achieves stronger stealth and robustness compared to prior methods, enabling fingerprint verification under black-box access while mitigating false positives and fingerprint leakage. Conclusion: CTCC is a reliable and practical solution for ownership verification in real-world LLM deployment scenarios. Abstract: The widespread deployment of large language models (LLMs) has intensified concerns around intellectual property (IP) protection, as model theft and unauthorized redistribution become increasingly feasible. To address this, model fingerprinting aims to embed verifiable ownership traces into LLMs. However, existing methods face inherent trade-offs between stealthness, robustness, and generalizability, being either detectable via distributional shifts, vulnerable to adversarial modifications, or easily invalidated once the fingerprint is revealed. In this work, we introduce CTCC, a novel rule-driven fingerprinting framework that encodes contextual correlations across multiple dialogue turns, such as counterfactual, rather than relying on token-level or single-turn triggers. CTCC enables fingerprint verification under black-box access while mitigating false positives and fingerprint leakage, supporting continuous construction under a shared semantic rule even if partial triggers are exposed. Extensive experiments across multiple LLM architectures demonstrate that CTCC consistently achieves stronger stealth and robustness than prior work. Our findings position CTCC as a reliable and practical solution for ownership verification in real-world LLM deployment scenarios. 
Our code and data are publicly available at
[6] Temporal Preferences in Language Models for Long-Horizon Assistance
Ali Mazyaki,Mohammad Naghizadeh,Samaneh Ranjkhah Zonouzaghi,Hossein Setareh
Main category: cs.CL
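TL;DR: This study examines the time preferences of language models in intertemporal choice and how manipulable they are. It finds that reasoning-focused models tend to choose delayed rewards under future-oriented prompts but personalize decisions only to a limited degree, and it outlines research directions for AI assistant design that account for long-horizon goals and personalized calibration.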
Details
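Motivation: To study whether language models exhibit human-like intertemporal choice preferences and whether those preferences can be systematically manipulated, in order to inform the design and use of AI assistants. Method: Adapting human experimental protocols, the authors evaluate multiple language models on time-tradeoff tasks and compare them with human decision makers. They introduce an operational metric, the Manipulability of Time Orientation (MTO), which measures how a model's time preference changes across prompts. Result: Reasoning-focused models (e.g., DeepSeek-Reasoner and grok-3-mini) are more inclined to choose later options under future-oriented prompts but only partially personalize decisions across identities or geographies; models that correctly reason about time orientation internalize a future orientation for themselves as AI decision makers. Conclusion: Language models exhibit future- or present-oriented preferences in intertemporal choice, and these preferences can be systematically manipulated. The study discusses how AI assistant design should align with heterogeneous, long-horizon goals and proposes a research agenda on personalized contextual calibration and socially aware deployment. Abstract: We study whether language models (LMs) exhibit future- versus present-oriented preferences in intertemporal choice and whether those preferences can be systematically manipulated. Using adapted human experimental protocols, we evaluate multiple LMs on time-tradeoff tasks and benchmark them against a sample of human decision makers. We introduce an operational metric, the Manipulability of Time Orientation (MTO), defined as the change in an LM's revealed time preference between future- and present-oriented prompts. In our tests, reasoning-focused models (e.g., DeepSeek-Reasoner and grok-3-mini) choose later options under future-oriented prompts but only partially personalize decisions across identities or geographies. Moreover, models that correctly reason about time orientation internalize a future orientation for themselves as AI decision makers. We discuss design implications for AI assistants that should align with heterogeneous, long-horizon goals and outline a research agenda on personalized contextual calibration and socially aware deployment.
[7] The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks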
Claudio Pinhanez,Paulo Cavalin,Cassia Sanctos,Marcelo Grave,Yago Primerano
Main category: cs.CL
TL;DR: The study examines the consistency and accuracy of small and medium-sized LLMs in answering repeated questions, finding that medium models are more consistent, with small models showing moderate consistency at low inference temperatures.
Details
Motivation: To understand the consistency of small LLMs in answering the same question multiple times and how consistency affects accuracy. Method: The study involves testing open-source LLMs on multiple-choice benchmarks (MMLU-Redux and MedQA) with repeated questions, varying inference temperatures, model sizes (small vs. medium), and comparing base and fine-tuned models. New analytical and graphical tools were also developed. Result: Small models typically answer 50%-80% of questions consistently at low inference temperatures, and accuracy among consistent answers correlates with overall accuracy. Conclusion: Medium-sized models show much higher levels of answer consistency compared to small models. Abstract: This work explores the consistency of small LLMs (2B-8B parameters) in answering multiple times the same question. We present a study on known, open-source LLMs responding to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small vs. medium models (50B-80B), finetuned vs. base models, and other parameters. We also look into the effects of requiring multi-trial answer consistency on accuracy and the trade-offs involved in deciding which model best provides both of them. To support those studies, we propose some new analytical and graphical tools. Results show that the number of questions which can be answered consistently vary considerably among models but are typically in the 50%-80% range for small models at low inference temperatures. Also, accuracy among consistent answers seems to reasonably correlate with overall accuracy. Results for medium-sized models seem to indicate much higher levels of answer consistency.
[8] Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal
Nirmalendu Prakash,Yeo Wei Jie,Amir Abdullah,Ranjan Satapathy,Erik Cambria,Roy Ka Wei Lee
Main category: cs.CL
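TL;DR: This study uses sparse autoencoders to analyze how instruction-tuned models refuse harmful prompts, revealing the underlying mechanism and suggesting possible interventions.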
Details
Motivation: Refusing harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet its internal causes remain unclear. Method: The study analyzes two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs), and runs a three-stage search over the SAE latent space for feature sets whose ablation flips the model from refusal to compliance. Result: The study identifies a broad set of jailbreak-critical features, as well as redundant features that activate only when earlier features are suppressed, shedding light on the mechanistic basis of refusal. Conclusion: Manipulating the interpretable latent space enables fine-grained auditing of, and targeted intervention in, safety behaviours. Abstract: Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.
[9] Assisting Research Proposal Writing with Large Language Models: Evaluation and Refinement
Jing Ren,Weiqi Wang
Main category: cs.CL
TL;DR: This study introduces metrics and a prompting method to objectively evaluate and improve LLM academic writing, reducing ethical issues like reference fabrication.
Details
Motivation: The motivation stems from the increasing use of LLMs in academic writing, accompanied by ethical concerns such as fabricated references, and the lack of objective, consistent, and reliable methods for evaluating content quality. Method: This study proposes two evaluation metrics—content quality and reference validity—and an iterative prompting method based on scores derived from these metrics to quantitatively evaluate and enhance the writing performance of LLMs. Result: The experiments show that the proposed metrics provide an objective, quantitative framework for assessing LLM writing performance. Iterative prompting significantly enhances content quality while reducing reference inaccuracies and fabrications. Conclusion: The study concludes that the proposed metrics and iterative prompting method effectively improve the content quality and reference validity of LLMs like ChatGPT, addressing key ethical challenges in academic writing. Abstract: Large language models (LLMs) like ChatGPT are increasingly used in academic writing, yet issues such as incorrect or fabricated references raise ethical concerns. Moreover, current content quality evaluations often rely on subjective human judgment, which is labor-intensive and lacks objectivity, potentially compromising the consistency and reliability. In this study, to provide a quantitative evaluation and enhance research proposal writing capabilities of LLMs, we propose two key evaluation metrics--content quality and reference validity--and an iterative prompting method based on the scores derived from these two metrics. Our extensive experiments show that the proposed metrics provide an objective, quantitative framework for assessing ChatGPT's writing performance. 
Additionally, iterative prompting significantly enhances content quality while reducing reference inaccuracies and fabrications, addressing critical ethical challenges in academic contexts.
[10] Generating Individual Travel Diaries Using Large Language Models Informed by Census and Land-Use Data
Sepehr Golrokh Amin,Devin Rhoads,Fatemeh Fakhrmoosavi,Nicholas E. Lownes,John N. Ivan
Main category: cs.CL
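TL;DR: This paper proposes a large language model (LLM) scheme for generating individual travel diaries in agent-based transportation models. It synthesizes traveler personas from open-source data and, in comparison with classical methods, shows advantages in determining trip purpose and in consistency.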
Details
Motivation: Traditional travel diary generation relies on large quantities of proprietary household travel survey data; this paper aims to generate more flexible and representative travel diaries from open-source data and LLMs, reducing dependence on costly data. Method: Traveler personas are generated stochastically from American Community Survey (ACS) and Smart Location Database (SLD) data, and diaries are then synthesized through direct prompting. A novel one-to-cohort realism score is introduced, and the model is validated against diaries from the Connecticut Statewide Transportation Study (CSTS). Result: LLM-generated diaries achieve overall realism comparable to classical methods (LLM mean: 0.485 vs. 0.455) and perform better on trip purpose and consistency, while classical models lead in numerical estimates of trip count and activity duration. Aggregate validation further confirms the LLM's advantage in statistical representativeness (LLM mean: 0.612 vs. 0.435). Conclusion: The LLM demonstrates zero-shot viability for generating travel diaries and provides a quantifiable metric of diary realism, laying groundwork for future synthetic diary evaluation systems. Abstract: This study introduces a Large Language Model (LLM) scheme for generating individual travel diaries in agent-based transportation models. While traditional approaches rely on large quantities of proprietary household travel surveys, the method presented in this study generates personas stochastically from open-source American Community Survey (ACS) and Smart Location Database (SLD) data, then synthesizes diaries through direct prompting. This study features a novel one-to-cohort realism score: a composite of four metrics (Trip Count Score, Interval Score, Purpose Score, and Mode Score) validated against the Connecticut Statewide Transportation Study (CSTS) diaries, matched across demographic variables. The validation utilizes Jensen-Shannon Divergence to measure distributional similarities between generated and real diaries. When compared to diaries generated with classical methods (Negative Binomial for trip generation; Multinomial Logit for mode/purpose) calibrated on the validation set, LLM-generated diaries achieve comparable overall realism (LLM mean: 0.485 vs. 0.455). The LLM excels in determining trip purpose and demonstrates greater consistency (narrower realism score distribution), while classical models lead in numerical estimates of trip count and activity duration. Aggregate validation confirms the LLM's statistical representativeness (LLM mean: 0.612 vs. 0.435), demonstrating LLM's zero-shot viability and establishing a quantifiable metric of diary realism for future synthetic diary evaluation systems.
[11] Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry
Aya E. Fouda,Abdelrahamn A. Hassan,Radwa J. Hanafy,Mohammed E. Fouda
Main category: cs.CL
TL;DR: PsychiatryBench is a new benchmark for assessing large language models (LLMs) in psychiatric applications, highlighting gaps in clinical performance and the need for specialized improvements.
Details
Motivation: Current evaluations of LLMs in psychiatry rely on limited datasets, which undermines their clinical validity. PsychiatryBench aims to offer a robust and clinically relevant evaluation framework. Method: The study introduces PsychiatryBench, a benchmark derived from psychiatric textbooks and casebooks, comprising 11 question-answering tasks. It evaluates multiple LLMs using standard metrics and an 'LLM-as-judge' scoring system. Result: The evaluation of several LLMs showed significant gaps in clinical consistency and safety, particularly in complex tasks like follow-up and management planning. Conclusion: PsychiatryBench is a valuable platform for evaluating and improving LLMs in mental health applications, highlighting the need for specialized tuning and better evaluation methods. Abstract: Large language models (LLMs) hold great promise in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources heavily rely on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of psychiatric reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats totaling over 5,300 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, LLaMA 3, and QWQ-32) alongside leading open-source medical models (e.g., OpenBiloLLM, MedGemma) using both conventional metrics and an "LLM-as-judge" similarity scoring framework. 
Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in high-stakes mental health applications.
[12] The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization
Talha Tahir
Main category: cs.CL
TL;DR: ORPO training significantly improves a small language model's performance in delivering ACT; COT reasoning helps only SFT models.
Details
Motivation: The study explores how different training methods and explicit reasoning affect the ability of a small open-weight large language model (LLM) to deliver ACT. Method: Using 50 sets of synthetic ACT transcripts generated by Mistral-Large, Llama-3.2-3b-Instruct is trained with supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Result: ORPO-trained models significantly outperform the SFT and base Instruct models on ACT fidelity (χ²(5) = 185.15, p < .001) and therapeutic empathy (χ²(5) = 140.37, p < .001). COT significantly improves SFT models (by an average of 2.68 points, p < .001) but offers no clear advantage for ORPO or Instruct models. Conclusion: Preference-aligned policy optimization (ORPO) is more effective than supervised fine-tuning (SFT) at improving a small language model's ACT performance, particularly in learning the therapeutic process rather than imitating content. Abstract: Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study investigates the impact of post-training methodology and explicit reasoning on the ability of a small open-weight large language model (LLM) to deliver ACT. Using 50 sets of synthetic ACT transcripts generated by Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Performance was evaluated by comparing these four post-trained variants against the base Instruct model. These models were benchmarked in simulated therapy sessions, with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned on human evaluations. Our findings demonstrate that the ORPO-trained models significantly outperformed both their SFT and Instruct counterparts on ACT fidelity ($\chi^2(5) = 185.15, p < .001$) and therapeutic empathy ($\chi^2(5) = 140.37, p < .001$). The effect of COT was conditional as it provided a significant benefit to SFT models, improving ACT-FM scores by an average of 2.68 points ($p < .001$), while offering no discernible advantage to the superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO stems from its ability to learn the therapeutic 'process' over imitating 'content,' a key aspect of ACT, while COT acts as a necessary scaffold for models trained only via imitation. This study establishes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the utility of explicit reasoning is highly dependent on the underlying training paradigm.
[13] HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering
Duolin Sun,Dan Yang,Yue Shen,Yihan Jiao,Zhehao Tan,Jie Feng,Lianzhen Zhong,Jian Wang,Peng Wei,Jinjie Gu
Main category: cs.CL
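TL;DR: This paper proposes the HANRAG framework, which uses heuristics to improve retrieval-augmented generation and address noise and efficiency problems in multi-hop queries.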
Details
Motivation: Existing RAG methods struggle with multi-hop queries, for example through over-reliance on iterative retrieval and noise accumulation caused by retrieving with the original complex query. Method: The HANRAG framework enhances retrieval-augmented generation (RAG) through query routing, decomposition of questions into sub-questions, and filtering of noise from retrieved documents. Result: HANRAG outperforms other leading methods across multiple benchmarks, with better performance and adaptability. Conclusion: HANRAG is a heuristic-based approach that handles questions of varying complexity effectively, improves the system's adaptability and noise resistance, and performs well on both single-hop and multi-hop question answering. Abstract: The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods still face numerous challenges when dealing with multi-hop queries. For instance, some approaches overly rely on iterative retrieval, wasting too many retrieval steps on compound queries. Additionally, using the original complex query for retrieval may fail to capture content relevant to specific sub-queries, resulting in noisy retrieved content. If the noise is not managed, it can lead to the problem of noise accumulation. To address these issues, we introduce HANRAG, a novel heuristic-based framework designed to efficiently tackle problems of varying complexity. Driven by a powerful revelator, HANRAG routes queries, decomposes them into sub-queries, and filters noise from retrieved documents. This enhances the system's adaptability and noise resistance, making it highly capable of handling diverse queries. We compare the proposed framework against other leading industry methods across various benchmarks. The results demonstrate that our framework obtains superior performance in both single-hop and multi-hop question-answering tasks.
[14] How Small Transformation Expose the Weakness of Semantic Similarity Measures
Serge Lionel Nikiema,Albérick Euraste Djire,Abdoul Aziz Bonkoungou,Micheline Bénédicte Moumoula,Jordan Samhi,Abdoul Kader Kabore,Jacques Klein,Tegawendé F. Bissyande
Main category: cs.CL
TL;DR: This study evaluates 18 semantic similarity methods in software engineering contexts, revealing that embedding techniques often misjudge semantic opposites as similar, while LLM-based approaches better distinguish meaning differences.
Details
Motivation: Semantic similarity measurement is crucial for software engineering applications like code search and refactoring tools, but there are concerns about whether current methods truly understand semantic relationships or rely on surface patterns. Method: The study evaluated 18 similarity measurement methods, including word-based, embedding techniques, LLM-based systems, and structure-aware algorithms, using a systematic framework that applied controlled changes to text and code. Result: Embedding methods had notable issues, with some identifying semantic opposites as similar up to 99.9% of the time. Switching to cosine similarity improved their performance by 24-66%. LLM-based approaches performed better, giving low similarity scores (0.00-0.29) to genuinely different meanings, unlike embedding methods that assigned high scores (0.82-0.99) to dissimilar content. Conclusion: LLM-based approaches perform better in distinguishing semantic differences compared to embedding methods, which often give incorrect similarity assessments. Adjusting distance calculation methods, like switching to cosine similarity, can significantly improve performance. Abstract: This research examines how well different methods measure semantic similarity, which is important for various software engineering applications such as code search, API recommendations, automated code reviews, and refactoring tools. While large language models are increasingly used for these similarity assessments, questions remain about whether they truly understand semantic relationships or merely recognize surface patterns. The study tested 18 different similarity measurement approaches, including word-based methods, embedding techniques, LLM-based systems, and structure-aware algorithms. The researchers created a systematic testing framework that applies controlled changes to text and code to evaluate how well each method handles different types of semantic relationships. 
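The reported gain from switching to cosine similarity can be made concrete with a toy comparison. The vectors below are invented for illustration and are not embeddings from the study. They only show one plausible mechanism, stated here as an assumption: Euclidean distance depends on vector length, whereas cosine similarity depends on direction alone, so the two measures can rank the same pairs differently.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return float(np.linalg.norm(u - v))


# Toy vectors (hypothetical, not real embeddings).
a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                    # same direction as a, much longer
c = np.array([3.0, 2.0, 1.0])   # similar length to a, different direction

# Cosine treats a and b as identical in direction and c as less similar.
print(cosine_similarity(a, b), cosine_similarity(a, c))

# Euclidean distance reverses the ranking: c looks far closer to a than b does.
print(euclidean_distance(a, b), euclidean_distance(a, c))
```

Whether this length sensitivity is what drove the 24-66% improvement is not stated in the summary; the sketch shows only that the choice of measure can change which pair counts as most similar.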
The results revealed significant issues with commonly used metrics. Some embedding-based methods incorrectly identified semantic opposites as similar up to 99.9 percent of the time, while certain transformer-based approaches occasionally rated opposite meanings as more similar than synonymous ones. The study found that embedding methods' poor performance often stemmed from how they calculate distances; switching from Euclidean distance to cosine similarity improved results by 24 to 66 percent. LLM-based approaches performed better at distinguishing semantic differences, producing low similarity scores (0.00 to 0.29) for genuinely different meanings, compared to embedding methods that incorrectly assigned high scores (0.82 to 0.99) to dissimilar content.
[15] Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA
Naveen Lamba,Sanju Tiwari,Manas Gaur
Main category: cs.CL
TL;DR: The research identifies key properties that make LLMs vulnerable to hallucinations, highlighting that symbolic elements continue to confuse models regardless of scale.
Details
Motivation: The research aims to identify and characterize the key properties that make LLMs intrinsically vulnerable to hallucinations, which have not previously been studied. Method: The research used two established datasets, HaluEval and TruthfulQA, converting their question-answering format into various other formats to isolate the properties causing hallucinations. Result: The findings reveal that hallucination percentages across symbolic properties are notably high for Gemma-2-2B, averaging 79.0% across tasks and datasets, and decrease with increased model size. However, a substantial amount of hallucination caused by symbolic properties persists. Conclusion: The research concludes that symbolic properties continue to confuse LLMs, indicating a fundamental weakness in how these models process such inputs regardless of their scale. Abstract: Hallucination in Large Language Models (LLMs) is a well-studied problem. However, the properties that make LLMs intrinsically vulnerable to hallucinations have not been identified and studied. This research identifies and characterizes the key properties, allowing us to pinpoint vulnerabilities within the model's internal mechanisms. To substantiate these properties, we used two established datasets, HaluEval and TruthfulQA, and converted their existing question-answering format into various other formats to narrow down these properties as the reason for the hallucinations. Our findings reveal that hallucination percentages across symbolic properties are notably high for Gemma-2-2B, averaging 79.0% across tasks and datasets. With increased model scale, hallucination drops to 73.6% for Gemma-2-9B and 63.9% for Gemma-2-27B, reflecting a 15 percentage point reduction overall. Although the hallucination rate decreases as the model size increases, a substantial amount of hallucination caused by symbolic properties still persists. 
This is especially evident for modifiers (ranging from 84.76% to 94.98%) and named entities (ranging from 83.87% to 93.96%) across all Gemma models and both datasets. These findings indicate that symbolic elements continue to confuse the models, pointing to a fundamental weakness in how these LLMs process such inputs--regardless of their scale.[16] ALIGNS: Unlocking nomological networks in psychological measurement through a large language model
Kai R. Larsen,Sen Yan,Roland Müller,Lan Sang,Mikko Rönkkö,Ravi Starzl,Donald Edmondson
Main category: cs.CL
TL;DR: This paper introduces ALIGNS, a large language model-based system that provides comprehensive nomological networks across various disciplines, addressing a longstanding challenge in measurement validation.
Details
Motivation: Building nomological networks has remained a challenge since their proposal by Cronbach and Meehl, with practical consequences such as the failure of clinical trials to detect treatment effects and public policy targeting incorrect outcomes. Method: The study introduces ALIGNS, a large language model-based system trained with validated questionnaire measures, which provides three comprehensive nomological networks containing over 550,000 indicators across multiple disciplines. Result: The authors report the classification accuracy tests used to develop the model and three evaluations. The first showed that the NIH PROMIS anxiety and depression instruments converge into a single dimension of emotional distress. The second identified four potential new dimensions in child temperament measures and questioned one existing dimension. In the third, an applicability check, expert psychometricians assessed the system's importance, accessibility, and suitability. Conclusion: ALIGNS is presented as the first application of large language models to a foundational problem in measurement validation; it offers large-scale nomological networks for several fields and is freely available at nomologicalnetwork.org. Abstract: Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks (theoretical maps of how concepts and measures relate to establish validity) remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. 
This represents the first application of large language models to solve a foundational problem in measurement validation. We report classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures and identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system's importance, accessibility, and suitability. ALIGNS is freely available at nomologicalnetwork.org, complementing traditional validation methods with large-scale nomological analysis.[17] DiTTO-LLM: Framework for Discovering Topic-based Technology Opportunities via Large Language Model
Wonyoung Kim,Sujeong Seo,Juhyun Lee
Main category: cs.CL
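TL;DR: This paper proposes a framework for identifying technology opportunities through patent data analysis, focusing on development trends and potential opportunities in artificial intelligence technology.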
Details
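Motivation: Technology opportunities are key information driving progress in technology, industry, and innovation, so an effective method for identifying emerging technology opportunities is needed. Method: The paper proposes a framework based on the temporal relationships between technologies: it extracts text from a patent dataset, maps text-based topics to discover inter-technology relationships, and tracks how these topics change over time to identify technology opportunities. The framework uses a large language model to extract topics and a prompt-based chat language model to support the discovery of technology opportunities. Result: Evaluated on an artificial intelligence patent dataset provided by the United States Patent and Trademark Office, the experiments suggest that artificial intelligence technology is evolving into forms that facilitate everyday accessibility. Conclusion: The proposed framework shows potential for identifying future technology opportunities and indicates that artificial intelligence technology is evolving into forms that facilitate everyday accessibility. Abstract: Technology opportunities are critical information that serves as a foundation for advancements in technology, industry, and innovation. This paper proposes a framework based on the temporal relationships between technologies to identify emerging technology opportunities. The proposed framework begins by extracting text from a patent dataset, followed by mapping text-based topics to discover inter-technology relationships. Technology opportunities are then identified by tracking changes in these topics over time. To enhance efficiency, the framework leverages a large language model to extract topics and employs a prompt for a chat-based language model to support the discovery of technology opportunities. The framework was evaluated using an artificial intelligence patent dataset provided by the United States Patent and Trademark Office. The experimental results suggest that artificial intelligence technology is evolving into forms that facilitate everyday accessibility. This approach demonstrates the potential of the proposed framework to identify future technology opportunities.[18] BIBERT-Pipe on Biomedical Nested Named Entity Linking at BioASQ 2025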
Chunyu Li,Xindi Zheng,Siqi Liu
Main category: cs.CL
TL;DR: This paper presents a lightweight system for Multilingual Biomedical Nested Named Entity Linking that achieved strong results in the BioNNE 2025 shared task.
Details
Motivation: Entity linking (EL) for biomedical text is usually limited to English corpora, neglecting the more realistic multilingual and nested-entity scenario. This paper aims to fill that gap. Method: The paper proposes a lightweight two-stage system with retrieval and ranking stages. Both stages use the same encoder model: the retrieval stage uses the pre-trained model, and the ranking stage applies domain-specific fine-tuning. Boundary tags ([Ms] / [Me]) are introduced to improve robustness to nested and overlapping entities, and dataset augmentation expands the training data. Result: On the BioNNE 2025 leaderboard, the system, BIBERT-Pipe, ranked third in the multilingual track, demonstrating its effectiveness and competitiveness. Conclusion: With minimal yet principled modifications, the paper achieves multilingual biomedical nested entity linking and shows its potential for practical use. Abstract: Entity linking (EL) for biomedical text is typically benchmarked on English-only corpora with flat mentions, leaving the more realistic scenario of nested and multilingual mentions largely unexplored. We present our system for the BioNNE 2025 Multilingual Biomedical Nested Named Entity Linking shared task (English & Russian), closing this gap with a lightweight pipeline that keeps the original EL model intact and modifies only three task-aligned components: (1) Two-stage retrieval-ranking. We leverage the same base encoder model in both stages: the retrieval stage uses the original pre-trained model, while the ranking stage applies domain-specific fine-tuning. (2) Boundary cues. In the ranking stage, we wrap each mention with learnable [Ms] / [Me] tags, providing the encoder with an explicit, language-agnostic span cue that improves robustness to overlap and nesting. (3) Dataset augmentation. We also automatically expand the ranking training corpus with three complementary data sources, enhancing coverage without extra manual annotation. On the BioNNE 2025 leaderboard, our two-stage system, Bilingual BERT (BIBERT-Pipe), ranks third in the multilingual track, demonstrating the effectiveness and competitiveness of these minimal yet principled modifications. Code is publicly available at https://github.com/Kaggle-Competitions-Code/BioNNE-L.[19] Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure
Seiji Hattori,Takuya Matsuzaki,Makoto Fujiwara
Main category: cs.CL
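TL;DR: This paper studies a method for using LLMs to translate formal proofs into natural language and evaluates the readability and accuracy of its output.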
Details
Motivation: The paper explores how the informalization and summarization capabilities of LLMs can be used to translate formal proofs into natural language. Method: The paper proposes a natural language translation method for machine-verifiable formal proofs that leverages the informalization and summarization capabilities of LLMs. Result: The method was applied to formal proof data created from natural language proofs in an undergraduate-level textbook, and the output was analyzed in comparison with the original natural language proofs. Conclusion: The proposed method can output highly readable and accurate natural language proofs, with quality checked by comparison against the original natural language proofs. Abstract: This paper proposes a natural language translation method for machine-verifiable formal proofs that leverages the informalization (verbalization of formal language proof steps) and summarization capabilities of LLMs. For evaluation, it was applied to formal proof data created in accordance with natural language proofs taken from an undergraduate-level textbook, and the quality of the generated natural language proofs was analyzed in comparison with the original natural language proofs. Furthermore, we will demonstrate that this method can output highly readable and accurate natural language proofs by applying it to an existing formal proof library of the Lean proof assistant.[20] A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs
Andy Zhu,Yingjun Du
Main category: cs.CL
TL;DR: A new multi-agent framework significantly enhances financial question answering by integrating role-based prompting and iterative refinement, outperforming existing methods.
Details
Motivation: Existing LLM approaches struggle with the nuanced, multistep reasoning required in financial problem-solving, prompting the need for a specialized framework. Method: A multi-agent framework incorporating a Base Generator, Evidence Retriever, and Expert Reviewer was developed. Retrieval-augmented generation (RAG) and domain-specific prompting strategies were employed to refine answers iteratively. Result: The framework improved answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines, with Gemini-2.0-Flash performing best. GPT-4o-mini achieved performance comparable to FinGPT-mt_Llama3-8B_LoRA. Conclusion: The proposed multi-agent framework enhances financial QA performance effectively, offering a cost-efficient solution and insights for future research on multi-agent LLM systems in finance. Abstract: Question answering (QA) plays a central role in financial education, yet existing large language model (LLM) approaches often fail to capture the nuanced and specialized reasoning required for financial problem-solving. The financial domain demands multistep quantitative reasoning, familiarity with domain-specific terminology, and comprehension of real-world scenarios. We present a multi-agent framework that leverages role-based prompting to enhance performance on domain-specific QA. Our framework comprises a Base Generator, an Evidence Retriever, and an Expert Reviewer agent that work in a single-pass iteration to produce a refined answer. We evaluated our framework on a set of 3,532 expert-designed finance education questions from Study.com, an online learning platform. We leverage retrieval-augmented generation (RAG) for contextual evidence from 6 finance textbooks and prompting strategies for a domain-expert reviewer. Our experiments indicate that critique-based refinement improves answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines, with the highest performance from Gemini-2.0-Flash. 
Furthermore, our method enables GPT-4o-mini to achieve performance comparable to the finance-tuned FinGPT-mt_Llama3-8B_LoRA. Our results show a cost-effective approach to enhancing financial QA and offer insights for further research in multi-agent financial LLM systems.[21] A meta-analysis on the performance of machine-learning based language models for sentiment analysis
Elena Rohde,Jonas Klingwort,Christian Borgs
Main category: cs.CL
TL;DR: This paper evaluates ML performance in Twitter sentiment analysis through a meta-analysis, revealing that commonly used metrics like overall accuracy can be misleading and stressing the need for standardized reporting practices.
Details
Motivation: The motivation is to evaluate the average performance of ML models in sentiment analysis for Twitter data, understand heterogeneity between and within studies, and determine how study characteristics influence model performance. Method: The study conducted a meta-analysis using PRISMA guidelines, analyzing 195 trials from 20 studies. It employed double arcsine transformation and a three-level random effects model to estimate average performance and assess heterogeneity. Result: The average overall accuracy of the AIC-optimized model was 0.80 [0.76, 0.84]. The study found that overall accuracy can be misleading and that standardized reporting of model performance is essential for reliable comparisons. Conclusion: The paper concludes that overall accuracy, while widely used, is often misleading due to its sensitivity to class imbalance and the number of sentiment classes, emphasizing the importance of normalization and standardized reporting of model performance, including confusion matrices for reliable comparisons. Abstract: This paper presents a meta-analysis evaluating ML performance in sentiment analysis for Twitter data. The study aims to estimate the average performance, assess heterogeneity between and within studies, and analyze how study characteristics influence model performance. Using PRISMA guidelines, we searched academic databases and selected 195 trials from 20 studies with 12 study features. Overall accuracy, the most reported performance metric, was analyzed using double arcsine transformation and a three-level random effects model. The average overall accuracy of the AIC-optimized model was 0.80 [0.76, 0.84]. This paper provides two key insights: 1) Overall accuracy is widely used but often misleading due to its sensitivity to class imbalance and the number of sentiment classes, highlighting the need for normalization. 
2) Standardized reporting of model performance, including reporting confusion matrices for independent test sets, is essential for reliable comparisons of ML classifiers across studies, which seems far from common practice.[22] MultimodalHugs: Enabling Sign Language Processing in Hugging Face
Gerard Sant,Zifan Jiang,Carlos Escolano,Amit Moryossef,Mathias Müller,Rico Sennrich,Sarah Ebling
Main category: cs.CL
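TL;DR: This paper introduces MultimodalHugs, a framework that extends Hugging Face to address reproducibility and flexibility problems in sign language processing research.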
Details
Motivation: Sign language processing research lacks flexible tooling, which leads to poor reproducibility and unfair comparisons, so a more general and flexible framework is needed. Method: The authors surveyed sign language processing researchers and ran quantitative experiments to validate MultimodalHugs. Result: MultimodalHugs supported experiments with multiple modalities, such as pose estimation data for sign languages and pixel data for text characters. Conclusion: MultimodalHugs is a framework built on top of Hugging Face that supports a wider range of data modalities and tasks, addressing reproducibility and compatibility problems in sign language processing research. Abstract: In recent years, sign language processing (SLP) has gained importance in the general field of Natural Language Processing. However, compared to research on spoken languages, SLP research is hindered by complex ad-hoc code, inadvertently leading to low reproducibility and unfair comparisons. Existing tools that are built for fast and reproducible experimentation, such as Hugging Face, are not flexible enough to seamlessly integrate sign language experiments. This view is confirmed by a survey we conducted among SLP researchers. To address these challenges, we introduce MultimodalHugs, a framework built on top of Hugging Face that enables more diverse data modalities and tasks, while inheriting the well-known advantages of the Hugging Face ecosystem. Even though sign languages are our primary focus, MultimodalHugs adds a layer of abstraction that makes it more widely applicable to other use cases that do not fit one of the standard templates of Hugging Face. We provide quantitative experiments to illustrate how MultimodalHugs can accommodate diverse modalities such as pose estimation data for sign languages, or pixel data for text characters.[23] Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
Haiyang Yu,Yuchuan Wu,Fan Shi,Lei Liao,Jinghui Lu,Xiaodong Ge,Han Wang,Minghan Zhuo,Xuecheng Wu,Xiang Fei,Hao Feng,Guozhi Tang,An-Lan Wang,Hanshen Zhu,Yangfan He,Quanhuan Liang,Liyuan Meng,Chao Feng,Can Huang,Jingqun Tang,Bin Li
Main category: cs.CL
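TL;DR: To address the difficulty of digitizing and understanding Chinese ancient documents, the authors propose AncientDoc, a benchmark for evaluating vision-language models.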
Details
Motivation: Chinese ancient documents are valuable carriers of history and culture but are difficult to digitize and understand. Existing document benchmarks focus on English printed text or simplified Chinese and offer no tools for evaluating ancient Chinese documents. Method: AncientDoc comprises five tasks and 14 document types; mainstream vision-language models are evaluated with multiple metrics, supplemented by scoring from a human-aligned large language model. Result: AncientDoc can assess vision-language models from OCR to knowledge reasoning and covers a variety of document types and books. Conclusion: AncientDoc is a new benchmark for evaluating vision-language models on Chinese ancient documents across multiple tasks, from OCR to knowledge reasoning. Abstract: Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding, i.e., traditional methods only scan images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap for evaluating VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring.[24] MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools
Zikang Guo,Benfeng Xu,Chiwei Zhu,Wentao Hong,Xiaorui Wang,Zhendong Mao
Main category: cs.CL
TL;DR: The paper introduces MCP-AgentBench, a comprehensive benchmark for evaluating AI agents' performance in MCP-mediated tool interactions, featuring a robust testbed, 600 structured queries, and a novel evaluation methodology.
Details
Motivation: Current benchmarks fail to accurately assess agent performance in the MCP paradigm, creating a need for a more realistic and comprehensive evaluation framework. Method: The authors developed a benchmark with 600 queries across 6 categories and introduced MCP-Eval, an outcome-oriented evaluation methodology. They also established a testbed with 33 servers and 188 tools. Result: The benchmark enables rigorous assessment of language agent capabilities in MCP-mediated tool interactions and provides foundational insights through empirical evaluations. Conclusion: MCP-AgentBench aims to provide a standardized framework for developing and evaluating AI agents that can fully utilize MCP, advancing progress toward capable and interoperable AI systems. Abstract: The Model Context Protocol (MCP) is rapidly emerging as a pivotal open standard, designed to enhance agent-tool integration and interoperability, and is positioned to unlock a new era of powerful, interconnected, and genuinely utilitarian agentic AI. However, despite MCP's growing adoption, existing benchmarks often fail to capture real-world agent performance within this new paradigm, leading to a distorted perception of their true operational value and an inability to reliably differentiate proficiencies. To bridge this critical evaluation gap, we introduce MCP-AgentBench -- a comprehensive benchmark specifically engineered to rigorously assess language agent capabilities in MCP-mediated tool interactions. Core contributions of MCP-AgentBench include: the establishment of a robust MCP testbed comprising 33 operational servers with 188 distinct tools; the development of a benchmark featuring 600 systematically designed queries distributed across 6 distinct categories of varying interaction complexity; and the introduction of MCP-Eval, a novel outcome-oriented evaluation methodology prioritizing real-world task success. 
Through extensive empirical evaluation of leading language agents, we provide foundational insights. MCP-AgentBench aims to equip the research community with a standardized and reliable framework to build, validate, and advance agents capable of fully leveraging MCP's transformative benefits, thereby accelerating progress toward truly capable and interoperable AI systems.[25] Discrimination by LLMs: Cross-lingual Bias Assessment and Mitigation in Decision-Making and Summarisation
Willem Huijzer,Jieying Chen
Main category: cs.CL
TL;DR: This study examines gender, age, and background biases in LLMs like GPT-3.5 and GPT-4o, finding significant bias in decision-making tasks but less so in summarization. New prompt-based mitigation strategies reduced bias by up to 27%, with GPT-4o showing less bias than GPT-3.5. The findings emphasize the need for careful LLM adoption and bias mitigation strategies.
Details
Motivation: The rapid integration of Large Language Models (LLMs) into various domains raises concerns about societal inequalities and information bias. This study aims to examine biases related to background, gender, and age in LLMs and their impact on decision-making and summarization tasks, as well as explore strategies to mitigate these biases. Method: The research used an adapted version of the dataset by Tamkin et al. (2023), translated into Dutch, to create over 150,000 unique prompts for decision-making and summarization tasks. These prompts, varying by demographic variables, instructions, and languages, were tested on GPT-3.5 and GPT-4o. The study evaluated bias across tasks and languages and tested the effectiveness of prompt-based mitigation strategies. Result: Both GPT-3.5 and GPT-4o showed significant bias in decision-making, favoring female gender, younger ages, and specific backgrounds like African-American. In contrast, summarization tasks showed minimal bias, although age-related differences were observed for GPT-3.5 in English. Cross-lingual analysis revealed broadly similar bias patterns between English and Dutch, with some differences across demographic categories. Mitigation instructions reduced bias by up to 27% on average. GPT-4o showed reduced biases compared to GPT-3.5, especially in English prompts. Conclusion: The study concludes that while LLMs like GPT-3.5 and GPT-4o show significant bias in decision-making tasks, particularly related to gender, age, and background, newer models like GPT-4o demonstrate reduced biases. Prompt-based mitigation strategies can reduce bias to some extent, emphasizing the importance of context-specific bias testing and continued development of mitigation strategies for responsible AI deployment. Abstract: The rapid integration of Large Language Models (LLMs) into various domains raises concerns about societal inequalities and information bias. 
This study examines biases in LLMs related to background, gender, and age, with a focus on their impact on decision-making and summarization tasks. Additionally, the research examines the cross-lingual propagation of these biases and evaluates the effectiveness of prompt-instructed mitigation strategies. Using an adapted version of the dataset by Tamkin et al. (2023) translated into Dutch, we created 151,200 unique prompts for the decision task and 176,400 for the summarisation task. Various demographic variables, instructions, salience levels, and languages were tested on GPT-3.5 and GPT-4o. Our analysis revealed that both models were significantly biased during decision-making, favouring female gender, younger ages, and certain backgrounds such as the African-American background. In contrast, the summarisation task showed minimal evidence of bias, though significant age-related differences emerged for GPT-3.5 in English. Cross-lingual analysis showed that bias patterns were broadly similar between English and Dutch, though notable differences were observed across specific demographic categories. The newly proposed mitigation instructions, while unable to eliminate biases completely, demonstrated potential in reducing them. The most effective instruction achieved a 27% mean reduction in the gap between the most and least favorable demographics. Notably, contrary to GPT-3.5, GPT-4o displayed reduced biases for all prompts in English, indicating the specific potential for prompt-based mitigation within newer models. This research underscores the importance of cautious adoption of LLMs and context-specific bias testing, highlighting the need for continued development of effective mitigation strategies to ensure responsible deployment of AI.[26] HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning
Brennen Hill
Main category: cs.CL
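TL;DR: This paper proposes HEFT, a new hierarchical adaptation strategy that combines LoRA and ReFT to substantially improve the reasoning ability of large language models with fewer computational resources.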
Details
Motivation: Adapting large language models to specialized reasoning tasks is constrained by computational resources, and combining different parameter-efficient fine-tuning methods may improve both performance and efficiency. Method: The paper introduces HEFT (Hierarchical Efficient Fine-Tuning), which first applies LoRA adaptation in the weight space and then ReFT refinement in the representation space. Result: A Llama-2-7B model fine-tuned with HEFT reached 85.17% accuracy on the BoolQ benchmark after only three epochs, exceeding the LoRA-only and ReFT-only approaches. Conclusion: By composing LoRA and ReFT hierarchically across the weight and representation spaces, HEFT improves language model reasoning more efficiently and effectively. Abstract: The adaptation of large language models (LLMs) to specialized reasoning tasks is fundamentally constrained by computational resources. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a powerful solution, yet the landscape of these techniques is diverse, with distinct methods operating in either the model's weight space or its representation space. This paper investigates the hypothesis that a synergistic combination of these paradigms can unlock superior performance and efficiency. We introduce HEFT (Hierarchical Efficient Fine-Tuning), a novel hierarchical adaptation strategy that composes two distinct PEFT methods in a coarse-to-fine manner: first, a broad, foundational adaptation in the weight space using Low-Rank Adaptation (LoRA), followed by a precise, surgical refinement of internal activations using Representation Fine-Tuning (ReFT). We evaluate this approach by fine-tuning a Llama-2-7B model on the BoolQ benchmark, a challenging dataset for inferential reasoning. Our results reveal a profound synergistic effect. A model fine-tuned for only three epochs with our HEFT strategy achieves an accuracy of 85.17%, exceeding the performance of models trained for 20 epochs with either LoRA-only (85.05%) or ReFT-only (83.36%) methodologies. This work demonstrates that the thoughtful composition of PEFT methods is a potent algorithmic innovation, offering a more efficient and effective path toward advancing the reasoning capabilities of language models. 
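As a rough illustration of what "adaptation in the weight space" means for the LoRA stage, the sketch below applies the standard low-rank update W + (alpha / r) * B A to a frozen weight matrix. It is not the HEFT implementation: the matrices, the `alpha` and `rank` values, and the helper names are toy assumptions, and the ReFT stage (which edits hidden representations and leaves weights untouched) is not shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); W itself stays frozen."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Hypothetical toy shapes: W is 2x3, B is 2x1, A is 1x3, so rank = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[0.1, 0.2, 0.3]]
alpha, rank = 2.0, 1

# B is commonly initialized to zero, so the adapted layer starts out
# identical to the frozen base layer.
B_init = [[0.0], [0.0]]
w_init = lora_effective_weight(W, B_init, A, alpha, rank)

# After (imagined) training, B is non-zero and the effective weight shifts
# by a rank-1 matrix, while W is never modified.
B_trained = [[1.0], [0.5]]
w_adapted = lora_effective_weight(W, B_trained, A, alpha, rank)
print(w_init)
print(w_adapted)
```

Only B and A would be trained in such a setup, which is where the parameter savings of the weight-space stage come from.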
By achieving superior results with a fraction of the computational budget, our findings present a principled approach to overcoming the obstacles inherent in adapting large-scale models for complex cognitive tasks.[27] Pragmatic Frames Evoked by Gestures: A FrameNet Brasil Approach to Multimodality in Turn Organization
Helen de Andrade Abreu,Tiago Timponi Torrent,Ely Edison da Silva Matos
Main category: cs.CL
TL;DR: The paper proposes a framework for modeling multimodal conversational turn organization by analyzing correlations between language and gestures, enriching the Frame2 dataset with pragmatic frame annotations to better understand human cognition and language.
Details
Motivation: The motivation is to encode specific strategies, especially gestures, used by communicators in conversational turn organization, which had not been previously recorded in datasets for machine learning. Method: The paper develops an annotation methodology to enrich a multimodal dataset (Frame2) with pragmatic frames to model conversational turn organization. Result: The results confirm that gestures are used for passing, taking, and keeping conversational turns, and variations of some gestures were identified that had not been documented before. Conclusion: The paper concludes that gestures are used by communicators as tools for managing conversational turns, and the annotation of pragmatic frames enhances the understanding of human cognition and language. Abstract: This paper proposes a framework for modeling multimodal conversational turn organization via the proposition of correlations between language and interactive gestures, based on analysis as to how pragmatic frames are conceptualized and evoked by communicators. As a means to provide evidence for the analysis, we developed an annotation methodology to enrich a multimodal dataset (annotated for semantic frames) with pragmatic frames modeling conversational turn organization. Although conversational turn organization has been studied by researchers from diverse fields, the specific strategies, especially gestures used by communicators, had not yet been encoded in a dataset that can be used for machine learning. To fill this gap, we enriched the Frame2 dataset with annotations of gestures used for turn organization. The Frame2 dataset features 10 episodes from the Brazilian TV series Pedro Pelo Mundo annotated for semantic frames evoked in both video and text. This dataset allowed us to closely observe how communicators use interactive gestures outside a laboratory, in settings, to our knowledge, not previously recorded in related literature. 
Our results have confirmed that communicators involved in face-to-face conversation make use of gestures as a tool for passing, taking and keeping conversational turns, and also revealed variations of some gestures that had not been documented before. We propose that the use of these gestures arises from the conceptualization of pragmatic frames, involving mental spaces, blending and conceptual metaphors. In addition, our data demonstrate that the annotation of pragmatic frames contributes to a deeper understanding of human cognition and language.[28] Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization
Chuyuan Li,Austin Xu,Shafiq Joty,Giuseppe Carenini
Main category: cs.CL
TL;DR: This paper introduces a topic-guided reinforcement learning method to improve multi-document summarization, showing better performance by leveraging topical cues.
Details
Motivation: The key challenge in Multi-Document Summarization (MDS) is integrating information from multiple sources coherently and topically. While large language models perform well in single-document summarization, their performance on MDS needs improvement. This paper aims to address this by leveraging topical cues. Method: The authors propose a novel topic reward within the Group Relative Policy Optimization (GRPO) framework to improve content selection in multi-document summarization. They explicitly prompt models with topic labels to enhance summary informativeness. Result: Experimental results on the Multi-News and Multi-XScience datasets show that the proposed method consistently outperforms strong baselines, demonstrating the effectiveness of incorporating topical alignment in MDS. Conclusion: The proposed topic-guided reinforcement learning approach enhances the performance of multi-document summarization by leveraging topical cues, as demonstrated on the Multi-News and Multi-XScience datasets. Abstract: A key challenge in Multi-Document Summarization (MDS) is effectively integrating information from multiple sources while maintaining coherence and topical relevance. While Large Language Models have shown impressive results in single-document summarization, their performance on MDS still leaves room for improvement. In this paper, we propose a topic-guided reinforcement learning approach to improve content selection in MDS. We first show that explicitly prompting models with topic labels enhances the informativeness of the generated summaries. Building on this insight, we propose a novel topic reward within the Group Relative Policy Optimization (GRPO) framework to measure topic alignment between the generated summary and source documents. 
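The reward-then-optimize idea can be sketched as follows. GRPO scores each sampled output relative to the other outputs in its group by normalizing rewards with the group mean and standard deviation. The topic reward below is a hypothetical stand-in (Jaccard overlap of topic-label sets), since the abstract does not specify how topic alignment is computed; the labels and function names are likewise assumptions.

```python
import statistics


def topic_overlap_reward(summary_topics, source_topics):
    """Hypothetical topic reward: Jaccard overlap between two label sets."""
    summary_set, source_set = set(summary_topics), set(source_topics)
    union = summary_set | source_set
    if not union:
        return 0.0
    return len(summary_set & source_set) / len(union)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as in GRPO.

    Population standard deviation is used here; implementations differ
    on this detail, so treat it as an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy example: topics of the source documents and of four sampled summaries.
source_topics = ["election", "economy", "health"]
sampled_summary_topics = [
    ["election", "economy"],
    ["sports"],
    ["election", "economy", "health"],
    ["health", "weather"],
]

rewards = [topic_overlap_reward(t, source_topics) for t in sampled_summary_topics]
advantages = group_relative_advantages(rewards)
for topics, reward, adv in zip(sampled_summary_topics, rewards, advantages):
    print(f"{topics}: reward={reward:.3f} advantage={adv:+.3f}")
```

Summaries whose topics match the sources better than the group average receive positive advantages and would be reinforced; those below average receive negative ones.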
Experimental results on the Multi-News and Multi-XScience datasets demonstrate that our method consistently outperforms strong baselines, highlighting the effectiveness of leveraging topical cues in MDS.[29] Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case
Bastián González-Bustamante,Nando Verelst,Carla Cisternas
Main category: cs.CL
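TL;DR: This paper examines the potential and limits of large language models in survey research, finding that they perform very well on some items but still struggle to capture the full nuance of public opinion.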
Details
Motivation: Large language models (LLMs) open opportunities for methodological and applied innovation in survey research, but how reliably they recover aggregate item distributions is uncertain, and they may reproduce social stereotypes and biases. Method: Using ground-truth data from a Chilean public opinion survey, the authors evaluate the reliability of LLM-generated synthetic responses across 128 prompt-model-question triplets, generating 189,696 synthetic profiles, and run a meta-analysis to test for biases along key sociodemographic dimensions. Result: First, synthetic responses perform very well on trust items (F1-score and accuracy > 0.90). Second, GPT-4o, GPT-4o-mini and Llama 4 Maverick perform comparably on this task. Third, synthetic-human alignment is highest among respondents aged 45-59. Conclusion: LLM-generated synthetic samples show promise in approximating responses from a probabilistic sample, but item-level heterogeneity is substantial and careful calibration is needed to reduce errors and algorithmic distortion. Abstract: Large Language Models (LLMs) offer promising avenues for methodological and applied innovations in survey research by using synthetic respondents to emulate human answers and behaviour, potentially mitigating measurement and representation errors. However, the extent to which LLMs recover aggregate item distributions remains uncertain and downstream applications risk reproducing social stereotypes and biases inherited from training data. We evaluate the reliability of LLM-generated synthetic survey responses against ground-truth human responses from a Chilean public opinion probabilistic survey. Specifically, we benchmark 128 prompt-model-question triplets, generating 189,696 synthetic profiles, and pool performance metrics (i.e., accuracy, precision, recall, and F1-score) in a meta-analysis across 128 question-subsample pairs to test for biases along key sociodemographic dimensions. The evaluation spans OpenAI's GPT family and o-series reasoning models, as well as Llama and Qwen checkpoints. Three results stand out. First, synthetic responses achieve excellent performance on trust items (F1-score and accuracy > 0.90). Second, GPT-4o, GPT-4o-mini and Llama 4 Maverick perform comparably on this task. Third, synthetic-human alignment is highest among respondents aged 45-59. Overall, LLM-based synthetic samples approximate responses from a probabilistic sample, though with substantial item-level heterogeneity. 
Capturing the full nuance of public opinion remains challenging and requires careful calibration and additional distributional tests to ensure algorithmic fidelity and reduce errors.[30] Large Language Models Meet Legal Artificial Intelligence: A Survey
Zhitian Hou,Zihan Ye,Nanli Zeng,Tianyong Hao,Kun Zeng
Main category: cs.CL
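TL;DR: This paper surveys the use of LLMs in Legal AI, covering a range of frameworks, benchmarks, and datasets, and discusses open challenges and future directions.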
Details
Motivation: In recent years, large language models (LLMs) have substantially advanced Legal Artificial Intelligence (Legal AI), improving the efficiency and accuracy of legal tasks. The survey aims to advance research on and applications of LLM-based approaches in the legal domain. Method: A comprehensive review of 16 series of legal LLMs and 47 LLM-based frameworks for legal tasks, together with 15 benchmarks and 29 datasets collected for evaluating different legal capabilities. Result: The paper provides a comprehensive review and evaluation resources for LLMs in the legal domain, and analyses the challenges and future directions for LLM-based approaches there. Conclusion: The authors hope the paper offers beginners a systematic introduction and encourages future research in the field. Abstract: Large Language Models (LLMs) have significantly advanced the development of Legal Artificial Intelligence (Legal AI) in recent years, enhancing the efficiency and accuracy of legal tasks. To advance research and applications of LLM-based approaches in the legal domain, this paper provides a comprehensive review of 16 legal LLM series and 47 LLM-based frameworks for legal tasks, and also gathers 15 benchmarks and 29 datasets to evaluate different legal capabilities. Additionally, we analyse the challenges and discuss future directions for LLM-based approaches in the legal domain. We hope this paper provides a systematic introduction for beginners and encourages future research in this field. Resources are available at https://github.com/ZhitianHou/LLMs4LegalAI.[31] CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
Guixian Xu,Zeli Su,Ziyin Zhang,Jianing Liu,XU Han,Ting Zhang,Yushuang Dong
Main category: cs.CL
TL;DR: This paper introduces a new dataset for headline generation in Chinese minority languages to address the lack of relevant corpora and proposes a native speaker-annotated test set as a benchmark for future research.
Details
Motivation: The motivation stems from the challenges faced by minority languages in China due to their unique writing systems, leading to a lack of relevant corpora for supervised tasks like headline generation. Method: The paper describes the creation of a novel dataset, Chinese Minority Headline Generation (CMHG), including 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Mongolian. It also involves the annotation of a high-quality test set by native speakers. Result: The result is the development of the CMHG dataset and a high-quality test set, intended to act as a benchmark for future research in headline generation for Chinese minority languages. Conclusion: The paper concludes that the introduced dataset can serve as a valuable resource for advancing headline generation in Chinese minority languages and hopes it will contribute to the development of related benchmarks. Abstract: Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.[32] Unsupervised Hallucination Detection by Inspecting Reasoning Processes
Ponhvoan Srey,Xiaobao Wu,Anh Tuan Luu
Main category: cs.CL
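TL;DR: IRIS is an efficient unsupervised hallucination detection method that trains on LLM internal representations and response uncertainty, addressing limitations of existing methods.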
Details
Motivation: Existing unsupervised methods rely on proxy signals unrelated to factual correctness, which biases detection and limits generalization. Method: A detection model is trained on the LLM's internal representations, using the uncertainty of each response as a soft pseudolabel. Result: IRIS consistently outperforms existing unsupervised methods across multiple experiments, at low computational cost and even on small datasets. Conclusion: IRIS is a hallucination detection framework that needs no human-annotated data, achieves efficient and accurate detection across a range of scenarios, and is suitable for real-time applications. Abstract: Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework, leveraging internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement, and obtains its contextualized embedding as informative features for training. Meanwhile, the uncertainty of each response is considered a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally low cost, and works well even with few training data, making it suitable for real-time detection.[33] Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs
Adnan Ahmad,Philine Kowol,Stefan Hillmann,Sebastian Möller
Main category: cs.CL
TL;DR: Mistral-7B-v0.1 is best among open-source LLMs for few-shot multi-label intent classification on MultiWOZ 2.1, but BERT-based supervised models perform better.
Details
Motivation: To evaluate open-source LLMs for multi-label intent classification on consumer hardware and improve NLU in task-oriented chatbots. Method: Analyzed Mistral-7B-v0.1, LLama2-7B-hf, and Yi-6B in a few-shot setup on MultiWOZ 2.1 dataset; compared with BERT-based classifier using accuracy, precision, recall, F1 scores, and resource requirements. Result: Mistral-7B-v0.1 outperformed other generative models with an F-score weighted average of 0.50, lower Hamming Loss, and higher Jaccard Similarity. Conclusion: BERT-based supervised classifier outperforms few-shot generative LLMs, but Mistral-7B-v0.1 excels among open-source LLMs in multi-label intent classification. Abstract: In this paper, we provide an extensive analysis of multi-label intent classification using Large Language Models (LLMs) that are open-source, publicly available, and can be run in consumer hardware. We use the MultiWOZ 2.1 dataset, a benchmark in the dialogue system domain, to investigate the efficacy of three popular open-source pre-trained LLMs, namely LLama2-7B-hf, Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot setup, giving 20 examples in the prompt with some instructions. Our approach focuses on the differences in performance of these models across several performance metrics by methodically assessing these models on multi-label intent classification tasks. Additionally, we compare the performance of the instruction-based fine-tuning approach with supervised learning using the smaller transformer model BertForSequenceClassification as a baseline. To evaluate the performance of the models, we use evaluation metrics like accuracy, precision, and recall as well as micro, macro, and weighted F1 score. We also report the inference time, VRAM requirements, etc. The Mistral-7B-v0.1 outperforms two other generative models on 11 intent classes out of 14 in terms of F-Score, with a weighted average of 0.50. 
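For readers unfamiliar with the multi-label metrics reported for this comparison, the following minimal sketch computes Hamming loss and sample-averaged Jaccard similarity on binary label matrices. The toy labels and the convention of scoring an empty union as 1.0 are assumptions for illustration, not details taken from the paper.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def jaccard_similarity(y_true, y_pred):
    """Mean over examples of |intersection| / |union| of the active labels."""
    scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(true_row, pred_row) if t and p)
        union = sum(1 for t, p in zip(true_row, pred_row) if t or p)
        # Assumed convention: two empty label sets count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


# Toy example (assumed): three utterances, four possible intents each.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1]]

hl = hamming_loss(y_true, y_pred)
js = jaccard_similarity(y_true, y_pred)
print(hl, js)
```

A lower Hamming loss means fewer individual intent labels are wrong, while a higher Jaccard similarity means the predicted intent set overlaps more with the gold set for each utterance.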
It also has a relatively lower Hamming Loss and higher Jaccard Similarity, making it the winning model in the few-shot setting. We find that the BERT-based supervised classifier outperforms the best-performing few-shot generative LLM. The study provides a framework for small open-source LLMs in detecting complex multi-intent dialogues, enhancing the Natural Language Understanding aspect of task-oriented chatbots.[34] Linguistic trajectories of bipolar disorder on social media
Laurin Plank,Armin Zlomuzica
Main category: cs.CL
TL;DR: This study uses social media data to identify long-term language patterns associated with bipolar disorder, showing recurring mood-related changes over decades and validating the use of social media for large-scale mental health monitoring.
Details
Motivation: Traditional clinical assessments of affective disorders like BD are limited in scale and longitudinal scope. Social media offers a high temporal resolution and long-term data source, making it a promising tool for studying mental health trends and language markers. Method: The researchers analyzed language trajectories from social media posts of users diagnosed with BD, comparing them with those of users with unipolar depression (UD) and non-affected users (HC). The analysis spanned 3 years before diagnosis to up to 21 years after diagnosis. They looked for linguistic markers of mood disturbance, psychiatric comorbidity, and other related factors. Result: BD diagnosis is associated with significant linguistic changes reflecting mood disturbances, psychiatric issues, and unusual thought patterns. These changes persist and show recurring mood-related patterns across decades, with a 12-month periodicity suggestive of seasonal mood episodes. There is also trend-level evidence that periodicity is more pronounced in users estimated to be female. Conclusion: The study concludes that language alterations are evident in both the acute and chronic phases of bipolar disorder (BD), and these changes can be monitored over extended periods using social media (SM) data. This validates the use of SM for scalable mental health monitoring. Abstract: Language provides valuable markers of affective disorders such as bipolar disorder (BD), yet clinical assessments remain limited in scale. In response, analyses of social media (SM) language have gained prominence due to their high temporal resolution and longitudinal scope. Here, we introduce a method to determine the timing of users' diagnoses and apply it to study language trajectories from 3 years before to 21 years after BD diagnosis, contrasted with users reporting unipolar depression (UD) and non-affected users (HC). 
We show that BD diagnosis is accompanied by pervasive linguistic alterations reflecting mood disturbance, psychiatric comorbidity, substance abuse, hospitalization, medical comorbidities, unusual thought content, and disorganized thought. We further observe recurring mood-related language changes across two decades after the diagnosis, with a pronounced 12-month periodicity suggestive of seasonal mood episodes. Finally, trend-level evidence suggests an increased periodicity in users estimated to be female. In sum, our findings provide evidence for language alterations in the acute and chronic phase of BD. This validates and extends recent efforts leveraging SM for scalable monitoring of mental health.[35] !MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment
Mohamed Basem,Mohamed Younes,Seif Ahmed,Abdelrahman Moustafa
Main category: cs.CL
TL;DR: !MSA's winning system for the BAREC 2025 Shared Task achieved top results in Arabic readability assessment through a confidence-weighted ensemble of transformer models, diverse loss functions, data augmentation, and post-processing improvements.
Details
Motivation: The motivation was to address the challenges of fine-grained Arabic readability assessment, including severe class imbalance, data scarcity, and the need for robust prediction. Method: The approach involved a confidence-weighted ensemble of four transformer models (AraBERTv2, AraELECTRA, MARBERT, and CAMeLBERT), each fine-tuned with different loss functions. It also included weighted training, advanced preprocessing, SAMER corpus relabeling, synthetic data generation, and targeted post-processing. Result: The system achieved first place in all six tracks, with 87.5% QWK at the sentence level and 87.4% at the document level, along with a 6.3% QWK gain from post-processing. Conclusion: The system demonstrated high effectiveness in Arabic readability assessment, achieving top results in all tracks, and highlighting the importance of model and loss diversity, confidence-informed fusion, and intelligent data augmentation. Abstract: We present !MSA's winning system for the BAREC 2025 Shared Task on fine-grained Arabic readability assessment, achieving first place in six of six tracks. Our approach is a confidence-weighted ensemble of four complementary transformer models (AraBERTv2, AraELECTRA, MARBERT, and CAMeLBERT), each fine-tuned with distinct loss functions to capture diverse readability signals. To tackle severe class imbalance and data scarcity, we applied weighted training, advanced preprocessing, SAMER corpus relabeling with our strongest model, and synthetic data generation via Gemini 2.5 Flash, adding about 10,000 rare-level samples. A targeted post-processing step corrected prediction distribution skew, delivering a 6.3 percent Quadratic Weighted Kappa (QWK) gain. Our system reached 87.5 percent QWK at the sentence level and 87.4 percent at the document level, demonstrating the power of model and loss diversity, confidence-informed fusion, and intelligent augmentation for robust Arabic readability prediction.[36] Established Psychometric vs. Ecologically Valid Questionnaires: Rethinking Psychological Assessments in Large Language Models
Dongmin Choi,Woojung Song,Jongwook Han,Eun-Ju Lee,Yohan Jo
Main category: cs.CL
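TL;DR: The paper argues that applying established psychometric questionnaires to large language models (LLMs) raises ecological validity problems, finds substantial differences between results from established and ecologically valid questionnaires, and therefore cautions against using established psychometric instruments with LLMs.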
Details
Motivation: Established psychometric questionnaires (e.g., BFI, PVQ) are used to assess the personality traits and values of LLMs, but their ecological validity for LLMs is unclear, which may bias the measurements. Method: A comprehensive comparative analysis of established psychometric questionnaires and ecologically valid questionnaires applied to LLMs. Result: Established questionnaires (1) yield LLM profiles that differ substantially from those obtained with ecologically valid questionnaires; (2) have too few items for stable measurement; (3) can create the misleading impression that LLMs possess stable psychological constructs; and (4) yield exaggerated profiles for persona-prompted LLMs. Conclusion: The paper cautions against applying established psychometric questionnaires directly to LLMs, since they lack ecological validity and may lead to misleading conclusions. Abstract: Researchers have applied established psychometric questionnaires (e.g., BFI, PVQ) to measure the personality traits and values reflected in the responses of Large Language Models (LLMs). However, concerns have been raised about applying these human-designed questionnaires to LLMs. One such concern is their lack of ecological validity--the extent to which survey questions adequately reflect and resemble real-world contexts in which LLMs generate texts in response to user queries. However, it remains unclear how established questionnaires and ecologically valid questionnaires differ in their outcomes, and what insights these differences may provide. In this paper, we conduct a comprehensive comparative analysis of the two types of questionnaires. Our analysis reveals that established questionnaires (1) yield substantially different profiles of LLMs from ecologically valid ones, deviating from the psychological characteristics expressed in the context of user queries, (2) suffer from insufficient items for stable measurement, (3) create misleading impressions that LLMs possess stable constructs, and (4) yield exaggerated profiles for persona-prompted LLMs. Overall, our work cautions against the use of established psychological questionnaires for LLMs. Our code will be released upon publication.[37] Querying Climate Knowledge: Semantic Retrieval for Scientific Discovery
Mustapha Adamu,Qi Zhang,Huitong Pan,Longin Jan Latecki,Eduard C. Dragut
Main category: cs.CL
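TL;DR: This paper presents a domain-specific knowledge graph for climate science research that, through semantic queries and integration with large language models, improves the precision and reliability of information retrieval.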
Details
Motivation: The growing complexity and volume of climate science literature make it hard for researchers to find relevant information across models, datasets, regions, and variables. Method: The authors build a knowledge graph (KG) from information extracted from climate publications and broader scientific texts, demonstrate it with Cypher queries, and describe its integration with large language models in RAG systems. Result: The KG supports structured, semantic queries that answer questions such as which models have been validated in a specific region or which datasets are commonly used with certain teleconnection patterns, and it improves transparency and reliability in climate-related question answering. Conclusion: The paper presents a domain-specific KG built from climate science literature, aimed at improving how climate knowledge is accessed and used, and shows its practical value for climate researchers, model developers, and others who rely on accurate scientific information. Abstract: The growing complexity and volume of climate science literature make it increasingly difficult for researchers to find relevant information across models, datasets, regions, and variables. This paper introduces a domain-specific Knowledge Graph (KG) built from climate publications and broader scientific texts, aimed at improving how climate knowledge is accessed and used. Unlike keyword-based search, our KG supports structured, semantic queries that help researchers discover precise connections such as which models have been validated in specific regions or which datasets are commonly used with certain teleconnection patterns. We demonstrate how the KG answers such questions using Cypher queries, and outline its integration with large language models in RAG systems to improve transparency and reliability in climate-related question answering. This work moves beyond KG construction to show its real-world value for climate researchers, model developers, and others who rely on accurate, contextual scientific information.[38] Arabic Large Language Models for Medical Text Generation
Abdulrahman Allam,Seif Ahmed,Ali Hamdi,Ammar Mohammed
Main category: cs.CL
TL;DR: This study improves hospital management systems by fine-tuning large language models to generate accurate Arabic medical advice, with the Mistral-7B model showing the best performance in handling real-world, informal patient queries.
Details
Motivation: The motivation for this study stems from the lack of accurate, real-time medical advice systems for underrepresented languages like Arabic, especially in addressing challenges such as overcrowding, limited resources, and urgent healthcare availability in hospital management systems. Method: The study involved collecting a unique dataset of real-world Arabic medical conversations from social media, preprocessing it to handle multiple dialects, and fine-tuning state-of-the-art generative models like Mistral-7B-Instruct-v0.2, LLaMA-2-7B, and GPT-2 Medium to generate medical advice. The models were evaluated using BERT Score for precision, recall, and F1-scores. Result: The fine-tuned Mistral-7B model outperformed other models, achieving average BERT Score values of 68.5% in precision, 69.08% in recall, and 68.5% in F1-score. The system demonstrated the ability to generate coherent and relevant medical replies to informal inputs. Conclusion: This study concludes that fine-tuning large language models, particularly the Mistral-7B model, can effectively generate accurate and relevant Arabic medical text, thereby enhancing hospital management systems and addressing global healthcare challenges in linguistically diverse environments. Abstract: Efficient hospital management systems (HMS) are critical worldwide to address challenges such as overcrowding, limited resources, and poor availability of urgent health care. Existing methods often lack the ability to provide accurate, real-time medical advice, particularly for irregular inputs and underrepresented languages. To overcome these limitations, this study proposes an approach that fine-tunes large language models (LLMs) for Arabic medical text generation. The system is designed to assist patients by providing accurate medical advice, diagnoses, drug recommendations, and treatment plans based on user input. 
The research methodology required the collection of a unique dataset from social media platforms, capturing real-world medical conversations between patients and doctors. The dataset, which includes patient complaints together with medical advice, was properly cleaned and preprocessed to account for multiple Arabic dialects. Fine-tuning state-of-the-art generative models, such as Mistral-7B-Instruct-v0.2, LLaMA-2-7B, and GPT-2 Medium, optimized the system's ability to generate reliable medical text. Results from evaluations indicate that the fine-tuned Mistral-7B model outperformed the other models, achieving average BERT (Bidirectional Encoder Representations from Transformers) Score values in precision, recall, and F1-scores of 68.5%, 69.08%, and 68.5%, respectively. Comparative benchmarking and qualitative assessments validate the system's ability to produce coherent and relevant medical replies to informal input. This study highlights the potential of generative artificial intelligence (AI) in advancing HMS, offering a scalable and adaptable solution for global healthcare challenges, especially in linguistically and culturally diverse environments.[39] Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records
Abdulrahman Allam,Seif Ahmed,Ali Hamdi,Khaled Shaban
Main category: cs.CL
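TL;DR: This study expands the training data for Arabic medical chatbots with a synthetic data augmentation strategy and shows that it improves language model performance in low-resource medical NLP.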
Details
Motivation: The development of Arabic medical chatbots is constrained by the scarcity of large-scale, high-quality annotated datasets. Method: A scalable synthetic data augmentation strategy expands the training corpus to 100,000 records, using ChatGPT-4o and Gemini 2.5 Pro to generate 80,000 contextually relevant and medically coherent synthetic question-answer pairs. Result: Training on ChatGPT-4o data led to higher F1-scores and fewer hallucinations across all models. Conclusion: Synthetic augmentation is a viable way to enhance domain-specific language models in low-resource medical NLP, paving the way for more inclusive, scalable, and accurate Arabic healthcare chatbot systems. Abstract: The development of medical chatbots in Arabic is significantly constrained by the scarcity of large-scale, high-quality annotated datasets. While prior efforts compiled a dataset of 20,000 Arabic patient-doctor interactions from social media to fine-tune large language models (LLMs), model scalability and generalization remained limited. In this study, we propose a scalable synthetic data augmentation strategy to expand the training corpus to 100,000 records. Using the advanced generative AI systems ChatGPT-4o and Gemini 2.5 Pro, we generated 80,000 contextually relevant and medically coherent synthetic question-answer pairs grounded in the structure of the original dataset. These synthetic samples were semantically filtered, manually validated, and integrated into the training pipeline. We fine-tuned five LLMs, including Mistral-7B and AraGPT2, and evaluated their performance using BERTScore metrics and expert-driven qualitative assessments. To further analyze the effectiveness of synthetic sources, we conducted an ablation study comparing ChatGPT-4o and Gemini-generated data independently. The results showed that ChatGPT-4o data consistently led to higher F1-scores and fewer hallucinations across all models. Overall, our findings demonstrate the viability of synthetic augmentation as a practical solution for enhancing domain-specific language models in low-resource medical NLP, paving the way for more inclusive, scalable, and accurate Arabic healthcare chatbot systems.[40] Prominence-aware automatic speech recognition for conversational speech
Julian Linke,Barbara Schuppler
Main category: cs.CL
TL;DR: This paper explores prominence-aware ASR for Austrian German, achieving high prominence detection accuracy without compromising ASR performance.
Details
Motivation: To explore prominence-aware ASR by combining prominence detection and speech recognition for conversational Austrian German. Method: Fine-tuned wav2vec2 models were used for word-level prominence detection, and the annotations were used to train prominence-aware ASR systems. Result: The integration of prominence information showed no change in ASR performance but achieved 85.53% prominence detection accuracy when the word sequence was correct. Conclusion: Transformer-based models can effectively encode prosodic information, contributing to prosody-enhanced ASR with potential applications in linguistic research and dialogue systems. Abstract: This paper investigates prominence-aware automatic speech recognition (ASR) by combining prominence detection and speech recognition for conversational Austrian German. First, prominence detectors were developed by fine-tuning wav2vec2 models to classify word-level prominence. The detector was then used to automatically annotate prosodic prominence in a large corpus. Based on those annotations, we trained novel prominence-aware ASR systems that simultaneously transcribe words and their prominence levels. The integration of prominence information did not change performance compared to our baseline ASR system, while reaching a prominence detection accuracy of 85.53% for utterances where the recognized word sequence was correct. This paper shows that transformer-based models can effectively encode prosodic information and represents a novel contribution to prosody-enhanced ASR, with potential applications for linguistic research and prosody-informed dialogue systems.[41] Population-Aligned Persona Generation for LLM-based Social Simulation
Zhengyu Hu,Zheyuan Xiao,Max Xiong,Yuxuan Lei,Tianfu Wang,Jianxun Lian,Kaize Ding,Ziang Xiao,Nicholas Jing Yuan,Xing Xie
Main category: cs.CL
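TL;DR: This paper proposes a method that uses large language models to generate high-quality personas from social media data and aligns them with psychometric distributions via importance sampling, reducing bias in social simulation.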
Details
Motivation: Existing LLM-based social simulation studies focus mainly on designing agentic frameworks and simulation environments, often overlooking the complexity of persona generation and the bias that unrepresentative persona sets can introduce. Persona sets are therefore needed that authentically reflect the diversity and distribution of real-world populations. Method: LLMs first generate narrative personas from long-term social media data, followed by rigorous quality assessment to filter out low-fidelity profiles. Importance sampling is then applied to achieve global alignment with reference psychometric distributions (such as the Big Five personality traits), and a task-specific module adapts the persona set to targeted subpopulations for specific simulation contexts. Result: Experiments show that the method significantly reduces population-level bias and enables accurate, flexible social simulation for a range of research and policy applications. Conclusion: The paper proposes a systematic framework for synthesizing high-quality, population-aligned persona sets for LLM-driven social simulation, significantly reducing population-level bias and enabling accurate, flexible social simulation for a wide range of research and policy applications. Abstract: Recent advances in large language models (LLMs) have enabled human-like social simulations at unprecedented scale and fidelity, offering new opportunities for computational social science. A key challenge, however, is the construction of persona sets that authentically represent the diversity and distribution of real-world populations. Most existing LLM-based social simulation studies focus primarily on designing agentic frameworks and simulation environments, often overlooking the complexities of persona generation and the potential biases introduced by unrepresentative persona sets. In this paper, we propose a systematic framework for synthesizing high-quality, population-aligned persona sets for LLM-driven social simulation. Our approach begins by leveraging LLMs to generate narrative personas from long-term social media data, followed by rigorous quality assessment to filter out low-fidelity profiles. We then apply importance sampling to achieve global alignment with reference psychometric distributions, such as the Big Five personality traits. To address the needs of specific simulation contexts, we further introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations. Extensive experiments demonstrate that our method significantly reduces population-level bias and enables accurate, flexible social simulation for a wide range of research and policy applications.[42] Towards Reliable and Interpretable Document Question Answering via VLMs
Alessio Chen,Simone Giovannini,Andrea Gemelli,Fabio Coppini,Simone Marinai
Main category: cs.CL
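TL;DR: This paper presents DocExplainerV0, a module for improving spatial localization and interpretability in document information extraction, and exposes the localization-accuracy problems of current VLMs.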
Details
Motivation: Although VLMs perform well at document understanding, accurately localizing answers within documents remains a major challenge, limiting interpretability and real-world applicability. Method: The authors introduce DocExplainerV0, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization and works with existing VLMs, including proprietary systems that cannot be fine-tuned. Result: Systematic evaluation reveals a gap between textual accuracy and spatial grounding: localization can be unreliable even when the answer is correct. Conclusion: DocExplainerV0 shows potential for improving the interpretability and robustness of document information extraction, and the work establishes a standardized framework and benchmark for future research. Abstract: Vision-Language Models (VLMs) have shown strong capabilities in document understanding, particularly in identifying and extracting textual information from complex documents. Despite this, accurately localizing answers within documents remains a major challenge, limiting both interpretability and real-world applicability. To address this, we introduce DocExplainerV0, a plug-and-play bounding-box prediction module that decouples answer generation from spatial localization. This design makes it applicable to existing VLMs, including proprietary systems where fine-tuning is not feasible. Through systematic evaluation, we provide quantitative insights into the gap between textual accuracy and spatial grounding, showing that correct answers often lack reliable localization. Our standardized framework highlights these shortcomings and establishes a benchmark for future research toward more interpretable and robust document information extraction VLMs.[43] Benchmark of stylistic variation in LLM-generated texts
Jiří Milička,Anna Marklová,Václav Cvrček
Main category: cs.CL
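TL;DR: This paper uses multidimensional analysis to compare texts generated by large language models with human writing and builds an interpretable benchmark for evaluating models.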
Details
Motivation: Because non-English languages and various text types are underrepresented in LLM training data, the study sets out to analyze systematically how LLM-generated texts differ from human writing. Method: Biber's multidimensional analysis (MDA) is applied to human-written and AI-generated texts, drawing on the AI-Brown and AI-Koditex corpora and covering 16 frontier models under various settings and prompts. Result: The study identifies the dimensions in which LLMs differ most significantly and most systematically from human writing, and replicates the analysis across languages (English and Czech). Conclusion: LLMs differ significantly and systematically from human writing across languages and text types; the resulting benchmark lets models be compared and ranked along interpretable dimensions. Abstract: This study investigates the register variation in texts written by humans and comparable texts produced by large language models (LLMs). Biber's multidimensional analysis (MDA) is applied to a sample of human-written texts and AI-created texts generated to be their counterparts to find the dimensions of variation in which LLMs differ most significantly and most systematically from humans. As textual material, a new LLM-generated corpus AI-Brown is used, which is comparable to BE-21 (a Brown family corpus representing contemporary British English). Since all languages except English are underrepresented in the training data of frontier LLMs, similar analysis is replicated on Czech using the AI-Koditex corpus and a Czech multidimensional model. Examined were 16 frontier models in various settings and prompts, with emphasis placed on the difference between base models and instruction-tuned models. Based on this, a benchmark is created through which models can be compared with each other and ranked in interpretable dimensions.[44] Incongruent Positivity: When Miscalibrated Positivity Undermines Online Supportive Conversations
Leen Almajed,Abeer ALdayel
Main category: cs.CL
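TL;DR: This study examines how LLM-generated responses in emotionally supportive conversations can show miscalibrated positivity, especially in high-stakes contexts. It argues for moving beyond generic positive responses toward congruent support, and develops a weakly supervised multilabel classifier ensemble to detect types of incongruent positivity.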
Details
Motivation: In emotionally supportive conversations, overly positive responses can backfire, for example by sounding dismissive or unrealistically optimistic. This phenomenon, called incongruent positivity, needs further study, especially in high-stakes contexts. Method: Real user-assistant dialogues were collected from Reddit, and additional responses were generated with large language models. Conversations are split by emotional intensity into two levels: Mild (relationship tension and general advice) and Severe (grief and anxiety). The study also fine-tunes LLMs and develops a weakly supervised multilabel classifier ensemble (DeBERTa and MentalBERT) to detect types of incongruent positivity. Result: The analysis shows that LLMs are more prone to unrealistic positivity, particularly in high-stakes contexts. The weakly supervised multilabel classifier ensemble shows improved detection of incongruent positivity types. Conclusion: The study highlights that LLM-generated responses in emotionally supportive conversations can show miscalibrated positivity, particularly in high-stakes contexts. It argues for moving beyond generic positive responses and studying congruent support measures that balance positive affect with emotional acknowledgment. Abstract: In emotionally supportive conversations, well-intended positivity can sometimes misfire, leading to responses that feel dismissive, minimizing, or unrealistically optimistic. We examine this phenomenon of incongruent positivity as miscalibrated expressions of positive support in both human and LLM-generated responses. To this end, we collected real user-assistant dialogues from Reddit across a range of emotional intensities and generated additional responses using large language models for the same context. We categorize these conversations by intensity into two levels: Mild, which covers relationship tension and general advice, and Severe, which covers grief and anxiety conversations. This level of categorization enables a comparative analysis of how supportive responses vary across lower and higher stakes contexts. Our analysis reveals that LLMs are more prone to unrealistic positivity through dismissive and minimizing tone, particularly in high-stakes contexts. To further study the underlying dimensions of this phenomenon, we finetune LLMs on datasets with strong and weak emotional reactions. Moreover, we developed a weakly supervised multilabel classifier ensemble (DeBERTa and MentalBERT) that shows improved detection of incongruent positivity types across two sorts of concerns (Mild and Severe). Our findings shed light on the need to move beyond merely generating generic positive responses and instead study the congruent support measures to balance positive affect with emotional acknowledgment. This approach offers insights into aligning large language models with affective expectations in online supportive dialogue, paving the way toward context-aware and trust-preserving online conversation systems.
[45] Beyond Token Limits: Assessing Language Model Performance on Long Text Classification
Miklós Sebők,Viktor Kovács,Martin Bánóczy,Daniel Møller Eriksen,Nathalie Neptune,Philippe Roussille
Main category: cs.CL
TL;DR: This paper evaluates language models like XLM-RoBERTa, Longformer, and GPT variants for handling long legal texts in a multilingual classification task, finding that Longformer doesn't outperform other models and that open models perform best.
Details
Motivation: The motivation stems from the limitation of widely used language models like BERT and RoBERTa, which struggle with long input texts, especially in domains like law where texts can span hundreds of pages. Method: The paper conducted experiments using XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 models across five languages for the multiclass classification task of the Comparative Agendas Project. The task involved policy topic labeling of long texts such as laws and bills. Result: No significant advantage was observed for the Longformer model in handling long texts. The best open-source model outperformed GPT-3.5 and GPT-4, and performance was influenced by overlaps in support and substance among categories. Conclusion: The study found that Longformer, despite being pre-trained to handle long texts, showed no particular advantage in processing long legal texts for classification. The best-performing open model outperformed GPT variants, and factors like support and substance overlaps between categories were crucial for performance. Abstract: The most widely used large language models in the social sciences (such as BERT, and its derivatives, e.g. RoBERTa) have a limitation on the input text length that they can process to produce predictions. This is a particularly pressing issue for some classification tasks, where the aim is to handle long input texts. One such area deals with laws and draft laws (bills), which can have a length of multiple hundred pages and, therefore, are not particularly amenable for processing with models that can only handle e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, GPT-4 models for the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels from education to health care. 
Results show no particular advantage for the Longformer model, pre-trained specifically for the purposes of handling long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of support and substance overlaps between specific categories when it comes to performance on long text inputs.
[46] SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning
Shengqiang Fu
Main category: cs.CL
TL;DR: SI-FACT is a self-improving framework that enhances the contextual faithfulness of large language models by automatically generating contrastive learning data and training the model to prioritize context over internal knowledge.
Details
Motivation: LLMs often generate unfaithful responses in knowledge-intensive tasks due to reliance on internal parametric knowledge rather than provided context. This work aims to address this issue by improving contextual faithfulness through a self-improving framework. Method: SI-FACT uses a self-instruct mechanism to generate contrastive learning data (anchor, positive, and negative samples) and applies contrastive learning to train the model, enabling it to distinguish faithful from unfaithful responses in the representation space. Result: On ECARE KRE and COSE KRE benchmarks, the SI-FACT model based on Llama3 8B Instruct improved Contextual Recall Rate by 6.2% over the best baseline and significantly reduced dependence on internal memory. Conclusion: The SI-FACT framework effectively enhances the contextual faithfulness of LLMs by reducing reliance on internal parametric knowledge and improving the model's ability to produce contextually accurate responses. Abstract: Large Language Models often generate unfaithful responses in knowledge-intensive tasks due to knowledge conflict, that is, a preference for relying on internal parametric knowledge rather than the provided context. To address this issue, we propose a novel self-improving framework, Self-Improving Faithfulness-Aware Contrastive Tuning. The framework uses a self-instruct mechanism that allows the base LLM to automatically generate high-quality, structured contrastive learning data, including anchor samples, semantically equivalent positive samples, and negative samples simulating unfaithful scenarios. This approach significantly reduces the cost of manual annotation. Subsequently, contrastive learning is applied to train the model, enabling it to pull faithful responses closer and push unfaithful responses farther apart in the representation space. Experiments on knowledge conflict evaluation benchmarks ECARE KRE and COSE KRE show that the SI-FACT model based on Llama3 8B Instruct improves the Contextual Recall Rate by 6.2% over the best baseline method, while significantly reducing dependence on internal memory. The results indicate that SI-FACT provides strong effectiveness and high data efficiency in enhancing the contextual faithfulness of LLMs, offering a practical pathway toward building more proactive and trustworthy language models.
[47] Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs
Yixiao Zhou,Ziyu Zhao,Dongzhou Cheng,zhiliang wu,Jie Gui,Yi Yang,Fei Wu,Yu Cheng,Hehe Fan
Main category: cs.CL
TL;DR: DERN is a retraining-free, task-agnostic framework that efficiently prunes and reconstructs SMoE models, significantly improving performance and reducing memory usage for easier deployment.
Details
Motivation: SMoE architectures, while computationally efficient, require loading all expert parameters, leading to high memory usage. Previous approaches mainly focused on expert-level operations, leaving neuron-level structures underexplored. Method: The DERN framework prunes redundant experts using router statistics, decomposes experts into neuron-level segments, assigns these segments to compatible retained experts, and merges segments to build a compact representation. Result: Experiments on Mixtral, Qwen, and DeepSeek models show that DERN improves performance by over 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity without extra training. Conclusion: DERN is an effective method for pruning and reconstructing SMoE models, improving performance while reducing memory usage and making models easier to deploy. Abstract: Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) due to their computational efficiency. However, though only a few experts are activated for each token, SMoE still requires loading all expert parameters, leading to high memory usage and challenges in deployment. Previous work has tried to reduce the overhead by pruning and merging experts, but primarily focused on expert-level operations, leaving neuron-level structure underexplored. We propose DERN (Dropping Experts, Recombining Neurons), a task-agnostic and retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and contain semantic conflicts at the neuron level, which poses challenges for direct merging. To solve this, DERN works in three steps: it first prunes redundant experts using router statistics; then it decomposes them into neuron-level expert segments, assigning each segment to its most compatible retained expert; and finally, it merges segments within each retained expert to build a compact representation. 
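The three steps above can be pictured with a toy sketch. This is a minimal illustration under stated assumptions, not DERN's actual procedure: the function name `prune_and_recombine`, routing counts as the router statistic, cosine similarity to a retained expert's mean neuron as the compatibility measure, and plain averaging as the merge rule are all hypothetical choices made here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def prune_and_recombine(experts, router_counts, keep):
    """experts: {name: list of neuron weight vectors}; router_counts: {name: tokens routed}."""
    # Step 1 (assumed statistic): keep the experts the router selects most often.
    ranked = sorted(experts, key=lambda name: router_counts[name], reverse=True)
    retained, dropped = ranked[:keep], ranked[keep:]
    centroids = {name: mean_vector(experts[name]) for name in retained}
    assigned = {name: [] for name in retained}
    # Step 2 (assumed measure): treat each neuron of a dropped expert as a segment
    # and send it to the retained expert whose mean neuron is most similar.
    for name in dropped:
        for neuron in experts[name]:
            target = max(retained, key=lambda r: cosine(neuron, centroids[r]))
            assigned[target].append(neuron)
    # Step 3 (assumed rule): average each retained neuron with the mean of the
    # segments that expert received; experts that received nothing are copied.
    merged = {}
    for name in retained:
        if assigned[name]:
            extra = mean_vector(assigned[name])
            merged[name] = [[(w + e) / 2 for w, e in zip(neuron, extra)]
                            for neuron in experts[name]]
        else:
            merged[name] = [list(neuron) for neuron in experts[name]]
    return merged

# Toy data: three experts with two 2-dimensional neurons each.
experts = {
    "e0": [[1.0, 0.0], [0.9, 0.1]],
    "e1": [[0.0, 1.0], [0.1, 0.9]],
    "e2": [[0.8, 0.2], [0.2, 0.8]],
}
router_counts = {"e0": 500, "e1": 400, "e2": 100}
compact = prune_and_recombine(experts, router_counts, keep=2)
```

With these toy inputs the least-routed expert `e2` is dropped, and its two neurons are split between `e0` and `e1` rather than merged into a single expert, which is the neuron-level behaviour the abstract contrasts with expert-level merging.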
Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, without extra training. It also greatly reduces the number of experts and memory usage, making SMoE LLMs easier to deploy in practice.
[48] Is In-Context Learning Learning?
Adrian de Wynter
Main category: cs.CL
TL;DR: This paper examines the effectiveness of in-context learning (ICL) in autoregressive models, finding that while it is an effective learning paradigm, it has limitations in learning and generalizing to unseen tasks due to its ad-hoc encoding mechanism.
Details
Motivation: The paper aims to understand whether ICL truly constitutes learning and how effective it is at solving and generalizing unseen tasks, given claims about the ability of autoregressive models to solve tasks with only a few shots via next-token prediction. Method: The paper conducts a large-scale analysis of ICL by ablating or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. Result: The research finds that ICL is an effective learning paradigm but limited in its ability to learn and generalise to unseen tasks. The study also finds that accuracy becomes insensitive to various factors as exemplars increase and that the models deduce patterns from regularities in the prompt, leading to distributional sensitivity. Conclusion: The paper concludes that in-context learning (ICL) in autoregressive models is limited in its ability to learn and generalise to unseen tasks, and that autoregression's ad-hoc encoding is not a robust mechanism, which suggests limited all-purpose generalisability. Abstract: In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these models' ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL does constitute learning, but its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that ICL is an effective learning paradigm, but limited in its ability to learn and generalise to unseen tasks. 
We note that, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression's ad-hoc encoding is not a robust mechanism and suggests limited all-purpose generalisability.
[49] Long Context Automated Essay Scoring with Language Models
Christopher Ormerod,Gitit Kehat
Main category: cs.CL
TL;DR: This study explores models with architectural modifications to address the length limitations of standard transformer models in Automated Essay Scoring, finding that models like XLNet, Longformer, ModernBERT, Mamba, and Llama can better handle long contexts for improved scoring accuracy.
Details
Motivation: Transformer-based language models are limited in processing text beyond a fixed maximum length, which poses challenges for Automated Essay Scoring where long contexts are necessary to assess organizational elements of the scoring rubric. This necessitates the exploration of models that can handle longer text inputs effectively. Method: The study evaluates several models with architectural modifications to the standard transformer architecture, including XLNet, Longformer, ModernBERT, Mamba, and Llama, using the Kaggle ASAP 2.0 dataset. Result: The evaluation of models like XLNet, Longformer, ModernBERT, Mamba, and Llama showed promising results in overcoming the length constraints of standard transformer models, enhancing their ability to process long contexts for essay scoring. Conclusion: The study concludes that models with architectural modifications, such as XLNet, Longformer, ModernBERT, Mamba, and Llama, can overcome the length limitations of standard transformer models, making them more suitable for Automated Essay Scoring tasks that require processing long contexts. Abstract: Transformer-based language models are architecturally constrained to process text of a fixed maximum length. Essays written by higher-grade students frequently exceed the maximum allowed length for many popular open-source models. A common approach to addressing this issue when using these models for Automated Essay Scoring is to truncate the input text. This raises serious validity concerns as it undermines the model's ability to fully capture and evaluate organizational elements of the scoring rubric, which requires long contexts to assess. In this study, we evaluate several models that incorporate architectural modifications of the standard transformer architecture to overcome these length limitations using the Kaggle ASAP 2.0 dataset. 
The models considered in this study include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama models.
[50] RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment
Shadikur Rahman,Aroosa Hameed,Gautam Srivastava,Syed Muhammad Danish
Main category: cs.CL
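TL;DR: This paper proposes a cloud-edge collaborative architecture for optimizing the reasoning and problem-solving capabilities of LLMs, and demonstrates its strong performance on multi-domain coding tasks.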
Details
Motivation: Existing benchmarks have limitations, so a comprehensive benchmark covering multiple technical domains is needed to evaluate LLMs on coding tasks. Method: Designs a cloud-edge collaborative architecture comprising GuideLLM, SolverLLM, and JudgeLLM, and develops the RefactorCoderQA benchmark. Result: Experiments show that the RefactorCoder-MoE model reaches an overall accuracy of 76.84%, significantly outperforming existing open-source and commercial baselines. Conclusion: The proposed cloud-edge collaborative architecture and the RefactorCoder-MoE model perform well at reasoning and problem solving, with high accuracy and practical relevance. Abstract: To optimize the reasoning and problem-solving capabilities of Large Language Models (LLMs), we propose a novel cloud-edge collaborative architecture that enables a structured, multi-agent prompting framework. This framework comprises three specialized components: GuideLLM, a lightweight model deployed at the edge to provide methodological guidance; SolverLLM, a more powerful model hosted in the cloud responsible for generating code solutions; and JudgeLLM, an automated evaluator for assessing solution correctness and quality. To evaluate and demonstrate the effectiveness of this architecture in realistic settings, we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate and enhance the performance of Large Language Models (LLMs) across multi-domain coding tasks. Motivated by the limitations of existing benchmarks, RefactorCoderQA systematically covers various technical domains, including Software Engineering, Data Science, Machine Learning, and Natural Language Processing, using authentic coding challenges from Stack Overflow. Extensive experiments reveal that our fine-tuned model, RefactorCoder-MoE, achieves state-of-the-art performance, significantly outperforming leading open-source and commercial baselines with an overall accuracy of 76.84%. Human evaluations further validate the interpretability, accuracy, and practical relevance of the generated solutions. In addition, we evaluate system-level metrics, such as throughput and latency, to gain deeper insights into the performance characteristics and trade-offs of the proposed architecture.
[51] DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Rui Lu,Zhenyu Hou,Zihan Wang,Hanchen Zhang,Xiao Liu,Yujiang Li,Shi Feng,Jie Tang,Yuxiao Dong
Main category: cs.CL
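TL;DR: DeepDive improves the deep search ability of large language models by automatically synthesizing complex questions and applying multi-turn reinforcement learning, addressing the shortcomings of existing open-source models in long-horizon reasoning and in the difficulty of available training data.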
Details
Motivation: Augmenting large language models with browsing tools increases their potential as deep search agents for complex real-world tasks, but existing open-source models perform poorly in such settings. Method: Proposes DeepDive, which automatically generates complex questions from open knowledge graphs and applies end-to-end multi-turn reinforcement learning to strengthen long-horizon reasoning. Result: DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1; multi-turn RL training significantly improves deep search ability. Conclusion: DeepDive effectively improves deep search ability and supports test-time scaling of tool calls and parallel sampling; the data and code are publicly available. Abstract: Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs' long-horizon reasoning with deep search. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at https://github.com/THUDM/DeepDive.
[52] WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
Akshat Pandey,Karun Kumar,Raphael Tang
Main category: cs.CL
TL;DR: WhisTLE is a new text-only adaptation method that improves ASR models in new domains, clearly outperforming existing methods without adding runtime cost.
Details
Motivation: Pretrained automatic speech recognition (ASR) models such as Whisper perform well but need domain adaptation to handle unseen vocabulary and parlance, and in many real-world settings collecting speech data is impractical. Method: Proposes WhisTLE, a deeply supervised, text-only adaptation method that uses a variational autoencoder (VAE) to model encoder outputs, optionally combined with text-to-speech (TTS) adaptation. Result: Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios. Conclusion: WhisTLE improves the domain adaptation performance of ASR models without extra runtime cost. Abstract: Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen vocabulary and parlance. In many real-world settings, collecting speech data is impractical, necessitating text-only adaptation. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, incurring no extra runtime cost. Across four out-of-domain datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by 12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines in 27 of 32 scenarios.
cs.CV [Back]
[53] Australian Supermarket Object Set (ASOS): A Benchmark Dataset of Physical Objects and 3D Models for Robotics and Computer Vision
Akansel Cosgun,Lachlan Chumbley,Benjamin J. Meyer
Main category: cs.CV
TL;DR: The paper presents ASOS, a dataset of 50 common supermarket items with high-quality 3D meshes, designed for benchmarking in robotics and computer vision applications.
Details
Motivation: The motivation is to provide a cost-effective, accessible, and comprehensive dataset of common household items for benchmarking in robotics and computer vision, addressing limitations in existing datasets that use synthetic or specialized objects. Method: The 3D meshes are acquired using structure-from-motion techniques with high-resolution imaging to generate watertight meshes. Result: The result is the creation of ASOS, a dataset containing 50 supermarket items across 10 categories, with high-quality 3D textured meshes. Conclusion: ASOS is a valuable dataset for benchmarking in robotics and computer vision due to its focus on accessibility and real-world applicability. Abstract: This paper introduces the Australian Supermarket Object Set (ASOS), a comprehensive dataset comprising 50 readily available supermarket items with high-quality 3D textured meshes designed for benchmarking in robotics and computer vision applications. Unlike existing datasets that rely on synthetic models or specialized objects with limited accessibility, ASOS provides a cost-effective collection of common household items that can be sourced from a major Australian supermarket chain. The dataset spans 10 distinct categories with diverse shapes, sizes, and weights. 3D meshes are acquired using structure-from-motion techniques with high-resolution imaging to generate watertight meshes. The dataset's emphasis on accessibility and real-world applicability makes it valuable for benchmarking object detection, pose estimation, and robotics applications.
[54] A Multimodal RAG Framework for Housing Damage Assessment: Collaborative Optimization of Image Encoding and Policy Vector Retrieval
Jiayi Miao,Dingxin Lu,Zhuqi Wang
Main category: cs.CV
TL;DR: This paper proposes a novel multimodal retrieval-augmented generation framework for assessing post-disaster housing damage, combining visual and textual data to achieve improved accuracy in damage classification and retrieval.
Details
Motivation: Accurate evaluations of housing damage after natural disasters are crucial for insurance claims and resource planning. This work aims to enhance damage assessment accuracy by leveraging both visual and textual data through a novel multimodal framework. Method: The framework uses a two-branch multimodal encoder structure, with a visual encoder (ResNet and Transformer) for image processing and a BERT retriever for text vectorization. It integrates a cross-modal interaction module for semantic alignment and employs a modal attention gating mechanism during generation. The model is trained end-to-end with a multi-task optimization approach combining comparison loss, retrieval loss, and generation loss. Result: The MM-RAG framework achieves improved performance in retrieval accuracy and damage severity classification, with a 9.6% increase in Top-1 retrieval accuracy. Conclusion: The proposed MM-RAG framework demonstrates superior performance in retrieval accuracy and classification of damage severity, showing a 9.6% improvement in Top-1 retrieval accuracy. Abstract: After natural disasters, accurate evaluations of damage to housing are important for insurance claims response and planning of resources. In this work, we introduce a novel multimodal retrieval-augmented generation (MM-RAG) framework. On top of the classical RAG architecture, we devise a two-branch multimodal encoder structure: the image branch employs a visual encoder composed of ResNet and Transformer to extract the characteristics of post-disaster building damage, and the text branch uses a BERT retriever to vectorize posts and insurance policies and to construct a retrievable restoration index. To impose cross-modal semantic alignment, the model integrates a cross-modal interaction module to bridge the semantic representation between image and text via multi-head attention. 
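The cross-modal interaction step above can be sketched as follows. This is a minimal illustration, not the paper's module: it assumes text tokens act as queries over image-patch features, omits the learned query/key/value projections that a trained model would have, and lets each head simply attend over its own slice of the feature dimensions. The function name `cross_attention` and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values, num_heads=2):
    """Each query (text token) attends over keys/values (image patches).

    Assumption: no learned projections; head h uses feature slice
    [h * head_dim, (h + 1) * head_dim) of every vector.
    """
    dim = len(queries[0])
    assert dim % num_heads == 0, "feature size must divide evenly across heads"
    head_dim = dim // num_heads
    outputs = []
    for q in queries:
        fused = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            # Scaled dot-product score between this token and every patch.
            scores = [
                sum(a * b for a, b in zip(q[lo:hi], k[lo:hi])) / math.sqrt(head_dim)
                for k in keys
            ]
            weights = softmax(scores)
            # Weighted sum of patch values, restricted to this head's slice.
            for d in range(lo, hi):
                fused.append(sum(w * v[d] for w, v in zip(weights, values)))
        outputs.append(fused)
    return outputs

# Toy features: two text tokens and three image patches, four dimensions each.
text_tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
image_patches = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
fused = cross_attention(text_tokens, image_patches, image_patches)
```

Each output row is a text token re-expressed as a weighted mix of image-patch features, so a token that resembles a given patch pulls more of that patch's content into its fused representation.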
Meanwhile, in the generation module, the introduced modal attention gating mechanism dynamically controls the role of visual evidence and text prior information during generation. The entire framework is trained end-to-end, combining the comparison loss, the retrieval loss and the generation loss into a multi-task optimization objective, and achieves image understanding and policy matching through collaborative learning. The results demonstrate superior performance in retrieval accuracy and in the classification of damage severity, with Top-1 retrieval accuracy improved by 9.6%.
[55] Improving MLLM Historical Record Extraction with Test-Time Image
Taylor Archibald,Tony Martinez
Main category: cs.CV
TL;DR: A new ensemble framework enhances the accuracy of text extraction from historical documents using Gemini Flash and a custom aligner, achieving better results than traditional methods.
Details
Motivation: To stabilize text extraction from noisy historical documents, which poses challenges for existing transcription methods. Method: An ensemble framework using Gemini 2.0 Flash to transcribe augmented variants of images, combined with a Needleman-Wunsch-style aligner to produce a consensus transcription and confidence score. Result: The method improves transcription accuracy by 4 percentage points compared to a single-shot baseline. Padding and blurring improve accuracy, while grid warp perturbations help distinguish between high and low confidence cases. Conclusion: The ensemble framework improves the accuracy of LLM-based text extraction from noisy historical documents and is scalable and deployable to other document collections. Abstract: We present a novel ensemble framework that stabilizes LLM-based text extraction from noisy historical documents. We transcribe multiple augmented variants of each image with Gemini 2.0 Flash and fuse these outputs with a custom Needleman-Wunsch-style aligner that yields both a consensus transcription and a confidence score. We present a new dataset of 622 Pennsylvania death records, and demonstrate our method improves transcription accuracy by 4 percentage points relative to a single-shot baseline. We find that padding and blurring are the most useful for improving accuracy, while grid warp perturbations are best for separating high and low confidence cases. The approach is simple, scalable, and immediately deployable to other document collections and transcription models.
[56] MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance
Kaikai Zhao,Zhaoxiang Liu,Peng Wang,Xin Wang,Zhicheng Ma,Yajun Xu,Wenjing Zhang,Yibing Nan,Kai Wang,Shiguo Lian
Main category: cs.CV
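TL;DR: This paper introduces MITS, the first large-scale multimodal benchmark dataset designed specifically for intelligent traffic surveillance, and shows its value for improving mainstream large multimodal models.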
Details
Motivation: General-domain large multimodal models perform poorly in intelligent traffic surveillance because dedicated multimodal datasets are lacking. Method: A systematic data generation pipeline produces high-quality image captions and 5 million instruction-following visual question-answer pairs. Result: Experiments show that MITS significantly improves the performance of LLaVA-1.5, LLaVA-1.6, Qwen2-VL, and Qwen2.5-VL. Conclusion: MITS significantly improves mainstream LMMs in ITS applications and provides a valuable resource for ITS and LMM research. Abstract: General-domain large multimodal models (LMMs) have achieved significant advances in various image-text tasks. However, their performance in the Intelligent Traffic Surveillance (ITS) domain remains limited due to the absence of dedicated multimodal datasets. To address this gap, we introduce MITS (Multimodal Intelligent Traffic Surveillance), the first large-scale multimodal benchmark dataset specifically designed for ITS. MITS includes 170,400 independently collected real-world ITS images sourced from traffic surveillance cameras, annotated with eight main categories and 24 subcategories of ITS-specific objects and events under diverse environmental conditions. Additionally, through a systematic data generation pipeline, we generate high-quality image captions and 5 million instruction-following visual question-answer pairs, addressing five critical ITS tasks: object and event recognition, object counting, object localization, background analysis, and event reasoning. To demonstrate MITS's effectiveness, we fine-tune mainstream LMMs on this dataset, enabling the development of ITS-specific applications. Experimental results show that MITS significantly improves LMM performance in ITS applications, increasing LLaVA-1.5's performance from 0.494 to 0.905 (+83.2%), LLaVA-1.6's from 0.678 to 0.921 (+35.8%), Qwen2-VL's from 0.584 to 0.926 (+58.6%), and Qwen2.5-VL's from 0.732 to 0.930 (+27.0%). We release the dataset, code, and models as open-source, providing high-value resources to advance both ITS and LMM research.
[57] Decomposing Visual Classification: Assessing Tree-Based Reasoning in VLMs
Sary Elmansoury,Islam Mesabah,Gerrit Großmann,Peter Neigel,Raj Bhalwankar,Daniel Kondermann,Sebastian J. Vollmer
Main category: cs.CV
TL;DR: This paper explores the use of structured, tree-based reasoning to enhance vision language model performance on visual classification tasks. A decision tree framework was evaluated on GTSRB and CIFAR-10 datasets, with results showing that while the model could understand the tree structure well, it didn't outperform standard zero-shot prompting. However, adding LLM-generated descriptions improved performance for both tree-based and zero-shot methods.
Details
Motivation: The performance of vision language models (VLMs) on fine-grained tasks and large hierarchical label spaces is understudied, and the paper investigates whether structured, tree-based reasoning can enhance VLM performance. Method: A framework that decomposes classification into interpretable decisions using decision trees was introduced and evaluated on fine-grained (GTSRB) and coarse-grained (CIFAR-10) datasets. Result: The model achieved 98.2% accuracy in understanding the tree knowledge, but tree-based reasoning consistently underperformed standard zero-shot prompting. Conclusion: Structured reasoning has limitations in visual classification tasks, and enhancing tree prompts with LLM-generated classes and image descriptions can improve performance. Abstract: Vision language models (VLMs) excel at zero-shot visual classification, but their performance on fine-grained tasks and large hierarchical label spaces is understudied. This paper investigates whether structured, tree-based reasoning can enhance VLM performance. We introduce a framework that decomposes classification into interpretable decisions using decision trees and evaluates it on fine-grained (GTSRB) and coarse-grained (CIFAR-10) datasets. Although the model achieves 98.2% accuracy in understanding the tree knowledge, tree-based reasoning consistently underperforms standard zero-shot prompting. We also explore enhancing the tree prompts with LLM-generated classes and image descriptions to improve alignment. The added description enhances the performance of the tree-based and zero-shot methods. Our findings highlight limitations of structured reasoning in visual classification and offer insights for designing more interpretable VLM systems.
[58] World Modeling with Probabilistic Structure Integration
Klemen Kotar,Wanhee Lee,Rahul Venkatesh,Honglin Chen,Daniel Bear,Jared Watrous,Simon Kim,Khai Loong Aw,Lilian Naing Chen,Stefan Stojanov,Kevin Feigelis,Imran Thobani,Alex Durango,Khaled Jedoui,Atlas Kazemian,Dan Yamins
Main category: cs.CV
TL;DR: The paper introduces PSI, a system for learning world models from data using a three-step cycle involving probabilistic prediction, structure extraction, and integration.
Details
Motivation: The paper aims to learn controllable and promptable world models from data. Method: PSI uses a three-step cycle: Probabilistic prediction, Structure extraction, and Integration. Result: An instance of Psi was trained on 1.4 trillion tokens of internet video data and was able to perform a variety of useful video prediction and understanding inferences. Conclusion: PSI is able to augment capabilities of modeling data and create new control handles through each cycle. Abstract: We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful "intermediate structures", in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles -- akin to an LLM-like universal prompting language. 
We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.[59] Images in Motion?: A First Look into Video Leakage in Collaborative Deep Learning
Md Fazle Rasul,Alanood Alqobaisi,Bruhadeshwar Bezawada,Indrakshi Ray
Main category: cs.CV
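TL;DR: This paper studies privacy leakage of video data through gradient inversion attacks in federated learning and shows how image super-resolution can enhance the quality of the video recovered by the attack.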
Details
Motivation: The privacy-preserving mechanism of federated learning is threatened by gradient inversion attacks, and leakage of video data has not yet been studied. Method: Two common video classification approaches are evaluated, and the effect of image super-resolution techniques on gradient inversion attacks is tested. Result: Using feature extractors increases resilience against gradient inversion attacks, but data leakage can still occur if the classifier lacks sufficient complexity. Conclusion: Video data leakage in federated learning is a viable threat, and the conditions under which it occurs need further study. Abstract: Federated learning (FL) allows multiple entities to train a shared model collaboratively. Its core, privacy-preserving principle is that participants only exchange model updates, such as gradients, and never their raw, sensitive data. This approach is fundamental for applications in domains where privacy and confidentiality are important. However, the security of this very mechanism is threatened by gradient inversion attacks, which can reverse-engineer private training data directly from the shared gradients, defeating the purpose of FL. While the impact of these attacks is known for image, text, and tabular data, their effect on video data remains an unexamined area of research. This paper presents the first analysis of video data leakage in FL using gradient inversion attacks. We evaluate two common video classification approaches: one employing pre-trained feature extractors and another that processes raw video frames with simple transformations. Our initial results indicate that the use of feature extractors offers greater resilience against gradient inversion attacks. We also demonstrate that image super-resolution techniques can enhance the frames extracted through gradient inversion attacks, enabling attackers to reconstruct higher-quality videos. Our experiments validate this across scenarios where the attacker has access to zero, one, or more reference frames from the target environment. We find that although feature extractors make attacks more challenging, leakage is still possible if the classifier lacks sufficient complexity. 
We, therefore, conclude that video data leakage in FL is a viable threat, and the conditions under which it occurs warrant further investigation.[60] A Co-Training Semi-Supervised Framework Using Faster R-CNN and YOLO Networks for Object Detection in Densely Packed Retail Images
Hossein Yazdanjouei,Arash Mansouri,Mohammad Shokouhifar
Main category: cs.CV
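TL;DR: Proposes a semi-supervised co-training framework for object detection in retail environments that combines the strengths of Faster R-CNN and YOLO and uses ensemble learning and hyperparameter optimization to improve performance.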
Details
Motivation: To address the object detection challenges posed by limited labeled data and complex conditions in densely packed retail environments. Method: Faster R-CNN and YOLO are combined for object detection, an ensemble of XGBoost, Random Forest, and SVM strengthens classification, and a metaheuristic-driven algorithm optimizes hyperparameters. Result: Experiments on the SKU-110k dataset demonstrate the effectiveness and scalability of the method, particularly when handling occluded and overlapping objects. Conclusion: The proposed semi-supervised co-training framework improves object detection accuracy in densely packed retail environments while reducing reliance on manually labeled data, and suits real-world retail applications such as automated inventory tracking and checkout systems. Abstract: This study proposes a semi-supervised co-training framework for object detection in densely packed retail environments, where limited labeled data and complex conditions pose major challenges. The framework combines Faster R-CNN (utilizing a ResNet backbone) for precise localization with YOLO (employing a Darknet backbone) for global context, enabling mutual pseudo-label exchange that improves accuracy in scenes with occlusion and overlapping objects. To strengthen classification, it employs an ensemble of XGBoost, Random Forest, and SVM, utilizing diverse feature representations for higher robustness. Hyperparameters are optimized using a metaheuristic-driven algorithm, enhancing precision and efficiency across models. By minimizing reliance on manual labeling, the approach reduces annotation costs and adapts effectively to frequent product and layout changes common in retail. Experiments on the SKU-110k dataset demonstrate strong performance, highlighting the scalability and practicality of the proposed framework for real-world retail applications such as automated inventory tracking, product monitoring, and checkout systems.[61] Purge-Gate: Backpropagation-Free Test-Time Adaptation for Point Clouds Classification via Token Purging
Moslem Yazdanpanah,Ali Bahri,Mehrdad Noori,Sahar Dastani,Gustavo Adolfo Vargas Hakim,David Osowiechi,Ismail Ben Ayed,Christian Desrosiers
Main category: cs.CV
TL;DR: This paper introduces Token Purging (PG), a novel backpropagation-free method for test-time adaptation in 3D point cloud classification that improves accuracy, efficiency, and memory usage compared to existing approaches.
Details
Motivation: To mitigate performance degradation caused by distribution shifts in 3D point cloud classification, particularly through a backpropagation-free, efficient, and robust adaptation method. Method: Token Purging (PG) operates at the token level, removing tokens affected by domain shifts before they reach attention layers. Two variants are proposed: PG-SP, which uses source statistics, and PG-SF, a source-free version using CLS-token-driven adaptation. Result: PG-SP achieves an average of +10.3% higher accuracy than state-of-the-art backpropagation-free methods, while PG-SF sets new benchmarks for source-free adaptation. PG is also 12.4 times faster and 5.5 times more memory efficient than the baseline. Conclusion: Token Purging (PG) is an efficient and effective approach for test-time adaptation in 3D point cloud classification, with two variants, PG-SP and PG-SF, both showing superior performance compared to existing methods. Abstract: Test-time adaptation (TTA) is crucial for mitigating performance degradation caused by distribution shifts in 3D point cloud classification. In this work, we introduce Token Purging (PG), a novel backpropagation-free approach that removes tokens highly affected by domain shifts before they reach attention layers. Unlike existing TTA methods, PG operates at the token level, ensuring robust adaptation without iterative updates. We propose two variants: PG-SP, which leverages source statistics, and PG-SF, a fully source-free version relying on CLS-token-driven adaptation. Extensive evaluations on ModelNet40-C, ShapeNet-C, and ScanObjectNN-C demonstrate that PG-SP achieves an average of +10.3% higher accuracy than state-of-the-art backpropagation-free methods, while PG-SF sets new benchmarks for source-free adaptation. Moreover, PG is 12.4 times faster and 5.5 times more memory efficient than our baseline, making it suitable for real-world deployment. 
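To make the idea of purging shifted tokens before attention concrete, here is a minimal Python sketch. The abstract does not state how tokens are scored, so the mean squared z-score against source statistics, the `keep_ratio` parameter, and the function name `purge_tokens` are all assumptions made for illustration; this is not the authors' PG-SP rule.

```python
import math

def purge_tokens(tokens, src_mean, src_std, keep_ratio=0.75):
    """Drop the tokens that deviate most from source feature statistics.

    tokens: list of feature vectors (lists of floats), one per token.
    src_mean, src_std: per-dimension source statistics (hypothetical inputs).
    Returns the indices of the kept tokens, in their original order.
    """
    def shift_score(tok):
        # Mean squared z-score: large when the token looks unlike the source.
        # This scoring rule is an assumption, not taken from the paper.
        return sum(((x - m) / (s + 1e-8)) ** 2
                   for x, m, s in zip(tok, src_mean, src_std)) / len(tok)

    n_keep = max(1, math.ceil(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: shift_score(tokens[i]))
    return sorted(ranked[:n_keep])

tokens = [[0.1, -0.2], [5.0, 5.0], [0.0, 0.3], [-0.1, 0.1]]
kept = purge_tokens(tokens, src_mean=[0.0, 0.0], src_std=[1.0, 1.0])
print(kept)  # the outlier token at index 1 is removed
```

Because the selection needs only a forward statistic per token, no gradients or iterative updates are involved, which matches the backpropagation-free setting the abstract describes.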
Code is available at https://github.com/MosyMosy/Purge-Gate[62] Fine-Grained Cross-View Localization via Local Feature Matching and Monocular Depth Priors
Zimin Xia,Chenghao Xu,Alexandre Alahi
Main category: cs.CV
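TL;DR: Proposes an accurate and interpretable fine-grained cross-view localization method that directly matches local features between ground-level and aerial images and uses monocular depth priors, avoiding the information loss of conventional methods.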
Details
Motivation: Conventional methods lose information when transforming the ground image into a bird's-eye view (BEV) representation, so a more accurate and interpretable cross-view localization method is needed. Method: Correspondences are established directly between ground and aerial images, and the matched keypoints are lifted to BEV space using monocular depth priors to perform fine-grained cross-view localization. Result: Experiments show that, under weak supervision, the method learns accurate local feature correspondences and achieves superior localization performance under challenging conditions such as cross-area generalization and unknown orientation. Conclusion: The method is highly interpretable and accurate, and is well suited to real-world deployment. Abstract: We propose an accurate and highly interpretable fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image by matching its local features with a reference aerial image. Previous methods typically transform the ground image into a bird's-eye view (BEV) representation and then align it with the aerial image for localization. However, this transformation often leads to information loss due to perspective distortion or compression of height information, thereby degrading alignment quality with the aerial view. In contrast, our method directly establishes correspondences between ground and aerial images and lifts only the matched keypoints to BEV space using monocular depth priors. Notably, modern depth predictors can provide reliable metric depth when the test samples are similar to the training data. When the depth distribution differs, they still produce consistent relative depth, i.e., depth accurate up to an unknown scale. Our method supports both metric and relative depth. It employs a scale-aware Procrustes alignment to estimate the camera pose from the correspondences and optionally recover the scale when using relative depth. Experimental results demonstrate that, with only weak supervision on camera pose, our method learns accurate local feature correspondences and achieves superior localization performance under challenging conditions, such as cross-area generalization and unknown orientation. Moreover, our method is compatible with various relative depth models without requiring per-model finetuning. 
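The scale-aware Procrustes step mentioned in the abstract can be illustrated with a small closed-form fit. The sketch below is a generic, unweighted 2D least-squares alignment written with complex numbers; it is not the authors' implementation, and the function name `procrustes_2d` and the `estimate_scale` flag are illustrative assumptions. Source points stand for keypoints lifted to BEV in the camera frame, and target points for their matches in the aerial image.

```python
import cmath
import math

def procrustes_2d(src, dst, estimate_scale=False):
    """Least-squares fit of dst ~ s * R(theta) * src + t for 2D point pairs.

    Points are (x, y) tuples. Returns (scale, theta, (tx, ty)).
    With estimate_scale=False the scale is fixed to 1 (metric-depth case);
    with estimate_scale=True it is recovered too (relative-depth case).
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched point pairs")
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    # Cross-correlation of the centred point sets; its phase is the rotation.
    corr = sum((pi - p_mean).conjugate() * (qi - q_mean) for pi, qi in zip(p, q))
    energy = sum(abs(pi - p_mean) ** 2 for pi in p)
    if energy == 0 or corr == 0:
        raise ValueError("degenerate point configuration")
    theta = cmath.phase(corr)
    scale = abs(corr) / energy if estimate_scale else 1.0
    a = cmath.rect(scale, theta)      # combined scale and rotation
    t = q_mean - a * p_mean           # translation that aligns the centroids
    return scale, theta, (t.real, t.imag)

# Points rotated by 90 degrees, scaled by 2, and shifted by (3, -1).
src = [(0, 0), (1, 0), (0, 1)]
dst = [(3, -1), (3, 1), (1, -1)]
scale, theta, (tx, ty) = procrustes_2d(src, dst, estimate_scale=True)
```

The translation and rotation together give a 3-DoF pose (x, y, yaw); the optional scale is what lets the same solver accept depth that is only accurate up to an unknown factor.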
This flexibility across depth models, combined with strong localization performance, makes it well-suited for real-world deployment.[63] Early Detection of Visual Impairments at Home Using a Smartphone Red-Eye Reflex Test
Judith Massmann,Alexander Lichtenstein,Francisco M. López
Main category: cs.CV
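TL;DR: This paper presents KidsVisionCheck, an application that screens children's vision using red-eye reflex images; its deep learning model reaches 90% accuracy on test data, offering a feasible route to widely accessible pediatric vision screening.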
Details
Motivation: Recent technological advances in smartphones and artificial intelligence make it possible to recreate the Bruckner test on a mobile device. Method: A deep neural network model is trained on children's pupil images collected and labeled by an ophthalmologist. Result: The model reaches 90% accuracy on unseen test data, and the optimal conditions for data collection can be identified to provide immediate feedback. Conclusion: The study is an important first step toward accessible pediatric vision screening and early intervention for vision abnormalities, offering high accuracy without specialist equipment. Abstract: Numerous visual impairments can be detected in red-eye reflex images from young children. The so-called Bruckner test is traditionally performed by ophthalmologists in clinical settings. Thanks to the recent technological advances in smartphones and artificial intelligence, it is now possible to recreate the Bruckner test using a mobile device. In this paper, we present a first study conducted during the development of KidsVisionCheck, a free application that can perform vision screening with a mobile device using red-eye reflex images. The underlying model relies on deep neural networks trained on children's pupil images collected and labeled by an ophthalmologist. With an accuracy of 90% on unseen test data, our model provides highly reliable performance without the necessity of specialist equipment. Furthermore, we can identify the optimal conditions for data collection, which can in turn be used to provide immediate feedback to the users. In summary, this work marks a first step toward accessible pediatric vision screenings and early intervention for vision abnormalities worldwide.[64] DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception
Tim Broedermann,Christos Sakaridis,Luigi Piccinelli,Wim Abbeloos,Luc Van Gool
Main category: cs.CV
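TL;DR: This paper proposes DGFusion, a depth-guided multimodal fusion method that uses depth information from lidar to dynamically adapt multi-sensor fusion, achieving state-of-the-art semantic perception performance for autonomous driving.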
Details
Motivation: Existing sensor fusion methods for semantic perception typically treat sensor data uniformly across the spatial extent of the input, which limits performance under challenging conditions, so a new multimodal fusion method that incorporates depth information is needed. Method: A depth-guided multimodal fusion method is proposed that treats multimodal segmentation as a multi-task problem, uses lidar measurements both as an input and as ground truth for learning depth, learns depth-aware features through an auxiliary depth head, and combines a global condition token with local depth tokens for dynamic sensor fusion. Result: DGFusion achieves state-of-the-art panoptic and semantic segmentation performance on the MUSES and DELIVER datasets, and a robust depth loss is proposed for sparse and noisy lidar inputs. Conclusion: DGFusion achieves state-of-the-art panoptic and semantic segmentation performance, demonstrating its effectiveness on the MUSES and DELIVER datasets. Abstract: Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model's inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DELIVER datasets. Code and models will be available at https://github.com/timbroed/DGFusion[65] Patch-based Automatic Rosacea Detection Using the ResNet Deep Learning Framework
Chengyu Yang,Rishik Reddy Yesgari,Chengjun Liu
Main category: cs.CV
TL;DR: This paper presents patch-based strategies for automatic rosacea detection using ResNet-18, enhancing model focus on relevant regions while preserving privacy.
Details
Motivation: Rosacea often requires precise and early detection to significantly improve treatment effectiveness. Method: This paper presents new patch-based automatic rosacea detection strategies using the ResNet-18 deep learning framework. Result: The experimental results indicate that the proposed patch-based strategies guide the deep learning model to focus on clinically relevant regions, enhance robustness and interpretability, and protect patient privacy. Conclusion: The proposed patch-based strategies offer practical insights for improving automated dermatological diagnostics. Abstract: Rosacea, which is a chronic inflammatory skin condition that manifests with facial redness, papules, and visible blood vessels, often requires precise and early detection to significantly improve treatment effectiveness. This paper presents new patch-based automatic rosacea detection strategies using the ResNet-18 deep learning framework. The contributions of the proposed strategies come from the following aspects. First, image patches of various sizes, shapes, and locations are extracted from facial images. Second, a number of investigative studies are carried out to evaluate how the localized visual information influences the deep learning model performance. Third, thorough experiments are implemented to reveal that several patch-based automatic rosacea detection strategies achieve accuracy and sensitivity competitive with or superior to the full-image based methods. And finally, the proposed patch-based strategies, which use only localized patches, inherently preserve patient privacy by excluding any identifiable facial features from the data. The experimental results indicate that the proposed patch-based strategies guide the deep learning model to focus on clinically relevant regions, enhance robustness and interpretability, and protect patient privacy. 
As a result, the proposed strategies offer practical insights for improving automated dermatological diagnostics.[66] Privacy-Preserving Automated Rosacea Detection Based on Medically Inspired Region of Interest Selection
Chengyu Yang,Rishik Reddy Yesgari,Chengjun Liu
Main category: cs.CV
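TL;DR: This paper proposes a privacy-preserving automated rosacea detection method that is guided by clinical priors and trained entirely on synthetic data. By constructing a fixed redness-informed mask and combining it with a ResNet-18 deep learning model, the method performs well on real-world test data and shows promise for privacy-sensitive telemedicine and large-scale screening applications.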
Details
Motivation: Rosacea is a common but often overlooked inflammatory skin condition whose automated detection is hampered by diffuse symptoms, scarce labeled data, and privacy concerns about facial images. A privacy-preserving and effective detection method is therefore needed. Method: The paper proposes a new rosacea detection method that first constructs a fixed redness-informed mask by selecting regions with consistently high red channel intensity across facial images, focusing on diagnostically relevant areas such as the cheeks, nose, and forehead while excluding identity-revealing features; it then trains a ResNet-18 deep learning model on the masked synthetic images and evaluates it on real-world test data. Result: Experiments show that the proposed method achieves notable gains in accuracy, recall, and F1 score over the full-face baselines, demonstrating the effectiveness of combining synthetic data with clinical priors for rosacea detection. Conclusion: Combining synthetic data with clinical priors enables accurate and ethical dermatological AI systems, with particular promise for privacy-sensitive applications. Abstract: Rosacea is a common but underdiagnosed inflammatory skin condition that primarily affects the central face and presents with subtle redness, pustules, and visible blood vessels. Automated detection remains challenging due to the diffuse nature of symptoms, the scarcity of labeled datasets, and privacy concerns associated with using identifiable facial images. A novel privacy-preserving automated rosacea detection method inspired by clinical priors and trained entirely on synthetic data is presented in this paper. Specifically, the proposed method, which leverages the observation that rosacea manifests predominantly through central facial erythema, first constructs a fixed redness-informed mask by selecting regions with consistently high red channel intensity across facial images. The mask thus is able to focus on diagnostically relevant areas such as the cheeks, nose, and forehead and exclude identity-revealing features. Second, the ResNet-18 deep learning method, which is trained on the masked synthetic images, achieves superior performance over the full-face baselines with notable gains in terms of accuracy, recall and F1 score when evaluated using the real-world test data. The experimental results demonstrate that the synthetic data and clinical priors can jointly enable accurate and ethical dermatological AI systems, especially for privacy sensitive applications in telemedicine and large-scale screening.[67] Investigating the Impact of Various Loss Functions and Learnable Wiener Filter for Laparoscopic Image Desmoking
Chengyu Yang,Chengjun Liu
Main category: cs.CV
TL;DR: This paper evaluates the effectiveness of components in the ULW framework for laparoscopic image desmoking through ablation studies, showing how each part contributes to performance.
Details
Motivation: To rigorously assess the effectiveness and necessity of individual components within the ULW framework for laparoscopic image desmoking. Method: The study conducts a comprehensive ablation analysis on the ULW framework, evaluating the contributions of its components by systematically removing or varying them. The framework is benchmarked on a paired laparoscopic images dataset using quantitative metrics and qualitative visual comparisons. Result: Each component of the ULW framework was evaluated for its specific contribution to overall performance, including the learnable Wiener filter and individual loss terms from the compound loss function. The variants were tested using metrics like SSIM, PSNR, MSE, and CIEDE-2000. Conclusion: The ULW framework's effectiveness and necessity of individual components in laparoscopic image desmoking are validated through systematic ablation studies, demonstrating the importance of each component in enhancing framework performance. Abstract: To rigorously assess the effectiveness and necessity of individual components within the recently proposed ULW framework for laparoscopic image desmoking, this paper presents a comprehensive ablation study. The ULW approach combines a U-Net based backbone with a compound loss function that comprises mean squared error (MSE), structural similarity index (SSIM) loss, and perceptual loss. The framework also incorporates a differentiable, learnable Wiener filter module. In this study, each component is systematically ablated to evaluate its specific contribution to the overall performance of the whole framework. The analysis includes: (1) removal of the learnable Wiener filter, (2) selective use of individual loss terms from the composite loss function. 
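The loss-term ablation described above can be pictured as switching individual weights of a compound loss to zero. The Python sketch below is only an illustration under stated assumptions: images are flat lists of floats in [0, 1], SSIM is computed from whole-image statistics rather than a sliding window, the perceptual term is a caller-supplied function, and the weights `w_mse`, `w_ssim`, and `w_perc` are hypothetical values not taken from the ULW paper.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def global_ssim(a, b, data_range=1.0):
    """SSIM from whole-image statistics (no sliding window), for brevity."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def compound_loss(pred, target, w_mse=1.0, w_ssim=1.0, w_perc=0.0,
                  perceptual=None):
    """Weighted sum of MSE, (1 - SSIM), and an optional perceptual term.

    Setting a weight to zero ablates that term. `perceptual` is a
    hypothetical callable (pred, target) -> float standing in for a
    feature-space distance from a pretrained network.
    """
    loss = w_mse * mse(pred, target)
    loss += w_ssim * (1.0 - global_ssim(pred, target))
    if perceptual is not None and w_perc:
        loss += w_perc * perceptual(pred, target)
    return loss

pred = [0.0, 0.5, 1.0]
target = [0.0, 0.5, 0.0]
full = compound_loss(pred, target)                  # MSE + SSIM terms
mse_only = compound_loss(pred, target, w_ssim=0.0)  # SSIM term ablated
```

Comparing `full` against `mse_only` on the same pair is the smallest version of the ablation: the difference is exactly the contribution of the SSIM term.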
All variants are benchmarked on a publicly available paired laparoscopic images dataset using quantitative metrics (SSIM, PSNR, MSE and CIEDE-2000) alongside qualitative visual comparisons.[68] WAVE-DETR Multi-Modal Visible and Acoustic Real-Life Drone Detector
Razvan Stefanescu,Ethan Oh,Ruben Vazquez,Chris Mesterharm,Constantin Serban,Ritu Chadha
Main category: cs.CV
TL;DR: WAVE-DETR improves drone detection by combining visual and sound data, outperforming existing methods on multi-modal datasets.
Details
Motivation: To address challenges in robust UAV object detection under real-life and difficult environmental conditions by leveraging multi-modal data (RGB images and acoustic signals). Method: WAVE-DETR utilizes Deformable DETR and Wav2Vec2 architectures, fusing visual and acoustic features through four fusion configurations (gated mechanism, linear layer, MLP, cross attention) to improve detection accuracy. Result: The gated fusion approach improved the mAP of Deformable DETR by 11.1% to 15.3% for small drones and provided overall performance gains of 3.27% to 5.84% across all drone sizes on ARDrone datasets. Conclusion: WAVE-DETR is a multi-modal drone detector that effectively combines visual and acoustic signals to enhance UAV object detection performance across varying conditions and datasets. Abstract: We introduce a multi-modal WAVE-DETR drone detector combining visible RGB and acoustic signals for robust real-life UAV object detection. Our approach fuses visual and acoustic features in a unified object detector model relying on the Deformable DETR and Wav2Vec2 architectures, achieving strong performance under challenging environmental conditions. Our work leverages the existing Drone-vs-Bird dataset and the newly generated ARDrone dataset containing more than 7,500 synchronized images and audio segments. We show how the acoustic information is used to improve the performance of the Deformable DETR object detector on the real ARDrone dataset. We developed, trained and tested four different fusion configurations based on a gated mechanism, linear layer, MLP and cross attention. The Wav2Vec2 acoustic embeddings are fused with the multi-resolution feature mappings of the Deformable DETR and enhance the object detection performance across all drone sizes. 
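As a rough illustration of what a gated fusion of visual and acoustic features can look like, the Python sketch below blends two feature vectors with a per-channel sigmoid gate. The abstract does not give the gate's exact form, so the formulation, the assumption that the acoustic embedding is already projected to the visual channel size, and the names `gated_fusion`, `w_gate`, and `b_gate` are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(visual, audio, w_gate, b_gate):
    """Blend two same-length feature vectors with a per-channel gate.

    gate  = sigmoid(W @ [visual; audio] + b)
    fused = gate * visual + (1 - gate) * audio
    w_gate has one row per channel, each of length 2 * len(visual).
    """
    joint = visual + audio  # list concatenation of both modalities
    fused = []
    for c, (v, a) in enumerate(zip(visual, audio)):
        g = sigmoid(sum(w * x for w, x in zip(w_gate[c], joint)) + b_gate[c])
        fused.append(g * v + (1.0 - g) * a)
    return fused

visual = [1.0, 2.0]
audio = [3.0, 0.0]
zero_w = [[0.0] * 4, [0.0] * 4]
# Zero weights and bias give a gate of 0.5, i.e. a plain average.
balanced = gated_fusion(visual, audio, zero_w, [0.0, 0.0])
# A large positive bias saturates the gate and passes the visual features.
visual_only = gated_fusion(visual, audio, zero_w, [50.0, 50.0])
```

In a trained detector the gate weights would be learned, letting the model lean on audio where the image evidence is weak (for example, small or distant drones) and on vision elsewhere.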
The best performer is the gated fusion approach, which improves the mAP of the Deformable DETR object detector on our in-distribution and out-of-distribution ARDrone datasets by 11.1% to 15.3% for small drones across all IoU thresholds between 0.5 and 0.9. The mAP scores for medium and large drones are also enhanced, with overall gains across all drone sizes ranging from 3.27% to 5.84%.[69] Surrogate Supervision for Robust and Generalizable Deformable Image Registration
Yihao Liu,Junyu Chen,Lianrui Zuo,Shuwen Wei,Brian D. Boyd,Carmen Andreescu,Olusola Ajilore,Warren D. Taylor,Aaron Carass,Bennett A. Landman
Main category: cs.CV
TL;DR: This paper introduces surrogate supervision, a training paradigm that improves the robustness and generalizability of deep learning-based deformable image registration by decoupling the input domain from the supervision domain, without increasing complexity.
Details
Motivation: Deep learning-based deformable image registration has achieved strong accuracy but remains sensitive to variations in input image characteristics such as artifacts, field-of-view mismatch, or modality difference. The goal is to develop a general training paradigm that improves the robustness and generalizability of registration networks. Method: The paper introduces surrogate supervision, which decouples the input domain from the supervision domain by applying estimated spatial transformations to surrogate images. This method enables training on heterogeneous inputs while ensuring supervision is computed in domains where similarity is well defined. Result: Surrogate supervision demonstrated strong resilience to input variations including inhomogeneity field, inconsistent field-of-view, and modality differences, while maintaining high performance on well-curated data. The framework was evaluated through three representative applications: artifact-robust brain MR registration, mask-agnostic lung CT registration, and multi-modal MR registration. Conclusion: Surrogate supervision provides a principled framework for training robust and generalizable deep learning-based registration models without increasing complexity. Abstract: Objective: Deep learning-based deformable image registration has achieved strong accuracy, but remains sensitive to variations in input image characteristics such as artifacts, field-of-view mismatch, or modality difference. We aim to develop a general training paradigm that improves the robustness and generalizability of registration networks. Methods: We introduce surrogate supervision, which decouples the input domain from the supervision domain by applying estimated spatial transformations to surrogate images. This allows training on heterogeneous inputs while ensuring supervision is computed in domains where similarity is well defined. 
We evaluate the framework through three representative applications: artifact-robust brain MR registration, mask-agnostic lung CT registration, and multi-modal MR registration. Results: Across tasks, surrogate supervision demonstrated strong resilience to input variations including inhomogeneity field, inconsistent field-of-view, and modality differences, while maintaining high performance on well-curated data. Conclusions: Surrogate supervision provides a principled framework for training robust and generalizable deep learning-based registration models without increasing complexity. Significance: Surrogate supervision offers a practical pathway to more robust and generalizable medical image registration, enabling broader applicability in diverse biomedical imaging scenarios.[70] An Autoencoder and Vision Transformer-based Interpretability Analysis of the Differences in Automated Staging of Second and Third Molars
Barkin Buyukcakir,Jannick De Tobel,Patrick Thevissen,Dirk Vandermeulen,Peter Claes
Main category: cs.CV
TL;DR: This paper introduces a deep learning framework combining a convolutional autoencoder and a Vision Transformer to improve performance and transparency in dental age estimation, particularly addressing the 'black box' issue in high-stakes forensic applications.
Details
Motivation: The motivation is to address the 'black box' nature of deep learning models in high-stakes forensic applications, particularly dental age estimation, by enhancing both performance and transparency. Method: The study proposes a framework that integrates a convolutional autoencoder (AE) with a Vision Transformer (ViT) to improve classification accuracy and provide diagnostic insights in dental age estimation. Result: The proposed framework improves classification accuracy for mandibular second (tooth 37) and third (tooth 38) molars, increasing accuracy from 0.712 to 0.815 for tooth 37 and from 0.462 to 0.543 for tooth 38. Conclusion: The study concludes that combining a convolutional autoencoder with a Vision Transformer enhances both performance and transparency in dental age estimation, addressing model uncertainty and supporting expert decision-making. Abstract: The practical adoption of deep learning in high-stakes forensic applications, such as dental age estimation, is often limited by the 'black box' nature of the models. This study introduces a framework designed to enhance both performance and transparency in this context. We use a notable performance disparity in the automated staging of mandibular second (tooth 37) and third (tooth 38) molars as a case study. The proposed framework, which combines a convolutional autoencoder (AE) with a Vision Transformer (ViT), improves classification accuracy for both teeth over a baseline ViT, increasing from 0.712 to 0.815 for tooth 37 and from 0.462 to 0.543 for tooth 38. Beyond improving performance, the framework provides multi-faceted diagnostic insights. Analysis of the AE's latent space metrics and image reconstructions indicates that the remaining performance gap is data-centric, suggesting high intra-class morphological variability in the tooth 38 dataset is a primary limiting factor. 
This work highlights the insufficiency of relying on a single mode of interpretability, such as attention maps, which can appear anatomically plausible yet fail to identify underlying data issues. By offering a methodology that both enhances accuracy and provides evidence for why a model may be uncertain, this framework serves as a more robust tool to support expert decision-making in forensic age estimation.[71] SCoDA: Self-supervised Continual Domain Adaptation
Chirayu Agrawal,Snehasis Mukherjee
Main category: cs.CV
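TL;DR: This paper proposes SCoDA, a new source-free domain adaptation method that improves adaptation without access to source data through self-supervised initialization and geometric manifold alignment.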
Details
Motivation: Existing SFDA methods rely on instance-level feature alignment and discard geometric information about the latent manifold of the source model, which limits adaptation. Method: A teacher model pre-trained entirely via self-supervised learning (SSL) is used, the student is trained with a composite objective combining instance-level feature matching with a Space Similarity Loss, and the teacher's parameters are updated via an exponential moving average (EMA) of the student's parameters to prevent catastrophic forgetting. Result: Extensive experiments on benchmark datasets show that SCoDA significantly outperforms state-of-the-art SFDA methods. Conclusion: By avoiding reliance on supervised pre-training and adapting the principle of geometric manifold alignment, SCoDA significantly outperforms existing SFDA methods. Abstract: Source-Free Domain Adaptation (SFDA) addresses the challenge of adapting a model to a target domain without access to the data of the source domain. Prevailing methods typically start with a source model pre-trained with full supervision and distill the knowledge by aligning instance-level features. However, these approaches, relying on cosine similarity over L2-normalized feature vectors, inadvertently discard crucial geometric information about the latent manifold of the source model. We introduce Self-supervised Continual Domain Adaptation (SCoDA) to address these limitations. We make two key departures from standard practice: first, we avoid the reliance on supervised pre-training by initializing the proposed framework with a teacher model pre-trained entirely via self-supervision (SSL). Second, we adapt the principle of geometric manifold alignment to the SFDA setting. The student is trained with a composite objective combining instance-level feature matching with a Space Similarity Loss. To combat catastrophic forgetting, the teacher's parameters are updated via an Exponential Moving Average (EMA) of the student's parameters. Extensive experiments on benchmark datasets demonstrate that SCoDA significantly outperforms state-of-the-art SFDA methods.[72] Segment Anything for Cell Tracking
Zhu Chen,Mert Edgü,Er Jin,Johannes Stegmaier
Main category: cs.CV
TL;DR: This paper proposes a zero-shot cell tracking framework using SAM2, a foundation model for segmentation, which achieves strong performance without requiring manually labeled data or dataset-specific fine-tuning.
Details
Motivation: The motivation stems from the challenges in cell tracking and mitotic event detection due to low signal-to-noise ratios, dense clusters, and diverse microscopy data, along with the high cost and time required for manually labeling datasets in existing deep learning methods. Method: The method uses Segment Anything 2 (SAM2), a general image and video segmentation model, in a fully unsupervised manner to track cells and detect mitotic events in time-lapse microscopy images without requiring manually labeled training data. Result: The approach achieves competitive accuracy in both 2D and large-scale 3D time-lapse microscopy videos while eliminating the need for dataset-specific adaptation. Conclusion: The proposed zero-shot cell tracking framework integrates SAM2, a large foundation model, into the tracking pipeline, enabling generalization across diverse microscopy datasets without fine-tuning. Abstract: Tracking cells and detecting mitotic events in time-lapse microscopy image sequences is a crucial task in biomedical research. However, it remains highly challenging due to dividing objects, low signal-to-noise ratios, indistinct boundaries, dense clusters, and the visually similar appearance of individual cells. Existing deep learning-based methods rely on manually labeled datasets for training, which is both costly and time-consuming. Moreover, their generalizability to unseen datasets remains limited due to the vast diversity of microscopy data. To overcome these limitations, we propose a zero-shot cell tracking framework by integrating Segment Anything 2 (SAM2), a large foundation model designed for general image and video segmentation, into the tracking pipeline. As a fully unsupervised approach, our method does not depend on or inherit biases from any specific training dataset, allowing it to generalize across diverse microscopy datasets without fine-tuning.
Our approach achieves competitive accuracy in both 2D and large-scale 3D time-lapse microscopy videos while eliminating the need for dataset-specific adaptation.
[73] Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation
Vu-Minh Le,Thao-Anh Tran,Duc Huy Do,Xuan Canh Do,Huong Ninh,Hai Tran
Main category: cs.CV
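TL;DR: This paper proposes a 3D multi-target multi-camera tracking framework based on depth information and an online data association mechanism, enabling 3D tracking without replacing the existing 2D system from the ground up.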
Details
Motivation: Existing MTMC systems cannot easily have all their components replaced with 3D tracking ones, so a method is needed that adds 3D tracking capability while retaining the 2D system. Method: Depth information is used to reconstruct each target in point-cloud space, and its 3D box is recovered through clustering and yaw refinement; an enhanced online data association mechanism uses the consistency of a target's local IDs to assign global IDs across frames. Result: The method places 3rd on the leaderboard of the 2025 AI City Challenge's 3D MTMC dataset. Conclusion: The paper presents a way to extend an online 2D multi-camera tracking system into 3D space, ranking 3rd on the 2025 AI City Challenge 3D MTMC leaderboard. Abstract: Multi-Target Multi-Camera Tracking (MTMC) is an essential computer vision task for automating large-scale surveillance. With camera calibration and depth information, the targets in the scene can be projected into 3D space, offering unparalleled levels of automatic perception of a 3D environment. However, tracking in the 3D space requires replacing all 2D tracking components from the ground up, which may be infeasible for existing MTMC systems. In this paper, we present an approach for extending any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. We also introduce an enhanced online data association mechanism that leverages the target's local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge's 3D MTMC dataset, achieving 3rd place on the leaderboard.
[74] Zero-Shot Referring Expression Comprehension via Visual-Language True/False Verification
Jeffrey Liu,Rongbin Hu
Main category: cs.CV
TL;DR: A zero-shot workflow for Referring Expression Comprehension achieves strong performance without task-specific training by reformulating the task as box-wise visual-language verification.
Details
Motivation: The authors aim to challenge the conventional approach of using task-trained grounding models for REC by demonstrating that a zero-shot workflow can achieve competitive or superior results. Method: The method reformulates REC as box-wise visual-language verification using a generic detector (YOLO-World) and a general-purpose Vision-Language Model (VLM) to answer True/False queries for each region. Result: On benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg, the proposed method outperforms a zero-shot GroundingDINO baseline and exceeds results from GroundingDINO trained on REC and GroundingDINO+CRG. Conclusion: The proposed zero-shot workflow for Referring Expression Comprehension (REC) outperforms existing methods without requiring task-specific training. Abstract: Referring Expression Comprehension (REC) is usually addressed with task-trained grounding models. We show that a zero-shot workflow, without any REC-specific training, can achieve competitive or superior performance. Our approach reformulates REC as box-wise visual-language verification: given proposals from a COCO-clean generic detector (YOLO-World), a general-purpose VLM independently answers True/False queries for each region. This simple procedure reduces cross-box interference, supports abstention and multiple matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our method not only surpasses a zero-shot GroundingDINO baseline but also exceeds reported results for GroundingDINO trained on REC and GroundingDINO+CRG. Controlled studies with identical proposals confirm that verification significantly outperforms selection-based prompting, and results hold with open VLMs. Overall, we show that workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance.
[75] Augment to Segment: Tackling Pixel-Level Imbalance in Wheat Disease and Pest Segmentation
Tianqi Wei,Xin Yu,Zhi Chen,Scott Chapman,Zi Huang
Main category: cs.CV
TL;DR: The paper introduces an effective augmentation technique (RPCP) to address extreme pixel imbalance in wheat disease and insect damage segmentation tasks.
Details
Motivation: Accurate segmentation of foliar diseases and insect damage in wheat is crucial for crop management, but insect damage pixels are rare, causing challenges in segmentation performance. Method: A Random Projected Copy-and-Paste (RPCP) augmentation technique was developed, which involves extracting rare insect-damage patches, applying random geometric transformations, and pasting them in appropriate regions while avoiding overlaps. A random projection filter was also applied for feature refinement. Result: The proposed method significantly improved segmentation performance for the insect damage class, while maintaining or slightly improving accuracy for other categories. Conclusion: The proposed RPCP augmentation technique effectively addresses the pixel imbalance problem in segmentation tasks for agricultural applications. Abstract: Accurate segmentation of foliar diseases and insect damage in wheat is crucial for effective crop management and disease control. However, the insect damage typically occupies only a tiny fraction of annotated pixels. This extreme pixel-level imbalance poses a significant challenge to the segmentation performance, which can result in overfitting to common classes and insufficient learning of rare classes, thereby impairing overall performance. In this paper, we propose a Random Projected Copy-and-Paste (RPCP) augmentation technique to address the pixel imbalance problem. Specifically, we extract rare insect-damage patches from annotated training images and apply random geometric transformations to simulate variations. The transformed patches are then pasted in appropriate regions while avoiding overlaps with lesions or existing damaged regions. In addition, we apply a random projection filter to the pasted regions, refining local features and ensuring a natural blend with the new background. 
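The copy-and-paste step described in the abstract can be sketched roughly as follows. This is a minimal illustration on label maps only, not the authors' implementation: the class IDs are hypothetical, the geometric transformation is assumed to be a random flip plus a multiple of 90-degree rotation, "appropriate regions" is assumed to mean healthy leaf pixels, and the random projection filter is not reproduced because the abstract does not specify it.

```python
import random

# Hypothetical class IDs; the paper's label scheme is not given in the abstract.
BACKGROUND, LEAF, DISEASE, INSECT = 0, 1, 2, 3


def random_transform(patch, rng):
    """Apply a random horizontal flip and a random multiple of 90-degree rotation."""
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]
    for _ in range(rng.randrange(4)):
        patch = [list(row) for row in zip(*patch[::-1])]  # rotate 90 degrees clockwise
    return patch


def paste_rare_patch(labels, patch, rng, max_tries=100):
    """Paste a binary insect-damage patch onto healthy leaf pixels only.

    labels: 2D list of class IDs (a modified copy is returned).
    patch:  2D list of 0/1 marking the damaged pixels to copy.
    Returns (new_labels, pasted); pasted is False if no valid spot was found.
    """
    patch = random_transform(patch, rng)
    ph, pw = len(patch), len(patch[0])
    h, w = len(labels), len(labels[0])
    if ph > h or pw > w:
        return [row[:] for row in labels], False
    for _ in range(max_tries):
        top = rng.randrange(h - ph + 1)
        left = rng.randrange(w - pw + 1)
        # Accept only if every pasted pixel lands on healthy leaf, so the patch
        # never overlaps lesions, existing damage, or background.
        fits = all(
            labels[top + i][left + j] == LEAF
            for i in range(ph) for j in range(pw) if patch[i][j]
        )
        if fits:
            out = [row[:] for row in labels]
            for i in range(ph):
                for j in range(pw):
                    if patch[i][j]:
                        out[top + i][left + j] = INSECT
            return out, True
    return [row[:] for row in labels], False
```

In a full pipeline the same offsets would also be used to copy the image pixels of the patch; rejection sampling is used here only because it keeps the overlap check easy to read.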
Experiments show that our method substantially improves segmentation performance on the insect damage class, while maintaining or even slightly enhancing accuracy on other categories. Our results highlight the effectiveness of targeted augmentation in mitigating extreme pixel imbalance, offering a straightforward yet effective solution for agricultural segmentation problems.
[76] An HMM-based framework for identity-aware long-term multi-object tracking from sparse and uncertain identification: use case on long-term tracking in livestock
Anne Marthe Sophie Ngo Bibinbe,Chiron Bang,Patrick Gagnon,Jamie Ahloy-Dallaire,Eric R. Paquet
Main category: cs.CV
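TL;DR: This paper proposes a new Hidden Markov Model (HMM) framework that incorporates uncertain identity information to address identity switches in long-term multi-object tracking, and validates the resulting gains on a real-world dataset and on standard benchmarks.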
Details
Motivation: Identity switches degrade existing MOT methods on long videos, so they fall short of what real applications such as livestock farming need for analyzing individual behavior. In such applications, identities for some individuals can be obtained from devices such as feeders, and the paper uses this uncertain identity information to improve tracking. Method: The paper proposes a new HMM framework that combines uncertain identifications with tracking. Built on existing MOT methods such as ByteTrack and FairMOT, it uses the HMM to handle identity uncertainty and thereby improve tracking performance. Result: On a 10-minute pig tracking dataset, the HMM framework improves ByteTrack's F1 score even with only 21 identifications. Its effectiveness is also validated on the MOT17 and MOT20 benchmarks, and experiments show that tracking performance increases as identifications become more frequent. Conclusion: The paper proposes an HMM-based framework that combines uncertain identity information with tracking to address identity switches in long-term multi-object tracking (MOT). The approach performs well in a real-world application and is validated on the MOT17 and MOT20 datasets. Abstract: The need for long-term multi-object tracking (MOT) is growing due to the demand for analyzing individual behaviors in videos that span several minutes. Unfortunately, due to identity switches between objects, the tracking performance of existing MOT approaches decreases over time, making them difficult to apply for long-term tracking. However, in many real-world applications, such as in the livestock sector, it is possible to obtain sporadic identifications for some of the animals from sources like feeders. To address the challenges of long-term MOT, we propose a new framework that combines both uncertain identities and tracking using a Hidden Markov Model (HMM) formulation. In addition to providing real-world identities to animals, our HMM framework improves the F1 score of ByteTrack, a leading MOT approach even with re-identification, on a 10-minute pig tracking dataset with 21 identifications at the pen's feeding station. We also show that our approach is robust to the uncertainty of identifications, with performance increasing as identities are provided more frequently. The improved performance of our HMM framework was also validated on the MOT17 and MOT20 benchmark datasets using both ByteTrack and FairMOT. The code for this new HMM framework and the new 10-minute pig tracking video dataset are available at: https://github.com/ngobibibnbe/uncertain-identity-aware-tracking
[77] Event Camera Guided Visual Media Restoration & 3D Reconstruction: A Survey
Aupendu Kar,Vishnu Raj,Guan-Ming Su
Main category: cs.CV
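TL;DR: This survey reviews the development of event camera sensors and how fusing them with conventional frame-based capture significantly benefits video restoration and 3D reconstruction tasks.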
Details
Motivation: Event camera sensors are a rapidly advancing emerging field, driven by low latency, low power consumption, and ultra-high capture rates, which motivates this survey. Method: The survey systematically reviews major deep learning contributions to image/video enhancement and restoration along two dimensions, temporal enhancement and spatial enhancement, and examines how 3D reconstruction has evolved with event-driven fusion. Result: The survey shows how fusing event streams with conventional frame-based capture significantly benefits a range of video restoration and 3D reconstruction tasks, and discusses in depth recent work on improving visual quality under challenging conditions. Conclusion: By consolidating recent progress and insights, the survey aims to inspire further research into using event camera systems, especially combined with deep learning, for advanced visual media restoration and enhancement. Abstract: Event camera sensors are bio-inspired sensors which asynchronously capture per-pixel brightness changes and output a stream of events encoding the polarity, location and time of these changes. These systems are witnessing rapid advancements as an emerging field, driven by their low latency, reduced power consumption, and ultra-high capture rates. This survey explores the evolution of fusing event-stream capture with traditional frame-based capture, highlighting how this synergy significantly benefits various video restoration and 3D reconstruction tasks. The paper systematically reviews major deep learning contributions to image/video enhancement and restoration, focusing on two dimensions: temporal enhancement (such as frame interpolation and motion deblurring) and spatial enhancement (including super-resolution, low-light and HDR enhancement, and artifact reduction). This paper also explores how the 3D reconstruction domain evolves with the advancement of event-driven fusion. Diverse topics are covered, with in-depth discussions on recent works for improving visual quality under challenging conditions. Additionally, the survey compiles a comprehensive list of openly available datasets, enabling reproducible research and benchmarking. By consolidating recent progress and insights, this survey aims to inspire further research into leveraging event camera systems, especially in combination with deep learning, for advanced visual media restoration and enhancement.
[78] ISTASTrack: Bridging ANN and SNN via ISTA Adapter for RGB-Event Tracking
Siying Liu,Zikai Wang,Hanle Zheng,Yifan Hu,Xilin Wang,Qingkai Yang,Jibin Wu,Hao Guo,Lei Deng
Main category: cs.CV
TL;DR: ISTASTrack is a transformer-based ANN-SNN hybrid tracker that fuses RGB and event data through ISTA adapters and performs strongly on multiple benchmarks.
Details
Motivation: Existing artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams. Hybrid architectures combining ANNs and spiking neural networks (SNNs) show promise for RGB-Event perception, but fusing features across the two paradigms remains a challenge. Method: ISTASTrack uses a two-branch model: a vision transformer extracts spatial context from RGB inputs, and a spiking transformer captures spatio-temporal dynamics from event streams. An ISTA adapter grounded in sparse representation theory provides bidirectional feature interaction between the branches, and a temporal downsampling attention module aligns multi-step SNN features with single-step ANN features. Result: ISTASTrack achieves state-of-the-art performance on RGB-Event tracking benchmarks including FE240hz, VisEvent, COESOT, and FELT while maintaining high energy efficiency. Conclusion: By combining the strengths of ANNs and SNNs, ISTASTrack achieves efficient RGB-Event tracking and performs strongly on multiple benchmarks, demonstrating the effectiveness and practicality of hybrid designs for visual tracking. Abstract: RGB-Event tracking has become a promising trend in visual object tracking to leverage the complementary strengths of both RGB images and dynamic spike events for improved performance. However, existing artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams. Recent efforts toward hybrid architectures combining ANNs and spiking neural networks (SNNs) have emerged as a promising solution in RGB-Event perception, yet effectively fusing features across heterogeneous paradigms remains a challenge. In this work, we propose ISTASTrack, the first transformer-based ANN-SNN hybrid Tracker equipped with ISTA adapters for RGB-Event tracking. The two-branch model employs a vision transformer to extract spatial context from RGB inputs and a spiking transformer to capture spatio-temporal dynamics from event streams. To bridge the modality and paradigm gap between ANN and SNN features, we systematically design a model-based ISTA adapter for bidirectional feature interaction between the two branches, derived from sparse representation theory by unfolding the iterative shrinkage thresholding algorithm. Additionally, we incorporate a temporal downsampling attention module within the adapter to align multi-step SNN features with single-step ANN features in the latent space, improving temporal fusion.
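For readers unfamiliar with the algorithm the adapter is derived from, the classical iterative shrinkage thresholding algorithm (ISTA) for sparse coding is sketched below. This is the textbook procedure, not the paper's adapter: an unfolded variant would typically treat a fixed number of these iterations as network layers with learned parameters, and how ISTASTrack does so is not described in the abstract. The function names and the conservative step-size choice are assumptions made for this illustration.

```python
def soft_threshold(v, t):
    """Elementwise shrinkage operator: sign(v) * max(|v| - t, 0)."""
    return [(abs(x) - t) * (1 if x > 0 else -1) if abs(x) > t else 0.0 for x in v]


def ista(A, y, lam, n_iter=500):
    """Classical ISTA for min_x 0.5 * ||A x - y||^2 + lam * ||x||_1.

    A is a list of rows and y a list of floats. The step size 1/L uses the
    squared Frobenius norm of A as L, a safe upper bound on the Lipschitz
    constant of the gradient (assumption: chosen for simplicity, not speed).
    """
    m, n = len(A), len(A[0])
    L = sum(a * a for row in A for a in row)
    x = [0.0] * n
    for _ in range(n_iter):
        # Gradient of the quadratic term: A^T (A x - y)
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        grad = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by shrinkage, which promotes sparsity
        x = soft_threshold([x[j] - grad[j] / L for j in range(n)], lam / L)
    return x
```

With an identity dictionary the solution reduces to a single shrinkage of the input, so `ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.2], lam=0.5)` converges to approximately `[2.5, 0.0]`: the small coefficient is zeroed out and the large one is shrunk by `lam`.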
Experimental results on RGB-Event tracking benchmarks, such as FE240hz, VisEvent, COESOT, and FELT, have demonstrated that ISTASTrack achieves state-of-the-art performance while maintaining high energy efficiency, highlighting the effectiveness and practicality of hybrid ANN-SNN designs for robust visual tracking. The code is publicly available at https://github.com/lsying009/ISTASTrack.git.
[79] FLARE-SSM: Deep State Space Models with Influence-Balanced Loss for 72-Hour Solar Flare Prediction
Yusuke Takagi,Shunya Nagashima,Komei Sugiura
Main category: cs.CV
TL;DR: This study presents a solar flare prediction model using deep state space models and a novel loss function (FLARE loss) to address class imbalance, achieving improved performance and reliability in forecasting solar flares.
Details
Motivation: Accurate and reliable solar flare predictions are essential to mitigate potential impacts on critical infrastructure, but the current performance of solar flare forecasting is insufficient, particularly due to challenges posed by severe class imbalance across flare classes. Method: The study introduces a solar flare prediction model using multiple deep state space models and proposes the frequency & local-boundary-aware reliability loss (FLARE loss) to improve predictive performance and reliability. Experiments were conducted on a multi-wavelength solar image dataset covering a full 11-year solar activity cycle. Result: The proposed method outperformed baseline approaches in terms of both the Gandin-Murphy-Gerrity score and the true skill statistic, which are standard metrics for evaluating performance and reliability in solar flare forecasting. Conclusion: The proposed solar flare prediction model based on multiple deep state space models and the FLARE loss outperforms baseline approaches in predictive performance and reliability under class imbalance. Abstract: Accurate and reliable solar flare predictions are essential to mitigate potential impacts on critical infrastructure. However, the current performance of solar flare forecasting is insufficient. In this study, we address the task of predicting the class of the largest solar flare expected to occur within the next 72 hours. Existing methods often fail to adequately address the severe class imbalance across flare classes. To address this issue, we propose a solar flare prediction model based on multiple deep state space models. In addition, we introduce the frequency & local-boundary-aware reliability loss (FLARE loss) to improve predictive performance and reliability under class imbalance. Experiments were conducted on a multi-wavelength solar image dataset covering a full 11-year solar activity cycle. 
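To see why the true skill statistic (TSS) named in the results is preferred over accuracy under class imbalance, a small sketch of its standard binary form follows. The paper predicts multiple flare classes, so this two-class version and the example counts are simplifying assumptions for illustration only.

```python
def true_skill_statistic(tp, fn, fp, tn):
    """TSS = hit rate - false alarm rate = TP/(TP+FN) - FP/(FP+TN).

    Assumes both classes are present (tp + fn > 0 and fp + tn > 0).
    """
    return tp / (tp + fn) - fp / (fp + tn)


def accuracy(tp, fn, fp, tn):
    """Fraction of all forecasts that were correct."""
    return (tp + tn) / (tp + fn + fp + tn)


# Hypothetical counts: 10 flare days and 990 quiet days, with a forecaster
# that always predicts "no flare".
always_quiet = dict(tp=0, fn=10, fp=0, tn=990)
acc = accuracy(**always_quiet)              # high despite missing every flare
tss = true_skill_statistic(**always_quiet)  # zero skill
```

Because TSS is computed from per-class rates rather than overall counts, the trivial forecaster scores 0.99 on accuracy but 0.0 on TSS, while a perfect forecaster would score 1.0 on both.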
As a result, our method outperformed baseline approaches in terms of both the Gandin-Murphy-Gerrity score and the true skill statistic, which are standard metrics of performance and reliability.
[80] TUNI: Real-time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion
Xiaodong Guo,Tong Liu,Yike Li,Zi'ang Lin,Zhihong Deng
Main category: cs.CV
TL;DR: This paper proposes TUNI, which uses a unified RGB-T encoder and an RGB-T local module to achieve efficient, competitive RGB-thermal semantic segmentation.
Details
Motivation: Existing RGB-T semantic segmentation models process RGB and infrared inputs with separate encoders, which limits thermal feature extraction, yields suboptimal cross-modal fusion, and reduces real-time efficiency. Method: TUNI comprises an RGB-T encoder and an RGB-T local module. The encoder jointly learns multi-modal feature extraction and cross-modal fusion through large-scale pre-training on RGB and pseudo-thermal data, and a slimmed-down thermal branch makes the architecture more compact. The local module uses adaptive cosine similarity to emphasize salient consistent and distinct local features across the RGB-T modalities. Result: TUNI achieves performance competitive with state-of-the-art models on FMB, PST900 and CART with fewer parameters and lower computational cost, and runs at 27 FPS on a Jetson Orin NX. Conclusion: With its unified RGB-T encoder and local module, TUNI achieves competitive RGB-thermal semantic segmentation performance while reducing parameter count and computational cost, and it is suitable for real-time deployment. Abstract: RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross-modal feature fusion. This results in limited thermal feature extraction and suboptimal cross-modal fusion, while the redundant encoders further compromise the model's real-time efficiency. To address the above issues, we propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion. By leveraging large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB-T local module to strengthen the encoder's capacity for cross-modal local feature fusion. The RGB-T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB-T modalities. Experimental results show that TUNI achieves competitive performance with state-of-the-art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real-time capability in deployment.
Code is available at https://github.com/xiaodonguo/TUNI.
[81] Few-Part-Shot Font Generation
Masaki Akiba,Shumpei Takezaki,Daichi Haraguchi,Seiichi Uchida
Main category: cs.CV
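TL;DR: This paper introduces a new few-shot font generation approach that needs only partial character shapes to generate a complete font, improving efficiency and offering design insights.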
Details
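Motivation: Existing few-shot font generation methods require complete character shapes; this paper aims to make font generation more efficient and flexible by using only partial design elements. Method: A font generation model is designed that takes only partial shapes as input, unlike conventional few-shot font generation, which requires entire character shapes. Result: The model generates a complete font from partial shapes alone and reveals how partial designs influence the overall character structure. Conclusion: The paper proposes a new few-shot font generation model based on partial design elements, which improves the efficiency of font creation and provides insight into how partial design details influence the overall structure of individual characters. Abstract: This paper proposes a novel model of few-part-shot font generation, which designs an entire font based on a set of partial design elements, i.e., partial shapes. Unlike conventional few-shot font generation, which requires entire character shapes for a couple of character classes, our approach only needs partial shapes as input. The proposed model not only improves the efficiency of font creation but also provides insights into how partial design details influence the entire structure of the individual characters.
[82] Efficient and Accurate Downfacing Visual Inertial Odometry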
Jonas Kühne,Christian Vogt,Michele Magno,Luca Benini
Main category: cs.CV
TL;DR: This paper introduces an efficient Visual Inertial Odometry pipeline optimized for ultra-low-power chips, achieving significant improvements in accuracy and computational efficiency for micro- and nano-UAVs.
Details
Motivation: The motivation is to develop an efficient and accurate Visual Inertial Odometry (VIO) pipeline suitable for micro- and nano-UAVs, which operate on ultra-low-power systems and require real-time performance with limited computational resources. Method: The paper employs state-of-the-art feature detection and tracking methods (SuperPoint, PX4FLOW, ORB), optimized and quantized for RISC-V-based ultra-low-power SoCs, and incorporates a rigid body motion model to enhance accuracy in planar motion scenarios. Result: The optimized VIO pipeline implemented on the GAP9 low-power SoC demonstrated an average reduction in RMSE of up to 3.65x using the ORB feature tracker, and PX4FLOW showed comparable tracking accuracy with lower runtime at movement speeds below 24 pixels/frame. Conclusion: The paper concludes that the proposed VIO pipeline effectively bridges the gap between high-accuracy VIO systems and lightweight implementations suitable for microcontrollers, demonstrating significant improvements in accuracy and efficiency on ultra-low-power SoCs. Abstract: Visual Inertial Odometry (VIO) is a widely used computer vision method that determines an agent's movement through a camera and an IMU sensor. This paper presents an efficient and accurate VIO pipeline optimized for applications on micro- and nano-UAVs. The proposed design incorporates state-of-the-art feature detection and tracking methods (SuperPoint, PX4FLOW, ORB), all optimized and quantized for emerging RISC-V-based ultra-low-power parallel systems on chips (SoCs). Furthermore, by employing a rigid body motion model, the pipeline reduces estimation errors and achieves improved accuracy in planar motion scenarios. The pipeline's suitability for real-time VIO is assessed on an ultra-low-power SoC in terms of compute requirements and tracking accuracy after quantization. The pipeline, including the three feature tracking methods, was implemented on the SoC for real-world validation. 
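As a rough illustration of what a rigid body motion model buys in a planar, downward-facing setting, the sketch below fits a single 2D rotation and translation to a set of tracked feature correspondences in the least-squares sense. This is a generic closed-form fit and an assumption about the general idea; the abstract does not state how the paper's pipeline formulates or applies its motion model, and `fit_rigid_2d` is a hypothetical helper.

```python
import math


def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src points onto dst.

    src, dst: equal-length lists of (x, y) feature positions in two frames.
    Returns (theta, (tx, ty)) such that dst ~= R(theta) @ src + t.
    """
    n = len(src)
    cx_s = sum(p[0] for p in src) / n
    cy_s = sum(p[1] for p in src) / n
    cx_d = sum(p[0] for p in dst) / n
    cy_d = sum(p[1] for p in dst) / n
    sum_dot = sum_cross = 0.0
    for (xs, ys), (xd, yd) in zip(src, dst):
        ax, ay = xs - cx_s, ys - cy_s  # centered source point
        bx, by = xd - cx_d, yd - cy_d  # centered destination point
        sum_dot += ax * bx + ay * by
        sum_cross += ax * by - ay * bx
    theta = math.atan2(sum_cross, sum_dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation moves the rotated source centroid onto the destination centroid
    tx = cx_d - (c * cx_s - s * cy_s)
    ty = cy_d - (s * cx_s + c * cy_s)
    return theta, (tx, ty)
```

Constraining all features to share one rotation and translation averages out per-feature tracking noise, which is one plausible reason such a model can reduce estimation error when the motion really is planar.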
This design bridges the gap between high-accuracy VIO pipelines that are traditionally run on computationally powerful systems and lightweight implementations suitable for microcontrollers. The optimized pipeline on the GAP9 low-power SoC demonstrates an average reduction in RMSE of up to a factor of 3.65 over the baseline pipeline when using the ORB feature tracker. The analysis of the computational complexity of the feature trackers further shows that PX4FLOW achieves on-par tracking accuracy with ORB at a lower runtime for movement speeds below 24 pixels/frame.
[83] Hierarchical MLANet: Multi-level Attention for 3D Face Reconstruction From Single Images
Danling Cao
Main category: cs.CV
TL;DR: This paper introduces MLANet, a deep learning method for reconstructing 3D face models from single 2D images using attention mechanisms and semi-supervised training, showing strong performance on benchmark datasets.
Details
Motivation: Recovering 3D face models from 2D in-the-wild images is challenging due to the lack of labeled datasets and real-world complexity, yet it has broad applications in computer vision. Method: A convolutional neural network-based approach called Hierarchical Multi-Level Attention Network (MLANet) was developed, which uses a pre-trained hierarchical backbone and multi-level attention mechanisms. A semi-supervised training strategy incorporating 3D Morphable Model (3DMM) parameters and a differentiable renderer was used for end-to-end training. Result: MLANet achieved effective 3D face reconstruction and alignment on two benchmark datasets (AFLW2000-3D and MICC Florence), validated through both quantitative and qualitative evaluations. Conclusion: The proposed MLANet effectively reconstructs detailed 3D face models from single in-the-wild images, demonstrating strong performance through extensive experiments on benchmark datasets. Abstract: Recovering 3D face models from 2D in-the-wild images has gained considerable attention in the computer vision community due to its wide range of potential applications. However, the lack of ground-truth labeled datasets and the complexity of real-world environments remain significant challenges. In this chapter, we propose a convolutional neural network-based approach, the Hierarchical Multi-Level Attention Network (MLANet), for reconstructing 3D face models from single in-the-wild images. Our model predicts detailed facial geometry, texture, pose, and illumination parameters from a single image. Specifically, we employ a pre-trained hierarchical backbone network and introduce multi-level attention mechanisms at different stages of 2D face image feature extraction. A semi-supervised training strategy is employed, incorporating 3D Morphable Model (3DMM) parameters from publicly available datasets along with a differentiable renderer, enabling an end-to-end training process. 
Extensive experiments, including both comparative and ablation studies, were conducted on two benchmark datasets, AFLW2000-3D and MICC Florence, focusing on 3D face reconstruction and 3D face alignment tasks. The effectiveness of the proposed method was evaluated both quantitatively and qualitatively.
[84] LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
Jing Huang,Zhiya Tan,Shutao Gong,Fanwei Zeng,Jianshu Li
Main category: cs.CV
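TL;DR: LaV-CoT is a new language-aware visual chain-of-thought framework that, through multi-stage reasoning and automated data generation, achieves notable gains on multilingual visual question answering and outperforms existing open-source and proprietary models.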
Details
Motivation: Existing methods rely mainly on textual chain-of-thought (CoT) and offer limited support for multilingual multimodal reasoning, which falls short of real-world needs. A stronger, more scalable framework is therefore needed to improve multilingual visual question answering (mVQA). Method: LaV-CoT combines Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by multi-aspect rewards covering language consistency, structural accuracy, and semantic alignment. It uses a multi-stage reasoning pipeline consisting of text summary, language identification, spatial object-level captioning, and step-by-step logical reasoning. Result: On public datasets including MMMB, Multilingual MMBench, and MTVQA, LaV-CoT improves accuracy by up to about 9.5% over open-source baselines of similar size, surpasses models twice its size by about 2.6%, and outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. Conclusion: LaV-CoT is the first language-aware visual CoT framework supporting multilingual multimodal reasoning; through multi-stage training and automated data generation, it achieves higher accuracy than existing open-source baselines and outperforms larger models as well as advanced proprietary models. Abstract: As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment.
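The reward side of this training stage can be sketched as follows: each sampled response receives a scalar reward combined from the three named aspects, and GRPO-style training then normalizes rewards within the group of responses sampled for the same prompt. The weights, the weighted-sum combination, and the use of the population standard deviation are all assumptions for illustration; the abstract does not say how the aspects are combined or how the language-aware variant differs from standard GRPO.

```python
from statistics import mean, pstdev

# Hypothetical weights; the abstract names the three aspects but not their weighting.
WEIGHTS = {"language": 1.0, "structure": 1.0, "semantic": 1.0}


def combined_reward(scores, weights=WEIGHTS):
    """Weighted sum of per-aspect scores for one sampled response."""
    return sum(weights[k] * scores[k] for k in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are scored relative to their siblings rather than by a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group with rewards `[1.0, 2.0, 3.0]` the middle response gets zero advantage, the best gets a positive one, and the worst a negative one; if all responses score the same, every advantage is zero and the group contributes no gradient signal.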
Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2× larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at https://github.com/HJNVR/LaV-CoT
[85] Color Me Correctly: Bridging Perceptual Color Spaces and Text Embeddings for Improved Diffusion Generation
Sung-Lin Tsai,Bo-Lun Huang,Yu Ting Shen,Cheng Yu Yeo,Chiang Tseng,Bo-Kai Ruan,Wen-Sheng Lien,Hong-Han Shuai
Main category: cs.CV
TL;DR: This paper proposes a training-free framework for improving color alignment in text-to-image generation by using an LLM to clarify ambiguous color terms and refining embeddings based on color space relationships.
Details
Motivation: The motivation is to accurately align colors in text-to-image generation, particularly for nuanced and compound color terms that current diffusion models struggle with, which is critical for applications like fashion, product visualization, and interior design. Method: The method involves using a large language model (LLM) to resolve ambiguous color terms in the text prompt and then refining the text embeddings based on the spatial relationships of the resulting color terms in the CIELAB color space. Result: Experimental results demonstrate that the framework improves color alignment without compromising image quality, effectively bridging the gap between text semantics and visual generation. Conclusion: The proposed training-free framework effectively enhances color fidelity in text-to-image generation by leveraging an LLM to disambiguate color terms and refining text embeddings based on spatial relationships in the CIELAB color space, without requiring additional training or reference images. Abstract: Accurate color alignment in text-to-image (T2I) generation is critical for applications such as fashion, product visualization, and interior design, yet current diffusion models struggle with nuanced and compound color terms (e.g., Tiffany blue, lime green, hot pink), often producing images that are misaligned with human intent. Existing approaches rely on cross-attention manipulation, reference images, or fine-tuning but fail to systematically resolve ambiguous color descriptions. To precisely render colors under prompt ambiguity, we propose a training-free framework that enhances color fidelity by leveraging a large language model (LLM) to disambiguate color-related prompts and guiding color blending operations directly in the text embedding space. 
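The method summary above relies on relationships between colors in the CIELAB color space. As background, the sketch below converts 8-bit sRGB values to CIELAB (D65 white point) using the standard formulas and measures the Euclidean distance between two colors (the CIE76 color difference). How the paper maps color terms to coordinates and uses these relationships to refine text embeddings is not specified in the abstract and is not attempted here; the function names are assumptions.

```python
import math


def srgb_to_lab(r, g, b):
    """Convert 8-bit sRGB to CIELAB (L*, a*, b*) under the D65 white point."""
    def linearize(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear sRGB to CIE XYZ
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    # Normalize by the D65 reference white before the nonlinearity
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def delta_e76(lab1, lab2):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.dist(lab1, lab2)
```

CIELAB is designed so that equal distances correspond more closely to equal perceived differences than distances in RGB do, which is presumably why it is a natural space for relating nuanced color terms to one another.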
Our method first employs a large language model (LLM) to resolve ambiguous color terms in the text prompt, and then refines the text embeddings based on the spatial relationships of the resulting color terms in the CIELAB color space. Unlike prior methods, our approach improves color accuracy without requiring additional training or external reference images. Experimental results demonstrate that our framework improves color alignment without compromising image quality, bridging the gap between text semantics and visual generation.
[86] Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration
Yue Zhou,Litong Feng,Mengcheng Lan,Xue Yang,Qingyun Li,Yiping Ke,Xue Jiang,Wayne Zhang
Main category: cs.CV
TL;DR: This paper introduces AVI-Math, the first benchmark for evaluating mathematical reasoning in vision-language models (VLMs) applied to UAV-based remote sensing. It demonstrates that existing VLMs perform poorly on complex mathematical tasks and explores methods to enhance their reasoning abilities.
Details
Motivation: Current vision-language models (VLMs) lack adequate evaluation in mathematical reasoning for UAV-based remote sensing tasks such as distance computation, trajectory estimation, and spatial analysis. This work aims to address this gap by introducing a domain-specific benchmark dataset. Method: The authors introduce AVI-Math, a benchmark dataset with 3,773 high-quality questions involving mathematical reasoning in UAV imagery. They evaluate 14 prominent VLMs through comprehensive testing and explore techniques like Chain-of-Thought prompting and fine-tuning to improve reasoning performance. Result: Despite their success on previous multimodal benchmarks, the tested VLMs struggle with the mathematical reasoning tasks in AVI-Math. The analysis reveals significant limitations in their reasoning capabilities, highlighting the need for further research and improvements in this domain. Conclusion: The paper concludes that current vision-language models (VLMs) have significant limitations in mathematical reasoning, particularly in UAV-based remote sensing tasks. However, techniques like Chain-of-Thought prompting and fine-tuning show promise in addressing these challenges. The findings provide insights for advancing trustworthy VLMs for real-world applications. Abstract: Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV) based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. 
The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications. The code and datasets will be released at https://github.com/VisionXLab/avi-math
[87] BEVTraj: Map-Free End-to-End Trajectory Prediction in Bird's-Eye View with Deformable Attention and Sparse Goal Proposals
Minsang Kong,Myeongjun Kim,Sang Gu Kang,Sang Hun Lee
Main category: cs.CV
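TL;DR: BEVTraj is a trajectory prediction framework that does not rely on pre-built maps; it uses real-time sensor data and deformable attention to deliver efficient, accurate trajectory prediction for autonomous driving.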
Details
Motivation: Existing trajectory prediction methods rely on pre-built HD maps or real-time local map construction modules, which limits their adaptability, flexibility, and prediction performance. Method: BEVTraj operates on real-time sensor data in bird's-eye view space, combining deformable attention with a Sparse Goal Candidate Proposal module for end-to-end trajectory prediction. Result: Experiments show that BEVTraj performs comparably to state-of-the-art HD map-based models, requires no post-processing steps, and offers greater flexibility. Conclusion: BEVTraj is a new trajectory prediction framework that removes the dependency on pre-built maps while matching the performance of existing HD map-based models, making autonomous driving systems more flexible. Abstract: In autonomous driving, trajectory prediction is essential for ensuring safe and efficient navigation. To improve prediction accuracy, recent approaches often rely on pre-built high-definition (HD) maps or real-time local map construction modules to incorporate static environmental information. However, pre-built HD maps are limited to specific regions and cannot adapt to transient changes. In addition, local map construction modules, which recognize only predefined elements, may fail to capture critical scene details or introduce errors that degrade prediction performance. To overcome these limitations, we propose Bird's-Eye View Trajectory Prediction (BEVTraj), a novel trajectory prediction framework that operates directly in the bird's-eye view (BEV) space utilizing real-time sensor data without relying on any pre-built maps. BEVTraj leverages deformable attention to efficiently extract relevant context from dense BEV features. Furthermore, we introduce a Sparse Goal Candidate Proposal (SGCP) module, which enables full end-to-end prediction without requiring any post-processing steps. Extensive experiments demonstrate that BEVTraj achieves performance comparable to state-of-the-art HD map-based models while offering greater flexibility by eliminating the dependency on pre-built maps. The source code is available at https://github.com/Kongminsang/bevtraj.
[88] Leveraging Multi-View Weak Supervision for Occlusion-Aware Multi-Human Parsing
Laura Bragagnolo,Matteo Terreran,Leonardo Barcellona,Stefano Ghidoni
Main category: cs.CV
TL;DR: This paper proposes a novel training framework for multi-human parsing models that uses multi-view information to improve segmentation in cases of overlapping bodies, achieving a notable improvement in occlusion scenarios.
Details
Motivation: The motivation is that while current state-of-the-art approaches have achieved good results, they struggle with segmenting people with overlapping bodies, and overlapping people may appear separated from a different point of view. Method: The method involves a novel training framework that uses weak supervision on human instances and a multi-view consistency loss to integrate multi-view knowledge during training. Result: The experiments show that the approach can achieve up to a 4.20% relative improvement on human parsing over the baseline model in occlusion scenarios. Conclusion: The paper concludes that by using a novel training framework exploiting multi-view information, the multi-human parsing models can be significantly improved in occlusion scenarios. Abstract: Multi-human parsing is the task of segmenting human body parts while associating each part to the person it belongs to, combining instance-level and part-level information for fine-grained human understanding. In this work, we demonstrate that, while state-of-the-art approaches achieved notable results on public datasets, they struggle considerably in segmenting people with overlapping bodies. From the intuition that overlapping people may appear separated from a different point of view, we propose a novel training framework exploiting multi-view information to improve multi-human parsing models under occlusions. Our method integrates such knowledge during the training process, introducing a novel approach based on weak supervision on human instances and a multi-view consistency loss. Given the lack of suitable datasets in the literature, we propose a semi-automatic annotation strategy to generate human instance segmentation masks from multi-view RGB+D data and 3D human skeletons. The experiments demonstrate that the approach can achieve up to a 4.20% relative improvement on human parsing over the baseline model in occlusion scenarios.
[89] VARCO-VISION-2.0 Technical Report
Young-rok Cha,Jeongho Ju,SunYoung Park,Jong-Hyeon Lee,Younghyun Yu,Youngjune Kim
Main category: cs.CV
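TL;DR: This paper presents VARCO-VISION-2.0, an improved Korean-English bilingual vision-language model with stronger multi-image understanding and layout-aware OCR, trained with a four-stage curriculum and memory-efficient techniques for enhanced multimodal alignment.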
Details
Motivation: To develop a bilingual vision-language model for Korean and English that is more capable than its predecessor VARCO-VISION-14B, with stronger multi-image understanding and layout-aware OCR. Method: The model is trained with a four-stage curriculum and memory-efficient techniques to achieve enhanced multimodal alignment, and its safety is improved via preference optimization. Result: The 14B model ranks 8th on the OpenCompass VLM leaderboard among models of comparable scale, and shows strong spatial grounding and competitive results in both languages. Conclusion: VARCO-VISION-2.0 is released in two variants, 14B and 1.7B, suited to different deployment scenarios. These models advance the development of bilingual VLMs and their practical applications. Abstract: We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layout-aware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum with memory-efficient techniques, the model achieves enhanced multimodal alignment, while preserving core language abilities and improving safety via preference optimization. Extensive benchmark evaluations demonstrate strong spatial grounding and competitive results for both languages, with the 14B model achieving 8th place on the OpenCompass VLM leaderboard among models of comparable scale. Alongside the 14B-scale model, we release a 1.7B version optimized for on-device deployment. We believe these models advance the development of bilingual VLMs and their practical applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a full-scale 14B model and a lightweight 1.7B model.
[90] A Lightweight Ensemble-Based Face Image Quality Assessment Method with Correlation-Aware Loss
MohammadAli Hamidi,Hadi Amirpour,Luigi Atzori,Christian Timmerer
Main category: cs.CV
TL;DR: This paper proposes a lightweight, efficient FIQA method using an ensemble of compact CNNs and a correlation-aware loss to better align with human perception, achieving high performance on the VQualA benchmark.
Details
Motivation: Existing general-purpose image quality assessment methods fail to capture face-specific degradations, and state-of-the-art FIQA models are computationally heavy, limiting their practical use. Method: The approach uses an ensemble of MobileNetV3-Small and ShuffleNetV2 CNNs with prediction-level fusion and a correlation-aware loss (MSECorrLoss) to align with human perceptual judgments. Result: On the VQualA benchmark, the model achieved a Spearman rank correlation coefficient (SRCC) of 0.9829 and a Pearson linear correlation coefficient (PLCC) of 0.9894, while meeting efficiency constraints. Conclusion: The proposed lightweight FIQA method balances accuracy and computational efficiency, making it suitable for real-world deployment. Abstract: Face image quality assessment (FIQA) plays a critical role in face recognition and verification systems, especially in uncontrolled, real-world environments. Although several methods have been proposed, general-purpose no-reference image quality assessment techniques often fail to capture face-specific degradations. Meanwhile, state-of-the-art FIQA models tend to be computationally intensive, limiting their practical applicability. We propose a lightweight and efficient method for FIQA, designed for the perceptual evaluation of face images in the wild. Our approach integrates an ensemble of two compact convolutional neural networks, MobileNetV3-Small and ShuffleNetV2, with prediction-level fusion via simple averaging. To enhance alignment with human perceptual judgments, we employ a correlation-aware loss (MSECorrLoss), combining mean squared error (MSE) with a Pearson correlation regularizer. Our method achieves a strong balance between accuracy and computational cost, making it suitable for real-world deployment. 
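The combination of mean squared error with a Pearson correlation regularizer, and the prediction-level fusion by simple averaging, can be sketched as follows. This is a minimal sketch, not the authors' implementation: the abstract does not give the exact form of MSECorrLoss, so the additive form `mse + lam * (1 - r)`, the weight `lam`, and the function names are assumptions made for illustration.

```python
import math


def mse_corr_loss(pred, target, lam=0.5, eps=1e-8):
    """Assumed form of a correlation-aware loss: MSE + lam * (1 - Pearson r)."""
    n = len(pred)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n

    # Pearson correlation between predictions and targets.
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    r = cov / (math.sqrt(var_p * var_t) + eps)  # eps guards against zero variance

    # The regularizer is near 0 when r is near 1 and grows as correlation drops.
    return mse + lam * (1.0 - r)


def ensemble_predict(scores_a, scores_b):
    """Prediction-level fusion: average the two models' quality scores."""
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]
```

In a training setup the same expression would be written with a differentiable tensor library so that gradients flow through both terms; plain Python is used here only to keep the sketch self-contained.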
Experiments on the VQualA FIQA benchmark demonstrate that our model achieves a Spearman rank correlation coefficient (SRCC) of 0.9829 and a Pearson linear correlation coefficient (PLCC) of 0.9894, remaining within competition efficiency constraints.
[91] Realism Control One-step Diffusion for Real-World Image Super-Resolution
Zongliang Wu,Siming Zheng,Peng-Tao Jiang,Xin Yuan
Main category: cs.CV
TL;DR: The RCOD framework improves real-world image super-resolution by offering flexible control over the balance between fidelity and realism, achieving superior performance while maintaining computational efficiency.
Details
Motivation: Traditional one-step diffusion methods for image super-resolution lack flexible control mechanisms to balance fidelity and realism across different scenarios. Method: The RCOD framework uses a latent domain grouping strategy, a degradation-aware sampling strategy, and a visual prompt injection module to control the fidelity-realism trade-off during noise prediction. Result: RCOD outperforms state-of-the-art one-step diffusion methods in both quantitative metrics and visual qualities, with flexible realism control capabilities during inference. Conclusion: RCOD is an effective framework for Real-ISR that allows flexible control of the fidelity-realism trade-off, achieving superior performance in both quantitative metrics and visual quality while maintaining computational efficiency. Abstract: Pre-trained diffusion models have shown great potential in real-world image super-resolution (Real-ISR) tasks by enabling high-resolution reconstructions. While one-step diffusion (OSD) methods significantly improve efficiency compared to traditional multi-step approaches, they still have limitations in balancing fidelity and realism across diverse scenarios. Since the OSDs for SR are usually trained or distilled by a single timestep, they lack flexible control mechanisms to adaptively prioritize these competing objectives, which are inherently manageable in multi-step methods through adjusting sampling steps. To address this challenge, we propose a Realism Controlled One-step Diffusion (RCOD) framework for Real-ISR. RCOD provides a latent domain grouping strategy that enables explicit control over fidelity-realism trade-offs during the noise prediction phase with minimal training paradigm modifications and original training data. A degradation-aware sampling strategy is also introduced to align distillation regularization with the grouping strategy and enhance the controlling of trade-offs. 
Moreover, a visual prompt injection module is used to replace conventional text prompts with degradation-aware visual tokens, enhancing both restoration accuracy and semantic consistency. Our method achieves superior fidelity and perceptual quality while maintaining computational efficiency. Extensive experiments demonstrate that RCOD outperforms state-of-the-art OSD methods in both quantitative metrics and visual qualities, with flexible realism control capabilities in the inference stage. The code will be released.
[92] Grad-CL: Source Free Domain Adaptation with Gradient Guided Feature Disalignment
Rini Smita Thakur,Rajeev Ranjan Dwivedi,Vinod K Kurmi
Main category: cs.CV
TL;DR: Grad-CL is a source-free domain adaptation framework for optic disc and cup segmentation that improves cross-domain performance by combining gradient-guided pseudolabel refinement and cosine similarity-based contrastive learning.
Details
Motivation: Accurate segmentation of the optic disc and cup is essential for diagnosing ocular diseases like glaucoma, but existing models often perform poorly when applied to data from different imaging conditions, motivating the need for a domain adaptation framework. Method: The paper proposes Grad-CL, which combines a gradient-guided pseudolabel refinement module with a cosine similarity-based contrastive learning strategy to enhance segmentation performance without access to the original source data. Result: Grad-CL outperforms state-of-the-art unsupervised and source-free domain adaptation methods on cross-domain fundus imaging datasets in terms of segmentation accuracy and boundary delineation. Conclusion: Grad-CL is an effective source-free domain adaptation framework that improves segmentation accuracy and boundary delineation for optic disc and cup segmentation across different imaging protocols and conditions. Abstract: Accurate segmentation of the optic disc and cup is critical for the early diagnosis and management of ocular diseases such as glaucoma. However, segmentation models trained on one dataset often suffer significant performance degradation when applied to target data acquired under different imaging protocols or conditions. To address this challenge, we propose Grad-CL, a novel source-free domain adaptation framework that leverages a pre-trained source model and unlabeled target data to robustly adapt segmentation performance without requiring access to the original source data. Grad-CL combines a gradient-guided pseudolabel refinement module with a cosine similarity-based contrastive learning strategy. In the first stage, salient class-specific features are extracted via a gradient-based mechanism, enabling more accurate uncertainty quantification and robust prototype estimation for refining noisy pseudolabels.
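To make the idea of a cosine similarity-based separability objective concrete, the sketch below computes the cosine similarity between the mean feature vectors of two classes and turns it into a penalty that is zero when the class means are orthogonal or opposed. This is a generic illustration under stated assumptions, not Grad-CL's loss: the abstract does not give the formulation, and `mean_vector`, `separation_penalty`, and the clipping at zero are hypothetical choices.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)  # eps avoids division by zero


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def separation_penalty(cup_feats, disc_feats):
    """Hypothetical penalty: similarity between class means, clipped at 0.

    Minimizing it pushes the two class means toward orthogonality,
    which is one simple way to encourage inter-class separability.
    """
    sim = cosine_similarity(mean_vector(cup_feats), mean_vector(disc_feats))
    return max(0.0, sim)
```

With `cup_feats = [[1.0, 0.0], [0.9, 0.1]]` and `disc_feats = [[0.0, 1.0], [0.1, 0.9]]` the penalty is small, whereas identical feature sets give a penalty close to 1.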
In the second stage, a contrastive loss based on cosine similarity is employed to explicitly enforce inter-class separability between the gradient-informed features of the optic cup and disc. Extensive experiments on challenging cross-domain fundus imaging datasets demonstrate that Grad-CL outperforms state-of-the-art unsupervised and source-free domain adaptation methods, achieving superior segmentation accuracy and improved boundary delineation. Project and code are available at https://visdomlab.github.io/GCL/.
[93] Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization
Yifan Chang,Jie Qin,Limeng Qiao,Xiaofeng Wang,Zheng Zhu,Lin Ma,Xingang Wang
Main category: cs.CV
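TL;DR: This work addresses the training instability of vector-quantized networks caused by straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, and proposes a simple yet effective solution, VQBridge.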
Details
Motivation: Vector quantization is a key component of discrete tokenizers for image generation, but its training is often unstable because of straight-through estimation bias, one-step-behind updates, and sparse codebook gradients. Method: The paper proposes VQBridge, a projector based on the map function method that optimizes code vectors through a compress-process-recover pipeline. Result: In experiments, FVQ (the VQ network trained with VQBridge and learning annealing) proves effective, scalable, and generalizable: it reaches 100% codebook usage even with a 262k codebook and keeps improving with larger codebooks, higher vector channels, or longer training. Conclusion: FVQ achieves 100% codebook usage, remains effective and stable across different codebook configurations, and improves image generation performance. Abstract: Vector quantization (VQ) is a key component in discrete tokenizers for image generation, but its training is often unstable due to straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction performance and low codebook usage. In this work, we analyze these fundamental challenges and provide a simple yet effective solution. To maintain high codebook usage in VQ networks (VQN) during learning annealing and codebook size expansion, we propose VQBridge, a robust, scalable, and efficient projector based on the map function method. VQBridge optimizes code vectors through a compress-process-recover pipeline, enabling stable and effective codebook training. By combining VQBridge with learning annealing, our VQN achieves full (100%) codebook usage across diverse codebook configurations, which we refer to as FVQ (FullVQ). Through extensive experiments, we demonstrate that FVQ is effective, scalable, and generalizable: it attains 100% codebook usage even with a 262k-codebook, achieves state-of-the-art reconstruction performance, consistently improves with larger codebooks, higher vector channels, or longer training, and remains effective across different VQ variants. Moreover, when integrated with LlamaGen, FVQ significantly enhances image generation performance, surpassing visual autoregressive models (VAR) by 0.5 and diffusion models (DiT) by 0.2 rFID, highlighting the importance of high-quality tokenizers for strong autoregressive image generation.
[94] LayerLock: Non-collapsing Representation Learning with Progressive Freezing
Goker Erdogan,Nikhil Parthasarathy,Catalin Ionescu,Drew Hudson,Alexander Lerchner,Andrew Zisserman,Mehdi Sajjadi,Joao Carreira
Main category: cs.CV
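TL;DR: LayerLock improves self-supervised learning for video Transformers through progressive layer freezing, increasing training efficiency and avoiding representation collapse, with strong results when applied to large models.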
Details
Motivation: The authors observe that during training of video masked-autoencoding (MAE) models, Transformer layers converge in order of depth (shallow layers early, deep layers late), and that this can be exploited to improve training efficiency and to address representation collapse in latent prediction. Method: LayerLock builds on the observation that ViT layers converge in order of depth during training, and accelerates standard MAE by progressively freezing model layers according to an explicit schedule throughout training. Result: Applied to large models of up to 4B parameters, LayerLock surpasses non-latent masked prediction on the 4DS perception suite. Conclusion: LayerLock is a simple yet effective approach to self-supervised visual representation learning that accelerates standard MAE by progressively freezing model layers and supports latent prediction without "representation collapse". Abstract: We introduce LayerLock, a simple yet effective approach for self-supervised visual representation learning, that gradually transitions from pixel to latent prediction through progressive layer freezing. First, we make the observation that during training of video masked-autoencoding (MAE) models, ViT layers converge in the order of their depth: shallower layers converge early, deeper layers converge late. We then show that this observation can be exploited to accelerate standard MAE by progressively freezing the model according to an explicit schedule, throughout training. Furthermore, this same schedule can be used in a simple and scalable approach to latent prediction that does not suffer from "representation collapse". We apply our proposed approach, LayerLock, to large models of up to 4B parameters with results surpassing those of non-latent masked prediction on the 4DS perception suite.
[95] On the Geometric Accuracy of Implicit and Primitive-based Representations Derived from View Rendering Constraints
Elias De Smijter,Renaud Detry,Christophe De Vleeschouwer
Main category: cs.CV
TL;DR: The study finds that appearance embeddings improve lighting accuracy but not geometric precision in 3D object reconstruction for space robotics, and convex splatting offers more efficient representations than Gaussian splatting.
Details
Motivation: The motivation is to understand the effectiveness of appearance embeddings in improving 3D object reconstruction for space-based robotics, particularly in terms of geometric accuracy, which is crucial for tasks like interaction and collision avoidance. Method: The study uses a systematic comparison of implicit and explicit novel view synthesis methods, specifically K-Planes, Gaussian Splatting, and Convex Splatting, on the SPEED+ dataset. The focus is on evaluating the impact of appearance embeddings on geometric and photometric fidelity. Result: The results show that appearance embeddings mainly reduce the number of primitives needed for explicit methods rather than enhancing geometric fidelity. Convex splatting provides more compact and clutter-free representations compared to Gaussian splatting. Conclusion: The paper concludes that while appearance embeddings enhance photometric fidelity, they do not significantly improve geometric accuracy for space robotics applications. Convex splatting offers better representation efficiency than Gaussian splatting. Abstract: We present the first systematic comparison of implicit and explicit Novel View Synthesis methods for space-based 3D object reconstruction, evaluating the role of appearance embeddings. While embeddings improve photometric fidelity by modeling lighting variation, we show they do not translate into meaningful gains in geometric accuracy - a critical requirement for space robotics applications. Using the SPEED+ dataset, we compare K-Planes, Gaussian Splatting, and Convex Splatting, and demonstrate that embeddings primarily reduce the number of primitives needed for explicit methods rather than enhancing geometric fidelity. Moreover, convex splatting achieves more compact and clutter-free representations than Gaussian splatting, offering advantages for safety-critical applications such as interaction and collision avoidance. 
Our findings clarify the limits of appearance embeddings for geometry-centric tasks and highlight trade-offs between reconstruction quality and representation efficiency in space scenarios.
[96] GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection
Haozhen Yan,Yan Hong,Suning Lang,Jiahui Zhan,Yikun Ji,Yujie Gao,Jun Lan,Huijia Zhu,Weiqiang Wang,Jianfu Zhang
Main category: cs.CV
TL;DR: This paper proposes GAMMA, a new training framework that improves the generalization and robustness of AI-generated image detection.
Details
Motivation: Existing AI-generated image detectors generalize poorly to unseen generative models, largely because they rely on generation-specific artifacts. Method: GAMMA introduces diverse manipulation strategies and employs multi-task supervision together with a reverse cross-attention mechanism. Result: The method achieves state-of-the-art generalization performance on the GenImage benchmark, improving accuracy by 5.8%, and remains robust on newly released generative models such as GPT-4o. Conclusion: By reducing domain bias and enhancing semantic alignment, GAMMA improves the generalization and robustness of AI-generated image detection. Abstract: With generative models becoming increasingly sophisticated and diverse, detecting AI-generated images has become increasingly challenging. While existing AI-generated image detectors achieve promising performance on in-distribution generated images, their generalization to unseen generative models remains limited. This limitation is largely attributed to their reliance on generation-specific artifacts, such as stylistic priors and compression patterns. To address these limitations, we propose GAMMA, a novel training framework designed to reduce domain bias and enhance semantic alignment. GAMMA introduces diverse manipulation strategies, such as inpainting-based manipulation and semantics-preserving perturbations, to ensure consistency between manipulated and authentic content. We employ multi-task supervision with dual segmentation heads and a classification head, enabling pixel-level source attribution across diverse generative domains. In addition, a reverse cross-attention mechanism is introduced to allow the segmentation heads to guide and correct biased representations in the classification branch. Our method not only achieves state-of-the-art generalization performance on the GenImage benchmark, improving accuracy by 5.8%, but also maintains strong robustness on newly released generative models such as GPT-4o.
[97] Robustness and Diagnostic Performance of Super-Resolution Fetal Brain MRI
Ema Masterl,Tina Vipotnik Vesnaver,Žiga Špiclin
Main category: cs.CV
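TL;DR: This study compares three super-resolution reconstruction methods for fetal brain MRI and finds that NeSVoR has the highest reconstruction success rate; although the methods yield different volumetric estimates, diagnostic classification performance remains stable.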
Details
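Motivation: Fetal brain MRI typically relies on rapid multi-view 2D slice acquisitions to reduce artifacts caused by fetal motion, but these images are usually low resolution, may be corrupted by motion, and do not adequately capture 3D anatomy. Although several super-resolution reconstruction (SRR) methods have been proposed, their performance in pathological cases and their influence on downstream volumetric analysis and diagnostic tasks remain underexplored. Method: The study applies three state-of-the-art SRR methods (NiftyMIC, SVRTK, and NeSVoR) to 140 fetal brain MRI scans, including healthy controls and pathological cases (ventriculomegaly). Each high-resolution reconstruction is segmented with the BoUNTi algorithm to extract the volumes of nine principal brain structures, and visual quality, SRR success rate, volumetric measurement agreement, and diagnostic classification performance are evaluated. Result: NeSVoR shows the highest and most consistent reconstruction success rate (>90%) in both the healthy control and pathological groups. Although volumetric estimates differ significantly between SRR methods, the choice of SRR method does not affect diagnostic classification performance for ventriculomegaly. Conclusion: NeSVoR achieves the highest and most consistent reconstruction success rate for super-resolution reconstruction of fetal brain MRI. Although volumetric estimates differ significantly between SRR methods, diagnostic classification performance is unaffected, indicating that diagnosis is robust to SRR-induced volumetric variability. Abstract: Fetal brain MRI relies on rapid multi-view 2D slice acquisitions to reduce motion artifacts caused by fetal movement. However, these stacks are typically low resolution, may suffer from motion corruption, and do not adequately capture 3D anatomy. Super-resolution reconstruction (SRR) methods aim to address these limitations by combining slice-to-volume registration and super-resolution techniques to generate high-resolution (HR) 3D volumes. While several SRR methods have been proposed, their comparative performance - particularly in pathological cases - and their influence on downstream volumetric analysis and diagnostic tasks remain underexplored. In this study, we applied three state-of-the-art SRR methods - NiftyMIC, SVRTK, and NeSVoR - to 140 fetal brain MRI scans, including both healthy controls (HC) and pathological cases (PC) with ventriculomegaly (VM). Each HR reconstruction was segmented using the BoUNTi algorithm to extract volumes of nine principal brain structures. We evaluated visual quality, SRR success rates, volumetric measurement agreement, and diagnostic classification performance. NeSVoR demonstrated the highest and most consistent reconstruction success rate (>90%) across both HC and PC groups. Although significant differences in volumetric estimates were observed between SRR methods, classification performance for VM was not affected by the choice of SRR method.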
These findings highlight NeSVoR's robustness and the resilience of diagnostic performance despite SRR-induced volumetric variability.
[98] Mask Consistency Regularization in Object Removal
Hua Yuan,Jin Yuan,Yicheng Jiang,Yao Zhang,Xin Geng,Yong Rui
Main category: cs.CV
TL;DR: This paper proposes MCR, a new training strategy that introduces two mask perturbations to address mask hallucination and mask-shape bias in object removal, improving inpainting results.
Details
Motivation: To address two key challenges of object removal in image inpainting: mask hallucination and mask-shape bias. Method: The paper proposes a training strategy called Mask Consistency Regularization (MCR), which introduces two mask perturbations, dilation and reshape, and enforces consistency between the outputs of these perturbed branches and the original mask. Result: Experiments show that MCR significantly reduces hallucinations and mask-shape bias, leading to better object removal performance. Conclusion: MCR is a new training strategy that effectively addresses mask hallucination and mask-shape bias in object removal, improving inpainting results. Abstract: Object removal, a challenging task within image inpainting, involves seamlessly filling the removed region with content that matches the surrounding context. Despite advancements in diffusion models, current methods still face two critical challenges. The first is mask hallucination, where the model generates irrelevant or spurious content inside the masked region, and the second is mask-shape bias, where the model fills the masked area with an object that mimics the mask's shape rather than surrounding content. To address these issues, we propose Mask Consistency Regularization (MCR), a novel training strategy designed specifically for object removal tasks. During training, our approach introduces two mask perturbations: dilation and reshape, enforcing consistency between the outputs of these perturbed branches and the original mask. The dilated masks help align the model's output with the surrounding content, while reshaped masks encourage the model to break the mask-shape bias. This combination of strategies enables MCR to produce more robust and contextually coherent inpainting results. Our experiments demonstrate that MCR significantly reduces hallucinations and mask-shape bias, leading to improved performance in object removal.
[99] MagicMirror: A Large-Scale Dataset and Benchmark for Fine-Grained Artifacts Assessment in Text-to-Image Generation
Jia Wang,Jie Hu,Xiaoqi Ma,Hanghang Ma,Yanbing Zeng,Xiaoming Wei
Main category: cs.CV
TL;DR: MagicMirror introduces a detailed framework for evaluating and addressing artifacts in text-to-image generation, showing that artifact reduction is a key challenge for future research.
Details
Motivation: Despite progress in text-to-image generation, physical artifacts like anatomical and structural flaws persist, degrading quality and limiting applications. A systematic evaluation framework is needed due to the complexity and diversity of these artifacts. Method: The MagicMirror framework includes a taxonomy of artifacts, a large-scale annotated dataset called MagicData340K, a Vision-Language Model named MagicAssessor, and a benchmark called MagicBench for automated evaluation. Result: MagicMirror reveals that even widely adopted, high-quality models like GPT-image-1 suffer from significant artifacts, demonstrating the critical need for improvement in this area. Conclusion: MagicMirror provides a comprehensive framework for assessing artifacts in text-to-image generation, highlighting the prevalence of such artifacts even in top-tier models and emphasizing the importance of artifact reduction for future development. Abstract: Text-to-image (T2I) generation has achieved remarkable progress in instruction following and aesthetics. However, a persistent challenge is the prevalence of physical artifacts, such as anatomical and structural flaws, which severely degrade perceptual quality and limit application. Given the diversity and complexity of these artifacts, a systematic and fine-grained evaluation framework is required, which is lacking in current benchmarks. To fill this gap, we introduce MagicMirror, a comprehensive framework for artifacts assessment. We first establish a detailed taxonomy of generated image artifacts. Guided by this taxonomy, we manually annotate MagicData340K, the first human-annotated large-scale dataset of 340K generated images with fine-grained artifact labels. Building on this dataset, we train MagicAssessor, a Vision-Language Model (VLM) that provides detailed assessments and corresponding labels. 
To overcome challenges like class imbalance and reward hacking, we design a novel data sampling strategy and a multi-level reward system for Group Relative Policy Optimization (GRPO). Finally, we leverage MagicAssessor to construct MagicBench, an automated benchmark for evaluating the image artifacts of current T2I models. Our evaluation with MagicBench reveals that despite their widespread adoption, even top-tier models like GPT-image-1 are consistently plagued by significant artifacts, highlighting artifact reduction as a critical frontier for future T2I development. Project page: https://wj-inf.github.io/MagicMirror-page/.
[100] SignClip: Leveraging Mouthing Cues for Sign Language Translation by Multimodal Contrastive Fusion
Wenfang Wu,Tingting Yuan,Yupeng Li,Daling Wang,Xiaoming Fu
Main category: cs.CV
TL;DR: SignClip is a novel framework for sign language translation that effectively combines manual and non-manual signals, achieving superior performance on benchmark datasets.
Details
Motivation: Most sign language translation approaches overlook non-manual cues like mouthing, which convey essential linguistic information and help disambiguate visually similar signs. Method: SignClip utilizes a hierarchical contrastive learning framework with multi-level alignment objectives, fusing spatial gesture and lip movement features. Result: SignClip improved BLEU-4 from 24.32 to 24.71 and ROUGE from 46.57 to 48.38 on the PHOENIX14T dataset in the Gloss-free setting. Conclusion: SignClip successfully enhances sign language translation by integrating both manual and non-manual cues, outperforming previous models on benchmark datasets. Abstract: Sign language translation (SLT) aims to translate sign language videos into natural language, serving as a vital bridge for inclusive communication. While recent advances leverage powerful visual backbones and large language models, most approaches mainly focus on manual signals (hand gestures) and tend to overlook non-manual cues like mouthing. In fact, mouthing conveys essential linguistic information in sign languages and plays a crucial role in disambiguating visually similar signs. In this paper, we propose SignClip, a novel framework to improve the accuracy of sign language translation. It fuses manual and non-manual cues, specifically spatial gesture and lip movement features. In addition, SignClip introduces a hierarchical contrastive learning framework with multi-level alignment objectives, ensuring semantic consistency across sign-lip and visual-text modalities. Extensive experiments on two benchmark datasets, PHOENIX14T and How2Sign, demonstrate the superiority of our approach. For example, on PHOENIX14T, in the Gloss-free setting, SignClip surpasses the previous state-of-the-art model SpaMo, improving BLEU-4 from 24.32 to 24.71, and ROUGE from 46.57 to 48.38.[101] Detecting Text Manipulation in Images using Vision Language Models
Vidit Vidit,Pavel Korshunov,Amir Mohammadi,Christophe Ecabert,Ketan Kotwal,Sébastien Marcel
Main category: cs.CV
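TL;DR: This paper studies how large vision language models perform at text manipulation detection, finding that open-source models are making progress but still lag behind closed-source ones, and that specialized models perform poorly in real-world scenarios.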
Details
Motivation: Recent work has shown the effectiveness of large vision language models (VLMs) in image manipulation detection, but there is a knowledge gap regarding text manipulation detection. Method: The paper analyzes closed- and open-source VLMs on different text manipulation datasets and benchmarks VLMs specific to image manipulation detection. Result: The results show that open-source models still lag behind closed-source models in text manipulation detection, and that VLMs specific to image manipulation detection suffer from generalization problems. Conclusion: Although open-source models are getting closer, closed-source models like GPT-4o still lead in text manipulation detection. Abstract: Recent works have shown the effectiveness of Large Vision Language Models (VLMs or LVLMs) in image manipulation detection. However, text manipulation detection is largely missing in these studies. We bridge this knowledge gap by analyzing closed- and open-source VLMs on different text manipulation datasets. Our results suggest that open-source models are getting closer, but still lag behind closed-source ones like GPT-4o. Additionally, we benchmark image manipulation detection-specific VLMs for text manipulation detection and show that they suffer from the generalization problem. We benchmark VLMs for manipulations done on in-the-wild scene texts and on fantasy ID cards, where the latter mimic a challenging real-world misuse.[102] MCL-AD: Multimodal Collaboration Learning for Zero-Shot 3D Anomaly Detection
Gang Li,Tianjiao Chen,Mingle Zhou,Min Li,Delong Han,Jin Wan
Main category: cs.CV
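TL;DR: This paper proposes MCL-AD, a new multimodal collaboration learning framework for zero-shot 3D anomaly detection that integrates point clouds, RGB images, and text semantics to achieve state-of-the-art performance.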
Details
Motivation: Existing zero-shot 3D anomaly detection methods focus mainly on point clouds, neglecting the rich semantic cues in complementary modalities such as RGB images and text priors. Method: Proposes a novel framework named MCL-AD, comprising a Multimodal Prompt Learning Mechanism (MPLM) and a Collaborative Modulation Mechanism (CMM), which combines point clouds, RGB images, and text semantics for zero-shot 3D anomaly detection. Result: Experiments show that MCL-AD performs strongly in zero-shot 3D anomaly detection, reaching state-of-the-art results. Conclusion: MCL-AD achieves state-of-the-art performance in zero-shot 3D anomaly detection, demonstrating the effectiveness of multimodal collaboration learning in this area. Abstract: Zero-shot 3D (ZS-3D) anomaly detection aims to identify defects in 3D objects without relying on labeled training data, making it especially valuable in scenarios constrained by data scarcity, privacy, or high annotation cost. However, most existing methods focus exclusively on point clouds, neglecting the rich semantic cues available from complementary modalities such as RGB images and text priors. This paper introduces MCL-AD, a novel framework that leverages multimodal collaboration learning across point clouds, RGB images, and text semantics to achieve superior zero-shot 3D anomaly detection. Specifically, we propose a Multimodal Prompt Learning Mechanism (MPLM) that enhances the intra-modal representation capability and inter-modal collaborative learning by introducing an object-agnostic decoupled text prompt and a multimodal contrastive loss. In addition, a collaborative modulation mechanism (CMM) is proposed to fully leverage the complementary representations of point clouds and RGB images by jointly modulating the RGB image-guided and point cloud-guided branches. Extensive experiments demonstrate that the proposed MCL-AD framework achieves state-of-the-art performance in ZS-3D anomaly detection.[103] Adversarial robustness through Lipschitz-Guided Stochastic Depth in Neural Networks
Laith Nayal,Mahmoud Mousatat,Bader Rasheed
Main category: cs.CV
TL;DR: The paper proposes a depth-dependent DropPath method to improve robustness and reduce computation in Vision Transformers without sacrificing accuracy.
Details
Motivation: Deep neural networks and Vision Transformers are vulnerable to adversarial perturbations, and standard defenses often have high computational costs or lack formal guarantees. Method: Lipschitz-guided stochastic depth (DropPath) method with depth-dependent drop probabilities to control the effective Lipschitz constant of the network. Result: Experiments on CIFAR-10 with ViT-Tiny show near-baseline clean accuracy, enhanced robustness under FGSM, PGD-20, and AutoAttack, and reduced FLOPs compared to baseline and linear DropPath schedules. Conclusion: The proposed Lipschitz-guided stochastic depth method improves robustness and reduces computation in Vision Transformers while maintaining clean accuracy. Abstract: Deep neural networks and Vision Transformers achieve state-of-the-art performance in computer vision but are highly vulnerable to adversarial perturbations. Standard defenses often incur high computational cost or lack formal guarantees. We propose a Lipschitz-guided stochastic depth (DropPath) method, where drop probabilities increase with depth to control the effective Lipschitz constant of the network. This approach regularizes deeper layers, improving robustness while preserving clean accuracy and reducing computation. Experiments on CIFAR-10 with ViT-Tiny show that our custom depth-dependent schedule maintains near-baseline clean accuracy, enhances robustness under FGSM, PGD-20, and AutoAttack, and significantly reduces FLOPs compared to baseline and linear DropPath schedules.[104] A Stochastic Birth-and-Death Approach for Street Furniture Geolocation in Urban Environments
Evan Murphy,Marco Viola,Vladimir A. Krylov
Main category: cs.CV
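TL;DR: This paper proposes a new method for precise geolocation of street furniture using energy maps and a stochastic birth-and-death optimisation algorithm, offering good scalability and accuracy.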
Details
Motivation: The paper is motivated by the problem of precisely geolocating street furniture in complex urban environments, which is critical for local authorities and private stakeholders to effectively monitor and maintain public infrastructure. Method: The paper proposes a probabilistic framework based on energy maps and introduces a stochastic birth-and-death optimisation algorithm to infer the most probable configuration of assets. Result: Evaluated in a realistic simulation informed by geolocated data on street lighting infrastructure in Dublin city centre, the method demonstrates its potential for scalable and accurate urban asset mapping. Conclusion: The proposed energy-map-based probabilistic framework, combined with a stochastic birth-and-death optimisation algorithm, can effectively achieve precise geolocation of street furniture with good scalability and accuracy. Abstract: In this paper we address the problem of precise geolocation of street furniture in complex urban environments, which is a critical task for effective monitoring and maintenance of public infrastructure by local authorities and private stakeholders. To this end, we propose a probabilistic framework based on energy maps that encode the spatial likelihood of object locations. Representing the energy in a map-based geopositioned format allows the optimisation process to seamlessly integrate external geospatial information, such as GIS layers, road maps, or placement constraints, which improves contextual awareness and localisation accuracy. A stochastic birth-and-death optimisation algorithm is introduced to infer the most probable configuration of assets. We evaluate our approach using a realistic simulation informed by a geolocated dataset of street lighting infrastructure in Dublin city centre, demonstrating its potential for scalable and accurate urban asset mapping. The implementation of the algorithm will be made available in the GitHub repository https://github.com/EMurphy0108/SBD_Street_Furniture.[105] Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
Zhixin Zheng,Xinyu Wang,Chang Zou,Shaobo Wang,Linfeng Zhang
Main category: cs.CV
TL;DR: This paper introduces Cluster-Driven Feature Caching (ClusCa), a new method to accelerate diffusion transformers by leveraging spatial similarity through token clustering, significantly reducing computation while maintaining generation quality.
Details
Motivation: Diffusion transformers suffer from high computational costs due to iterative denoising. Feature caching has been introduced to accelerate them, but focuses only on temporal similarity, ignoring spatial similarity. Method: ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster, and propagates their information to all other tokens, reducing the number of tokens by over 90%. Result: ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. It is applicable to any diffusion transformer without requiring training. Conclusion: Cluster-Driven Feature Caching (ClusCa) is an effective and versatile approach for accelerating diffusion transformers, achieving significant speed improvements without compromising quality. Abstract: Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation in previous timesteps and reusing it in the following timesteps, which leverage the temporal similarity of diffusion models while ignoring the similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective for previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster and propagates their information to all the other tokens, which is able to reduce the number of tokens by over 90%. Extensive experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Besides, it can be directly applied to any diffusion transformer without requirements for training. 
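The cluster-then-propagate idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny k-means routine, the helper names (`kmeans_assign`, `cluster_cached_forward`), the choice of the first cluster member as representative, and the copy-based propagation are all assumptions made for illustration, since the abstract does not specify these details.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_assign(tokens: List[Vector], k: int, iters: int = 5) -> List[int]:
    """Toy k-means (assumed clustering method); centroids start at the first k tokens."""
    centroids = [list(t) for t in tokens[:k]]
    assign = [0] * len(tokens)
    for _ in range(iters):
        # Assign every token to its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(t, centroids[c])) for t in tokens]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [tokens[i] for i, a in enumerate(assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_cached_forward(
    tokens: List[Vector], k: int, block: Callable[[Vector], Vector]
) -> Tuple[List[Vector], int]:
    """Run `block` on one representative per cluster and reuse its output
    for all other members (copy-based propagation is an assumption)."""
    assign = kmeans_assign(tokens, k)
    outputs: List[Vector] = [[] for _ in tokens]
    calls = 0
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue
        rep_out = block(tokens[members[0]])  # the only expensive call for this cluster
        calls += 1
        for i in members:
            outputs[i] = list(rep_out)
    return outputs, calls


# Hypothetical usage: four tokens in two tight groups, a stand-in "expensive" block.
demo_tokens = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
demo_out, demo_calls = cluster_cached_forward(demo_tokens, 2, lambda v: [2 * x for x in v])
```

In this toy run the block is evaluated twice instead of four times; the saving grows with the ratio of tokens to clusters.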
For instance, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.[106] I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation
Jordan Sassoon,Michal Szczepanski,Martyna Poreba
Main category: cs.CV
TL;DR: I-Segmenter is an integer-only Vision Transformer framework for semantic segmentation that significantly reduces computational costs while maintaining competitive accuracy, enabling efficient deployment on low-resource devices.
Details
Motivation: Vision Transformers (ViTs) have high memory and computational demands, limiting their use on resource-constrained devices. Quantization improves efficiency but causes instability in ViT-based segmentation models due to error accumulation. Method: I-Segmenter replaces floating-point operations with integer-only counterparts and introduces the λ-ShiftGELU activation function to stabilize training and inference. Result: I-Segmenter achieves accuracy close to its FP32 baseline (average drop of 5.1%) while reducing model size by up to 3.8x and speeding up inference by up to 1.2x. It also performs well under one-shot PTQ with minimal calibration. Conclusion: I-Segmenter provides a practical solution for deploying ViT-based segmentation models on resource-constrained devices by achieving integer-only execution without significant loss in accuracy. Abstract: Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. Building on the Segmenter architecture, I-Segmenter systematically replaces floating-point operations with integer-only counterparts. To further stabilize both training and inference, we propose $\lambda$-ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long-tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest neighbor upsampling, ensuring integer-only execution throughout the computational graph. 
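To see why nearest neighbor upsampling fits an integer-only pipeline, consider the following sketch. The function name `upsample_nearest` and the restriction to an integer scale factor are assumptions for illustration; the paper's decoder may differ. Bilinear interpolation would need fractional weights, whereas here every output pixel is a copy of an input pixel selected by integer division.

```python
from typing import List


def upsample_nearest(grid: List[List[int]], factor: int) -> List[List[int]]:
    """Nearest neighbor upsampling by an integer factor.

    Output pixel (y, x) copies input pixel (y // factor, x // factor),
    so only integer indexing is involved and values are never blended.
    """
    return [
        [row[x // factor] for x in range(len(row) * factor)]
        for row in grid
        for _ in range(factor)  # repeat each input row `factor` times
    ]


# Hypothetical 2x2 map of quantized integer activations.
demo_up = upsample_nearest([[1, 2], [3, 4]], 2)
```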
Extensive experiments show that I-Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1% on average), while reducing model size by up to 3.8x and enabling up to 1.2x faster inference with optimized runtimes. Notably, even in one-shot PTQ with a single calibration image, I-Segmenter delivers competitive accuracy, underscoring its practicality for real-world deployment.[107] GARD: Gamma-based Anatomical Restoration and Denoising for Retinal OCT
Botond Fazekas,Thomas Pinetz,Guilherme Aresta,Taha Emre,Hrvoje Bogunovic
Main category: cs.CV
TL;DR: This paper proposes GARD, a deep learning approach for reducing speckle noise in OCT images, which surpasses traditional and state-of-the-art methods in preserving anatomical details and reducing noise, as evidenced by superior PSNR, SSIM, and MSE metrics.
Details
Motivation: OCT images are inherently degraded by speckle noise, which obscures fine details and hinders accurate interpretation, and existing denoising methods struggle to balance noise reduction with the preservation of crucial anatomical structures. Method: GARD employs a Denoising Diffusion Gamma Model and introduces a Noise-Reduced Fidelity Term to accurately reflect the statistical properties of speckle and prevent the reintroduction of high-frequency noise. The Denoising Diffusion Implicit Model framework is adapted to accelerate the inference process. Result: Experiments on a dataset with paired noisy and less-noisy OCT B-scans demonstrate that GARD significantly outperforms traditional denoising methods and state-of-the-art deep learning models in terms of PSNR, SSIM, and MSE. Qualitative results confirm that GARD produces sharper edges and better preserves fine anatomical details. Conclusion: GARD is a novel deep learning approach for OCT image despeckling that outperforms traditional denoising methods and state-of-the-art deep learning models by leveraging the strengths of diffusion probabilistic models and preserving fine anatomical details. Abstract: Optical Coherence Tomography (OCT) is a vital imaging modality for diagnosing and monitoring retinal diseases. However, OCT images are inherently degraded by speckle noise, which obscures fine details and hinders accurate interpretation. While numerous denoising methods exist, many struggle to balance noise reduction with the preservation of crucial anatomical structures. This paper introduces GARD (Gamma-based Anatomical Restoration and Denoising), a novel deep learning approach for OCT image despeckling that leverages the strengths of diffusion probabilistic models. Unlike conventional diffusion models that assume Gaussian noise, GARD employs a Denoising Diffusion Gamma Model to more accurately reflect the statistical properties of speckle. 
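As background for the Gamma assumption, speckle is commonly modelled as multiplicative noise whose unit-mean Gamma distribution has a shape parameter equal to the number of "looks". The sketch below only simulates such noise on a flat signal; it is not GARD's diffusion process, and the function name `add_gamma_speckle` and the chosen parameters are illustrative assumptions.

```python
import random
from typing import List


def add_gamma_speckle(pixels: List[float], looks: float, rng: random.Random) -> List[float]:
    """Multiply each pixel by Gamma(shape=looks, scale=1/looks) noise.

    The noise has mean 1, so the expected intensity is preserved, but unlike
    additive Gaussian noise it is strictly positive and scales with the signal.
    """
    return [p * rng.gammavariate(looks, 1.0 / looks) for p in pixels]


# Hypothetical usage: a flat region of intensity 1.0 with 4-look speckle.
demo_rng = random.Random(0)
demo_noisy = add_gamma_speckle([1.0] * 20000, 4.0, demo_rng)
demo_mean = sum(demo_noisy) / len(demo_noisy)
```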
Furthermore, we introduce a Noise-Reduced Fidelity Term that utilizes a pre-processed, less-noisy image to guide the denoising process. This crucial addition prevents the reintroduction of high-frequency noise. We accelerate the inference process by adapting the Denoising Diffusion Implicit Model framework to our Gamma-based model. Experiments on a dataset with paired noisy and less-noisy OCT B-scans demonstrate that GARD significantly outperforms traditional denoising methods and state-of-the-art deep learning models in terms of PSNR, SSIM, and MSE. Qualitative results confirm that GARD produces sharper edges and better preserves fine anatomical details.[108] GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography
Yuexi Du,Lihui Chen,Nicha C. Dvornek
Main category: cs.CV
TL;DR: GLAM improves mammography analysis using a foundation visual language model that captures multi-view relationships through global and local alignment.
Details
Motivation: Existing mammography VLMs ignore domain-specific characteristics, such as multi-view relationships, leading to suboptimal predictions. Method: GLAM uses joint global and local, visual-visual, and visual-language contrastive learning to capture local cross-view alignments and fine-grained features in mammography. Result: The proposed GLAM model achieves better performance on multiple datasets compared to baseline methods. Conclusion: GLAM outperforms baselines across multiple datasets under different settings by leveraging multi-view imaging knowledge for global and local alignment. Abstract: Mammography screening is an essential tool for early detection of breast cancer. The speed and accuracy of mammography interpretation have the potential to be improved with deep learning methods. However, the development of a foundation visual language model (VLM) is hindered by limited data and domain differences between natural and medical images. Existing mammography VLMs, adapted from natural images, often ignore domain-specific characteristics, such as multi-view relationships in mammography. Unlike radiologists who analyze both views together to process ipsilateral correspondence, current methods treat them as independent images or do not properly model the multi-view correspondence learning, losing critical geometric context and resulting in suboptimal prediction. We propose GLAM: Global and Local Alignment for Multi-view mammography for VLM pretraining using geometry guidance. By leveraging the prior knowledge about the multi-view imaging process of mammograms, our model learns local cross-view alignments and fine-grained local features through joint global and local, visual-visual, and visual-language contrastive learning. Pretrained on EMBED [14], one of the largest open mammography datasets, our model outperforms baselines across multiple datasets under different settings.[109] Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos,Eda B. Özyiğit
Main category: cs.CV
TL;DR: This survey paper explores the role of visual grounding in vision language models (VLMs), reviewing key research areas, methodologies, and applications, while discussing evaluation metrics and future research directions.
Details
Motivation: Visual grounding allows models to link textual descriptions with specific visual elements, enabling a wide range of applications such as image captioning, question answering, and environment control. This survey aims to provide a structured overview of the field and its developments. Method: The paper conducts a comprehensive survey of existing research on visual grounding in VLMs, analyzing key components, methodologies, and evaluation benchmarks. Result: The paper outlines the significance of visual grounding in VLMs, identifies core development paradigms, evaluates practical applications and benchmarks, and explores the connections between visual grounding, multimodal reasoning, and chain-of-thought processes. Conclusion: The paper concludes that visual grounding is essential for the advancement of vision language models (VLMs), highlighting its role in enhancing multimodal understanding and interaction. It emphasizes the importance of addressing current challenges to unlock future research directions. Abstract: Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in various domains, including referring expression comprehension, answering questions pertinent to fine-grained details in images or videos, caption visual context by explicitly referring to entities, as well as low and high-level control in simulated and real environments. In this survey paper, we review representative works across the key areas of research on modern general-purpose vision language models (VLMs). We first outline the importance of grounding in VLMs, then delineate the core components of the contemporary paradigm for developing grounded models, and examine their practical applications, including benchmarks and evaluation metrics for grounded multimodal generation. 
We also discuss the multifaceted interrelations among visual grounding, multimodal chain-of-thought, and reasoning in VLMs. Finally, we analyse the challenges inherent to visual grounding and suggest promising directions for future research.[110] Immunizing Images from Text to Image Editing via Adversarial Cross-Attention
Matteo Trippodo,Federico Becattini,Lorenzo Seidenari
Main category: cs.CV
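TL;DR: Proposes Attention Attack, a new adversarial attack that weakens text-driven image editing by disrupting the cross-attention between the text prompt and the visual representation of the image.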
Details
Motivation: Text-driven image editing methods are susceptible to adversarial attacks, which calls for an effective attack strategy that requires no knowledge of the editing method or the editing prompt. Method: Introduces Attention Attack, which uses an automatically generated caption of the image as a proxy for the edit prompt to disrupt the cross-attention between the textual prompt and the visual representation of the image. Result: Experiments on the TEDBench++ benchmark show that the attack significantly degrades editing performance while remaining imperceptible. Conclusion: Attention Attack is a new attack targeting the visual component of text-driven image editing methods; it weakens editing by breaking the alignment between the image content and its textual description. Abstract: Recent advances in text-based image editing have enabled fine-grained manipulation of visual content guided by natural language. However, such methods are susceptible to adversarial attacks. In this work, we propose a novel attack that targets the visual component of editing methods. We introduce Attention Attack, which disrupts the cross-attention between a textual prompt and the visual representation of the image by using an automatically generated caption of the source image as a proxy for the edit prompt. This breaks the alignment between the contents of the image and their textual description, without requiring knowledge of the editing method or the editing prompt. Reflecting on the reliability of existing metrics for immunization success, we propose two novel evaluation strategies: Caption Similarity, which quantifies semantic consistency between original and adversarial edits, and semantic Intersection over Union (IoU), which measures spatial layout disruption via segmentation masks. Experiments conducted on the TEDBench++ benchmark demonstrate that our attack significantly degrades editing performance while remaining imperceptible.[111] Efficient Learned Image Compression Through Knowledge Distillation
Fabien Allemand,Attilio Fiandrotti,Sumanta Chaudhuri,Alaa Eddine Mazouz
Main category: cs.CV
TL;DR: This study explores the use of knowledge distillation to reduce the resource requirements of neural networks in image compression while maintaining performance, with potential for future research in different models and loss functions.
Details
Motivation: Neural network-based compression methods have shown to outperform conventional codecs but require significant processing power, making them unsuitable for real-time use on resource-constrained platforms. This hinders their deployment in mainstream applications. Method: The study uses knowledge distillation, a training paradigm where smaller neural networks are trained on the outputs of larger, more complex models, to reduce the resource requirements of neural networks used for image compression. Result: The study demonstrates that knowledge distillation can achieve better performance in image compression tasks, achieving different image quality/bit rate tradeoffs and saving processing and energy resources. Conclusion: Knowledge distillation can be effectively applied to image compression tasks and has potential for future research and application in transformer-based models. Abstract: Learned image compression sits at the intersection of machine learning and image processing. With advances in deep learning, neural network-based compression methods have emerged. In this process, an encoder maps the image to a low-dimensional latent space, which is then quantized, entropy-coded into a binary bitstream, and transmitted to the receiver. At the receiver end, the bitstream is entropy-decoded, and a decoder reconstructs an approximation of the original image. Recent research suggests that these models consistently outperform conventional codecs. However, they require significant processing power, making them unsuitable for real-time use on resource-constrained platforms, which hinders their deployment in mainstream applications. This study aims to reduce the resource requirements of neural networks used for image compression by leveraging knowledge distillation, a training paradigm where smaller neural networks, partially trained on the outputs of larger, more complex models, can achieve better performance than when trained independently. 
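The phrase "partially trained on the outputs of larger, more complex models" can be made concrete with a minimal loss sketch. The weighting scheme, the use of mean squared error for both terms, and the names `distillation_loss` and `alpha` are assumptions for illustration; the abstract does not state the actual loss used in the study.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_out: Sequence[float],
    target: Sequence[float],
    teacher_out: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Blend the usual reconstruction error with an imitation term.

    alpha = 1.0 trains the student on the ground truth only;
    alpha = 0.0 trains it purely to mimic the teacher's reconstruction.
    """
    return alpha * mse(student_out, target) + (1.0 - alpha) * mse(student_out, teacher_out)


# Hypothetical two-pixel example: the student matches the target exactly
# but differs from the teacher by 1.0 on each pixel.
demo_loss = distillation_loss([1.0, 2.0], [1.0, 2.0], [2.0, 3.0], alpha=0.5)
```

A rate term, as used in learned compression objectives, would be added alongside these distortion terms; it is omitted here to keep the sketch short.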
Our work demonstrates that knowledge distillation can be effectively applied to image compression tasks: i) across various architecture sizes, ii) to achieve different image quality/bit rate tradeoffs, and iii) to save processing and energy resources. This approach introduces new settings and hyperparameters, and future research could explore the impact of different teacher models, as well as alternative loss functions. Knowledge distillation could also be extended to transformer-based models. The code is publicly available at: https://github.com/FABallemand/PRIM.[112] Ordinality of Visible-Thermal Image Intensities for Intrinsic Image Decomposition
Zeqing Leo Yuan,Mani Ramanagopal,Aswin C. Sankaranarayanan,Srinivasa G. Narasimhan
Main category: cs.CV
TL;DR: This paper proposes a training-free method for intrinsic image decomposition using visible and thermal images, achieving better performance than existing learning-based models.
Details
Motivation: The lack of extensive ground-truth data for intrinsic image decomposition motivates a new approach that avoids reliance on synthetic data or sparse annotations. Method: A self-supervised neural network leveraging the relationship between visible and thermal image intensities to infer ordinalities of shading and reflectance. Result: Quantitative evaluations show superior performance under both natural and artificial lighting, with promising qualitative results across outdoor scenes. Conclusion: The proposed training-free approach using visible and thermal images outperforms recent learning-based models in intrinsic image decomposition and provides a scalable method for real-world ordinal supervision. Abstract: Decomposing an image into its intrinsic photometric factors--shading and reflectance--is a long-standing challenge due to the lack of extensive ground-truth data for real-world scenes. Recent methods rely on synthetic data or sparse annotations for limited indoor and even fewer outdoor scenes. We introduce a novel training-free approach for intrinsic image decomposition using only a pair of visible and thermal images. We leverage the principle that light not reflected from an opaque surface is absorbed and detected as heat by a thermal camera. This allows us to relate the ordinalities between visible and thermal image intensities to the ordinalities of shading and reflectance, which can densely self-supervise an optimizing neural network to recover shading and reflectance. We perform quantitative evaluations with known reflectance and shading under natural and artificial lighting, and qualitative experiments across diverse outdoor scenes. The results demonstrate superior performance over recent learning-based models and point toward a scalable path to curating real-world ordinal supervision, previously infeasible via manual labeling.[113] Compressed Video Quality Enhancement: Classifying and Benchmarking over Standards
Xiem HoangVan,Dang BuiDinh,Sang NguyenQuang,Wen-Hsiao Peng
Main category: cs.CV
TL;DR: This paper presents a comprehensive review of compressed video quality enhancement (CVQE), introducing a novel taxonomy, a benchmarking framework, and systematic analysis of trade-offs to improve research assessment and model selection.
Details
Motivation: The motivation is to address the limitations in existing CVQE surveys, including lack of systematic classification, insufficient comparative analysis, and underdeveloped benchmarking practices. Method: The paper proposes a novel taxonomy, a unified benchmarking framework, and a systematic analysis of trade-offs in CVQE methods using deep learning. Result: The paper delivers a comprehensive review with three key contributions: a novel taxonomy, a benchmarking framework, and an analysis of performance-complexity trade-offs in CVQE methods. Conclusion: The paper concludes that systematic classification, benchmarking, and analysis of CVQE methods are essential for informed model selection and future research directions. Abstract: Compressed video quality enhancement (CVQE) is crucial for improving user experience with lossy video codecs like H.264/AVC, H.265/HEVC, and H.266/VVC. While deep learning-based CVQE has driven significant progress, existing surveys still suffer from limitations: lack of systematic classification linking methods to specific standards and artifacts, insufficient comparative analysis of architectural paradigms across coding types, and underdeveloped benchmarking practices. To address these gaps, this paper presents three key contributions. First, it introduces a novel taxonomy classifying CVQE methods across architectural paradigms, coding standards, and compressed-domain feature utilization. Second, it proposes a unified benchmarking framework integrating modern compression protocols and standard test sequences for fair multi-criteria evaluation. Third, it provides a systematic analysis of the critical trade-offs between reconstruction performance and computational complexity observed in state-of-the-art methods and highlights promising directions for future research. 
This comprehensive review aims to establish a foundation for consistent assessment and informed model selection in CVQE research and deployment.[114] Multimodal SAM-adapter for Semantic Segmentation
Iacopo Curti,Pierluigi Zama Ramirez,Alioscia Petrelli,Luigi Di Stefano
Main category: cs.CV
TL;DR: This paper introduces MM SAM-adapter, a novel framework for multimodal semantic segmentation that enhances the Segment Anything Model (SAM) by integrating auxiliary sensor data, achieving robust performance across various challenging conditions.
Details
Motivation: Semantic segmentation, although advanced with deep learning, remains vulnerable to challenging conditions like poor lighting, occlusions, and adverse weather. This motivates the development of multimodal methods that use auxiliary sensor data to improve robustness. Method: The authors propose MM SAM-adapter, which uses an adapter network to inject fused multimodal features into SAM's RGB features. This approach allows the model to maintain the generalization ability of RGB features while incorporating auxiliary modalities when beneficial. Result: MM SAM-adapter achieves state-of-the-art performance on three challenging benchmarks: DeLiVER, FMB, and MUSES. The model shows consistent improvements in both favorable and adverse conditions, validating the effectiveness of the proposed multimodal adaptation strategy. Conclusion: The paper concludes that the MM SAM-adapter framework effectively enhances the robustness of semantic segmentation by integrating multimodal features into the Segment Anything Model (SAM), demonstrating state-of-the-art performance on multiple benchmarks under both favorable and adverse conditions. Abstract: Semantic segmentation, a key task in computer vision with broad applications in autonomous driving, medical imaging, and robotics, has advanced substantially with deep learning. Nevertheless, current approaches remain vulnerable to challenging conditions such as poor lighting, occlusions, and adverse weather. To address these limitations, multimodal methods that integrate auxiliary sensor data (e.g., LiDAR, infrared) have recently emerged, providing complementary information that enhances robustness. In this work, we present MM SAM-adapter, a novel framework that extends the capabilities of the Segment Anything Model (SAM) for multimodal semantic segmentation. The proposed method employs an adapter network that injects fused multimodal features into SAM's rich RGB features. This design enables the model to retain the strong generalization ability of RGB features while selectively incorporating auxiliary modalities only when they contribute additional cues. As a result, MM SAM-adapter achieves a balanced and efficient use of multimodal information. We evaluate our approach on three challenging benchmarks, DeLiVER, FMB, and MUSES, where MM SAM-adapter delivers state-of-the-art performance. To further analyze modality contributions, we partition DeLiVER and FMB into RGB-easy and RGB-hard subsets. Results consistently demonstrate that our framework outperforms competing methods in both favorable and adverse conditions, highlighting the effectiveness of multimodal adaptation for robust scene understanding. The code is available at the following link: https://github.com/iacopo97/Multimodal-SAM-Adapter.
[115] InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
Tao Han,Wanghan Xu,Junchao Gong,Xiaoyu Yue,Song Guo,Luping Zhou,Lei Bai
Main category: cs.CV
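TL;DR: InfGen is a new image generation method that produces images at arbitrary resolution without retraining the diffusion model, significantly reducing the time needed to generate 4K images.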
Details
Motivation: The computational demand of current diffusion models rises sharply as resolution increases, making 4K image generation too slow. Method: The authors propose InfGen, a new image generation method that replaces the VAE decoder with a new one-step generator, so images can be generated at arbitrary resolution without retraining the diffusion model. Result: Experiments show that InfGen simplifies the process, reduces computational complexity, and can be applied to any model that uses the same latent space. Conclusion: InfGen can bring existing models into the arbitrary high-resolution era and significantly reduces 4K image generation time. Abstract: Arbitrary resolution image generation provides a consistent visual experience across devices, having extensive applications for producers and consumers. Current diffusion models increase computational demand quadratically with resolution, causing 4K image generation delays over 100 seconds. To solve this, we explore the second generation upon the latent diffusion models, where the fixed latent generated by diffusion models is regarded as the content representation and we propose to decode arbitrary resolution images with a compact generated latent using a one-step generator. Thus, we present InfGen, replacing the VAE decoder with the new generator, for generating images at any resolution from a fixed-size latent without retraining the diffusion models, which simplifies the process, reduces computational complexity, and can be applied to any model using the same latent space. Experiments show InfGen is capable of improving many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds.
[116] SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimer's Prediction Tasks and Datasets
Emily Kaczmarek,Justin Szeto,Brennan Nichyporuk,Tal Arbel
Main category: cs.CV
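TL;DR: This study applies temporal self-supervised learning (SSL) to Alzheimer's disease (AD) prediction tasks, addressing three weaknesses of deep learning models: scarce labelled data, poor generalization across datasets, and poor adaptability to varying numbers of input scans and time intervals between them. The model shows strong performance across multiple AD prediction tasks.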