Table of Contents
cs.CL
[1] Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
Pranav Bhandari,Nicolas Fay,Sanjeevan Selvaganapathy,Amitava Datta,Usman Naseem,Mehwish Nasim
Main category: cs.CL
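TL;DR: Proposes a new method that extracts hidden states from transformer models using the Big Five personality traits and applies low-rank subspace discovery to identify the optimal layers across architectures, enabling precise control of personality traits in large language models.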
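The two steps named in this entry, low-rank subspace discovery and activation steering, can be illustrated with a minimal NumPy sketch. It assumes paired activations for high-trait and low-trait prompts taken at one layer and uses an SVD of their differences. The function names, the pairing, and the additive update are assumptions made for illustration, not the authors' pipeline.

```python
import numpy as np

def trait_directions(pos_acts, neg_acts, rank=1):
    """Return `rank` orthonormal directions separating paired activations.

    pos_acts, neg_acts: arrays of shape (n_prompts, hidden_dim), assumed to be
    hidden states for high-trait and low-trait versions of the same prompts.
    """
    diffs = pos_acts - neg_acts
    # The leading right-singular vectors span a low-rank subspace of the differences.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:rank]

def steer(hidden, directions, alpha):
    """Perturb a hidden state along the trait directions with strength alpha."""
    return hidden + alpha * directions.sum(axis=0)
```

In practice the layer at which `steer` is applied matters. The entry reports selecting it per trait and per architecture, which this sketch leaves to the caller.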
Details
Motivation: Large language models exhibit implicit personalities during generation, but how to reliably control or align these traits remains an open problem. Existing work lacks effective mechanisms for behavioural control, and the relationship between psychological constructs and the models' internal representations is still unclear.
Method: Proposes a new pipeline that extracts hidden-state activations from transformer layers using the Big Five personality traits (OCEAN), applies low-rank subspace discovery, identifies trait-specific optimal layers across model architectures, and steers behaviour through a flexible framework with dynamic layer selection.
Result: Personality traits are found to occupy a low-rank shared subspace. These latent structures can be turned into effective steering mechanisms through careful perturbations, giving precise control over trait expression in model outputs without affecting fluency, diversity, or general capabilities.
Conclusion: The work bridges the gap between psychological theory and practical model alignment and offers a feasible path towards steerable, personality-aware large language models.
Abstract: Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. The lack of effective mechanisms for manipulating model behaviour during generation is a critical gap in the literature. Personality-aware LLMs offer a promising direction towards this objective. However, the relationship between these psychological constructs and their representations within LLMs remains underexplored and requires further investigation, as does the use of these representations to steer model behaviour. We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), a comprehensive and empirically validated framework for modelling human personality, applies low-rank subspace discovery methods, and identifies trait-specific optimal layers across different model architectures for robust injection. The resulting personality-aligned directions are then operationalised through a flexible steering framework with dynamic layer selection, enabling precise control of trait expression in LLM outputs. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering through careful perturbations without impacting fluency, variance, or general capabilities, helping to bridge the gap between psychological theory and practical model alignment.
[2] TextualVerifier: Verify TextGrad Step-by-Step
Eugenius Mario Situmorang,Adila Alfa Krisnadhi,Ari Wibisono
Main category: cs.CL
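TL;DR: This paper proposes TextualVerifier, an LLM-based verification framework that strengthens reasoning validity in TextGrad without explicit numerical gradients. Its four-stage workflow (chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation) significantly improves reasoning correctness and system performance across two evaluation phases.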
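The four-stage workflow named in this entry can be sketched in a few lines of Python. This is an illustration under stated assumptions: `rewrite_step` is a hypothetical stand-in for an LLM call, steps are split on newlines, and exact string matching decides the vote. None of these details are taken from the paper.

```python
from collections import Counter

def verify(reasoning, rewrite_step, n_variants=3):
    """Decompose, generate variants, vote, and aggregate.

    rewrite_step(step, i) is a hypothetical callable that returns the i-th
    candidate rewrite of one reasoning step.
    """
    # 1. Chain-of-thought decomposition: one step per non-empty line (assumed format).
    steps = [s.strip() for s in reasoning.split("\n") if s.strip()]
    verified = []
    for step in steps:
        # 2. Variant generation.
        variants = [rewrite_step(step, i) for i in range(n_variants)]
        # 3. Majority voting over identical variants.
        winner, _ = Counter(variants).most_common(1)[0]
        verified.append(winner)
    # 4. Consensus aggregation: stitch the winning steps back together.
    return "\n".join(verified)
```

A real system would need a softer notion of agreement than exact string equality, since LLM rewrites of the same step rarely match character for character.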
Details
Motivation: TextGrad lacks a self-verification mechanism to ensure reasoning validity in text-based decisions, which limits its reliability in complex AI systems. A method that achieves self-verification without numerical gradients is therefore needed to make text-based optimization systems more trustworthy.
Method: Proposes the TextualVerifier framework, which uses chain-of-thought reasoning and LLM majority voting to build a four-stage verification workflow (decomposition, generation, voting, aggregation), integrated non-invasively into TextGrad at the loss function and optimization result verification stages.
Result: In standalone evaluation on PRM800K, reasoning-step validity improves by 29%. Integrated with TextGrad, it yields improvements of 8.08, 10.71, and 3.92 percentage points on GPQA-Diamond, MMLU-ML, and MMLU-CP respectively. Loss function verification raises accuracy from 68.2% to 70.4% at an average cost of 5.9 additional LLM calls, and the results are statistically significant (p < 0.001).
Conclusion: TextualVerifier is the first LLM-based self-verification framework designed for TextGrad. It improves reasoning reliability in text-based optimization without relying on numerical gradients and opens a new direction for verification in text-driven systems.
Abstract: TextGrad is a novel approach to text-based automatic differentiation that enables composite AI systems to perform optimization without explicit numerical equations. However, it currently lacks self-verification mechanisms that ensure reasoning validity in text-based decision making. This research introduces TextualVerifier, a verification framework that leverages chain-of-thought reasoning and majority voting with large language models to address this verification gap. TextualVerifier implements a four-stage workflow: chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. It integrates non-invasively with TextGrad at both the loss function and optimization result verification stages. Experimental evaluation using the Gemini 1.5 Pro model is conducted in two phases: (1) standalone evaluation on PRM800K, and (2) integrated evaluation with TextGrad on GPQA-Diamond, MMLU-ML, and MMLU-CP benchmarks. Results show statistically significant improvements (p < 0.001). In phase one, TextualVerifier improves the validity of reasoning steps by 29 percent. In phase two, integration into the TextGrad loss function yields a 2.2 percentage point gain from 68.2 to 70.4 percent with a moderate overhead of 5.9 LLM calls on average. Further evaluations of TextualVerifier versioning yield 8.08, 10.71, and 3.92 percentage point improvements on GPQA, MMLU-ML, and MMLU-CP respectively. TextualVerifier thus presents the first self-verification framework for TextGrad through LLM-based techniques without requiring numerical gradients, enabling more reliable reasoning and opening new directions for verification in text-based optimization.
[3] GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation
Stergios Chatzikyriakidis,Dimitris Papadakis,Sevasti-Ioanna Papaioannou,Erofili Psaltaki
Main category: cs.CL
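TL;DR: This paper presents an extended Greek dialectal dataset (GRDD+) that adds several Greek varieties, reaching 6.37 million words across 10 varieties, the largest and most diverse dataset of its kind to date. It also evaluates the effect of fine-tuning several large language models on high-quality dialectal data.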
Details
Motivation: Existing Greek dialect datasets are limited in coverage and size and under-represent some important varieties, so a more comprehensive, larger dataset is needed to support dialect processing and language model research.
Method: Builds GRDD+ by adding more data for Cretan, Cypriot, Pontic, and Northern Greek and introducing six new varieties (such as Greco-Corsican and Griko); then fine-tunes three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compares them with frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).
Result: GRDD+ is currently the largest Greek dialect dataset with the most varieties. The fine-tuning experiments indicate that high-quality dialectal data significantly improves language model performance on dialect tasks.
Conclusion: GRDD+ is an important resource for Greek dialect research. It shows that large-scale, diverse dialectal data plays a key role in improving language model performance and advances the modelling of low-resource language varieties.
Abstract: We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with a total size of 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to see the effect of good quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).
[4] PLLuM: A Family of Polish Large Language Models
Jan Kocoń,Maciej Piasecki,Arkadiusz Janz,Teddy Ferdinan,Łukasz Radliński,Bartłomiej Koptyra,Marcin Oleksy,Stanisław Woźniak,Paweł Walkowiak,Konrad Wojtasik,Julia Moska,Tomasz Naskręt,Bartosz Walkowiak,Mateusz Gniewkowski,Kamil Szyc,Dawid Motyka,Dawid Banach,Jonatan Dalasiński,Ewa Rudnicka,Bartłomiej Alberski,Tomasz Walkowiak,Aleksander Szczęsny,Maciej Markiewicz,Tomasz Bernaś,Hubert Mazur,Kamil Żyta,Mateusz Tykierko,Grzegorz Chodak,Tomasz Kajdanowicz,Przemysław Kazienko,Agnieszka Karlińska,Karolina Seweryn,Anna Kołos,Maciej Chrabąszcz,Katarzyna Lorenc,Aleksandra Krasnodębska,Artur Wilczek,Katarzyna Dziewulska,Paula Betscher,Zofia Cieślińska,Katarzyna Kowol,Daria Mikoś,Maciej Trzciński,Dawid Krutul,Marek Kozłowski,Sławomir Dadas,Rafał Poświata,Michał Perełkiewicz,Małgorzata Grębowiec,Maciej Kazuła,Marcin Białas,Roman Roszko,Danuta Roszko,Jurgita Vaičenonienė,Andrius Utka,Paweł Levchuk,Paweł Kowalski,Irena Prawdzic-Jankowska,Maciej Ogrodniczuk,Monika Borys,Anna Bulińska,Wiktoria Gumienna,Witold Kieraś,Dorota Komosińska,Katarzyna Krasnowska-Kieraś,Łukasz Kobyliński,Martyna Lewandowska,Marek Łaziński,Mikołaj Łątkowski,Dawid Mastalerz,Beata Milewicz,Agnieszka Anna Mykowiecka,Angelika Peljak-Łapińska,Sandra Penno,Zuzanna Przybysz,Michał Rudolf,Piotr Rybak,Karolina Saputa,Aleksandra Tomaszewska,Aleksander Wawer,Marcin Woliński,Joanna Wołoszyn,Alina Wróblewska,Bartosz Żuk,Filip Żarnecki,Konrad Kaczyński,Anna Cichosz,Zuzanna Deckert,Monika Garnys,Izabela Grabarczyk,Wojciech Janowski,Sylwia Karasińska,Aleksandra Kujawiak,Piotr Misztela,Maria Szymańska,Karolina Walkusz,Igor Siek,Jakub Kwiatkowski,Piotr Pęzik
Main category: cs.CL
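TL;DR: PLLuM is Poland's first large-scale open-source family of language models focused on Polish. It includes a 140-billion-token pre-training corpus and custom instruction and preference datasets, and integrates a Responsible AI framework to ensure safety and compliance.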
Details
Motivation: Existing large language models are mainly English-centric and offer limited support for other languages. In particular, there is a lack of high-quality, transparent Polish models that meet local cultural needs.
Method: Builds a 140-billion-token Polish text corpus, develops a custom dataset of 77k instructions and a preference optimization dataset of 100k examples, trains both base and instruction-tuned models, and introduces a Responsible AI framework that covers data governance and output safety filtering.
Result: The PLLuM model family was trained successfully, shows practical utility in a downstream public administration task, and is released publicly to foster open research and sovereign AI technology in Poland.
Conclusion: PLLuM fills a gap in non-English large models, provides a high-quality, trustworthy language model foundation for Polish, and advances localized and sovereign AI.
Abstract: Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.
[5] STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models
Mohammad Atif Quamar,Mohammad Areeb,Mikhail Kuznetsov,Muslum Ozgur Ozmen,Z. Berkay Celik
Main category: cs.CL
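TL;DR: Proposes STARS, an efficient decoding-time algorithm that aligns large language models with human values by sampling, scoring, and rejecting or accepting short token segments, outperforming SFT and DPO.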
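The sample, score, and accept/reject loop described in this entry can be sketched as follows. This is a simplified illustration, not the STARS algorithm itself: the hard `threshold` acceptance rule, the `max_tries` cap, and the fallback to the best proposal seen are all assumptions, and `sample_segment` and `reward` are hypothetical stand-ins for the language model and the reward model.

```python
def stars_decode(sample_segment, reward, threshold, n_segments, max_tries=8):
    """Build a response segment by segment with accept/reject steps."""
    tokens = []
    for _ in range(n_segments):
        best_seg, best_score = None, float("-inf")
        for _ in range(max_tries):
            seg = sample_segment(tokens)       # propose a short segment
            score = reward(tokens + seg)       # score the prefix plus the proposal
            if score > best_score:
                best_seg, best_score = seg, score
            if score >= threshold:             # accept and move to the next segment
                break
        tokens = tokens + best_seg             # otherwise keep the best proposal seen
    return tokens
```

Because a poor segment is rejected before anything is generated after it, the cost of a bad sample is one segment rather than one full response, which is the efficiency argument made against Best-of-N.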
Details
Motivation: Existing alignment methods such as fine-tuning are computationally expensive, while inference-time methods such as Best-of-N require too much computation to be practical.
Method: Proposes the STARS algorithm, which performs reward-guided sampling and filtering segment by segment during generation, allowing early correction of the generation path.
Result: Across six large models, STARS improves win-rates by up to 14.9 percentage points over SFT and up to 4.3 percentage points over DPO, and is competitive with strong baselines.
Conclusion: STARS offers a generalizable, robust, and efficient approach to LLM alignment and is an effective alternative to traditional fine-tuning and full-sequence ranking methods.
Abstract: Aligning large language models with human values is crucial for their safe deployment; however, existing methods, such as fine-tuning, are computationally expensive and suboptimal. In contrast, inference-time approaches like Best-of-N sampling require practically infeasible computation to achieve optimal alignment. We propose STARS: Segment-level Token Alignment with Rejection Sampling, a decoding-time algorithm that steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments. This allows for early correction of the generation path, significantly improving computational efficiency and boosting alignment quality. Across a suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates, while remaining highly competitive with strong Best-of-N baselines. Our work establishes granular, reward-guided sampling as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.
[6] Divide, Cache, Conquer: Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification
Mikołaj Langner,Jan Eliasz,Ewa Rudnicka,Jan Kocoń
Main category: cs.CL
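TL;DR: Proposes an efficient multi-label text classification method based on binary decisions: the task is decomposed into independent yes/no queries and combined with prefix caching, which substantially improves inference efficiency without loss of accuracy.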
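The decomposition described in this entry can be sketched as one yes/no prompt per label, all sharing an identical prefix so that an inference server with prefix caching can reuse the work done on the text. The prompt wording and the `ask` callable are hypothetical; the sketch only shows the shape of the queries, not the authors' prompts.

```python
def dichotomic_prompts(text, labels):
    """Build one yes/no prompt per label, all sharing the same prefix."""
    # The prefix is identical for every label, which is what makes it cacheable.
    prefix = f"Text: {text}\nAnswer yes or no.\n"
    return prefix, [f"{prefix}Does the text express {label}?" for label in labels]

def classify(text, labels, ask):
    """Query each label independently; `ask` is a hypothetical LLM call."""
    _, prompts = dichotomic_prompts(text, labels)
    return {
        label: ask(prompt).strip().lower().startswith("yes")
        for label, prompt in zip(labels, prompts)
    }
```

The label-specific question is placed last on purpose: anything that differs between queries has to come after the shared part for a prefix cache to help.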
Details
Motivation: Conventional multi-label classification with large language models is inefficient and struggles with large label sets. A more efficient, scalable method is needed to improve short-text inference.
Method: Reformulates multi-label classification as a series of binary (yes/no) decisions, querying each label dimension independently and using prefix caching to optimize inference. Applies LLM-to-SLM knowledge distillation: a large model (DeepSeek-V3) generates multiple annotations, which are used to train smaller models (such as HerBERT and Gemma3).
Result: The method is validated on 24 affective dimensions. The distilled small models significantly outperform zero-shot baselines, especially on dimensions seen during training, and inference efficiency improves substantially while accuracy stays high.
Conclusion: Decomposing multi-label classification into binary queries, combined with knowledge distillation and cache-aware inference, yields a scalable and efficient framework for LLM-based classification with potential for cross-domain use.
Abstract: We introduce a method for efficient multi-label text classification with large language models (LLMs), built on reformulating classification tasks as sequences of dichotomic (yes/no) decisions. Instead of generating all labels in a single structured response, each target dimension is queried independently, which, combined with a prefix caching mechanism, yields substantial efficiency gains for short-text inference without loss of accuracy. To demonstrate the approach, we focus on affective text analysis, covering 24 dimensions including emotions and sentiment. Using LLM-to-SLM distillation, a powerful annotator model (DeepSeek-V3) provides multiple annotations per text, which are aggregated to fine-tune smaller models (HerBERT-Large, CLARIN-1B, PLLuM-8B, Gemma3-1B). The fine-tuned models show significant improvements over zero-shot baselines, particularly on the dimensions seen during training. Our findings suggest that decomposing multi-label classification into dichotomic queries, combined with distillation and cache-aware inference, offers a scalable and effective framework for LLM-based classification. While we validate the method on affective states, the approach is general and applicable across domains.
[7] Evaluating Machine Translation Datasets for Low-Web Data Languages: A Gendered Lens
Hellina Hailu Nigatu,Bethelhem Yemane Mamo,Bontu Fufa Balcha,Debora Taye Tesfaye,Elbethel Daniel Zewdie,Ikram Behiru Nesiru,Jitu Ewnetu Hailu,Senait Mengesha Yayo
Main category: cs.CL
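TL;DR: This paper examines the quality of machine translation datasets for three low-resource languages (Afan Oromo, Amharic, and Tigrinya) with a focus on gender representation. It finds a clear male skew and harmful, toxic depictions of women, showing that a large amount of data does not imply high quality.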
Details
Motivation: In bringing low-resource languages into NLP research, the current emphasis on data scale over quality may lead to poorly performing technology and the spread of societal biases, so the quality of existing datasets needs to be assessed.
Method: Analyses machine translation training and benchmark datasets for three low-resource languages, systematically evaluating domain distribution and gender representation (names of persons, grammatical gender of verbs, and stereotypes in the text).
Result: Training data comes largely from political and religious domains, whereas benchmark datasets focus on news, health, and sports. The datasets show a widespread male skew and contain harmful and toxic content about women, and these problems are more severe for the language with the most data.
Conclusion: Data quality is crucial for NLP in low-resource languages. Pursuing scale alone can have negative effects, and bias and harmful content in the data should be identified and mitigated early.
Abstract: As low-resourced languages are increasingly incorporated into NLP research, there is an emphasis on collecting large-scale datasets. But in prioritizing quantity over quality, we risk 1) building language technologies that perform poorly for these languages and 2) producing harmful content that perpetuates societal biases. In this paper, we investigate the quality of Machine Translation (MT) datasets for three low-resourced languages (Afan Oromo, Amharic, and Tigrinya), with a focus on the gender representation in the datasets. Our findings demonstrate that while training data has a large representation of political and religious domain text, benchmark datasets are focused on news, health, and sports. We also found a large skew towards the male gender in names of persons, the grammatical gender of verbs, and stereotypical depictions in the datasets. Further, we found harmful and toxic depictions against women, which were more prominent for the language with the largest amount of data, underscoring that quantity does not guarantee quality. We hope that our work inspires further inquiry into the datasets collected for low-resourced languages and prompts early mitigation of harmful content. WARNING: This paper contains discussion of NSFW content that some may find disturbing.
[8] GRAD: Graph-Retrieved Adaptive Decoding for Hallucination Mitigation
Manh Nguyen,Sunil Gupta,Dai Do,Hung Le
Main category: cs.CL
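TL;DR: This paper proposes Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that builds a sparse token transition graph and uses statistical evidence from a corpus to reduce hallucination in large language models. It needs no retraining and is lightweight and plug-and-play.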
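The fusion step in this entry, where graph-retrieved logits are max-normalized and combined with the model's logits, can be sketched with NumPy. The sketch makes two loudly stated assumptions: the graph is a dictionary from a previous token id to accumulated next-token logits, and fusion is a simple addition with a fixed weight `alpha`, whereas the paper describes an adaptive fusion whose exact form is not given here.

```python
import numpy as np

def build_graph(prev_tokens, next_logits, vocab_size):
    """Accumulate next-token logits per previous token (assumed graph layout)."""
    graph = {}
    for tok, logits in zip(prev_tokens, next_logits):
        graph[tok] = graph.get(tok, np.zeros(vocab_size)) + logits
    return graph

def fuse_logits(model_logits, graph_logits, alpha):
    """Max-normalize the graph evidence and add it to the model logits."""
    peak = np.max(np.abs(graph_logits))
    graph_norm = graph_logits / peak if peak > 0 else graph_logits
    # Fixed-weight additive fusion: an illustrative stand-in for adaptive fusion.
    return model_logits + alpha * graph_norm
```

Max-normalization keeps the graph term within a bounded range, so `alpha` directly controls how far corpus evidence can move the model's own ranking.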
Details
Motivation: Existing hallucination mitigation methods rely on external knowledge sources and are either fragile or incur high retrieval costs, so an efficient way to control generation that does not depend on symbolic knowledge graphs is needed.
Method: GRAD accumulates next-token logits over a small retrieved corpus in a single forward pass to build a sparse token transition graph. During decoding, the logits retrieved from the graph are adaptively fused with the model's original logits to favour high-evidence continuations while preserving fluency.
Result: Across three models and a range of question-answering benchmarks (covering intrinsic and extrinsic hallucination and factuality tasks), GRAD achieves up to 9.7% higher intrinsic accuracy, 8.6% lower hallucination rates, and 6.9% greater correctness than greedy decoding, and attains the highest truth-informativeness product score among all methods.
Conclusion: GRAD is a lightweight, pluggable method which shows that statistical evidence from corpus-level token transitions can steer generation towards more truthful, verifiable outputs. It is presented as an alternative to contrastive decoding and knowledge graph augmentation.
Abstract: Hallucination mitigation remains a persistent challenge for large language models (LLMs), even as model scales grow. Existing approaches often rely on external knowledge sources, such as structured databases or knowledge graphs, accessed through prompting or retrieval. However, prompt-based grounding is fragile and domain-sensitive, while symbolic knowledge integration incurs heavy retrieval and formatting costs. Motivated by knowledge graphs, we introduce Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that grounds generation in corpus-derived evidence without retraining. GRAD constructs a sparse token transition graph by accumulating next-token logits across a small retrieved corpus in a single forward pass. During decoding, graph-retrieved logits are max-normalized and adaptively fused with model logits to favor high-evidence continuations while preserving fluency. Across three models and a range of question-answering benchmarks spanning intrinsic, extrinsic hallucination, and factuality tasks, GRAD consistently surpasses baselines, achieving up to 9.7% higher intrinsic accuracy, 8.6% lower hallucination rates, and 6.9% greater correctness compared to greedy decoding, while attaining the highest truth-informativeness product score among all methods. GRAD offers a lightweight, plug-and-play alternative to contrastive decoding and knowledge graph augmentation, demonstrating that statistical evidence from corpus-level token transitions can effectively steer generation toward more truthful and verifiable outputs.
[9] Context informs pragmatic interpretation in vision-language models
Alvin Wei Ming Tan,Ben Prystawski,Veronica Boyce,Michael C. Frank
Main category: cs.CL
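TL;DR: Studies how humans and vision-language models perform in iterated reference games. Relevant context substantially improves model performance, but few-shot tasks with abstract referents remain challenging for models.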
Details
Motivation: To explore agents' ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments.
Method: Tests humans and vision-language models on iterated reference games while varying the amount, order, and relevance of the context.
Result: Without relevant context, models perform above chance but far worse than humans. With relevant context, model performance improves markedly over trials.
Conclusion: Relevant context is crucial to model performance, but current models are still limited on few-shot tasks with abstract referents.
Abstract: Iterated reference games, in which players repeatedly pick out novel referents using language, present a test case for agents' ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. We tested humans and vision-language models on trials from iterated reference games, varying the given context in terms of amount, order, and relevance. Without relevant context, models were above chance but substantially worse than humans. However, with relevant context, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain a difficult task for machine learning models.
[10] The Human Flourishing Geographic Index: A County-Level Dataset for the United States, 2013-2023
Stefano M. Iacus,Devika Jain,Andrea Nasuto,Giuseppe Porro,Marcello Carammia,Andrea Vezzulli
Main category: cs.CL
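TL;DR: Introduces the Human Flourishing Geographic Index (HFGI), built from 2.6 billion U.S. tweets, using fine-tuned large language models to analyse 48 indicators related to human flourishing and providing a well-being measure with high spatial and temporal resolution.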
Details
Motivation: Existing measures of human flourishing often lack sufficient temporal and spatial resolution and rely mostly on traditional survey data, making it hard to capture the dynamics of societal well-being.
Method: Analyses about 2.6 billion geolocated U.S. tweets from 2013 to 2023, using fine-tuned large language models to classify 48 indicators aligned with Harvard's Global Flourishing Study framework, adds measures of attitudes towards migration and perception of corruption, and produces monthly and yearly indicators of flourishing-related discourse at county and state level.
Result: Builds the Human Flourishing Geographic Index (HFGI). Validation indicates that the index accurately reflects the underlying constructs and shows the expected correlations with established indicators, yielding a high-resolution dataset of flourishing discourse on U.S. social media over the past decade.
Conclusion: HFGI provides a new multidimensional, high-resolution tool for studying well-being, inequality, and social change, and extends the use of large-scale social media data to understand human flourishing.
Abstract: Quantifying human flourishing, a multidimensional construct including happiness, health, purpose, virtue, relationships, and financial stability, is critical for understanding societal well-being beyond economic indicators. Existing measures often lack fine spatial and temporal resolution. Here we introduce the Human Flourishing Geographic Index (HFGI), derived from analyzing approximately 2.6 billion geolocated U.S. tweets (2013-2023) using fine-tuned large language models to classify expressions across 48 indicators aligned with Harvard's Global Flourishing Study framework plus attitudes towards migration and perception of corruption. The dataset offers monthly and yearly county- and state-level indicators of flourishing-related discourse, validated to confirm that the measures accurately represent the underlying constructs and show expected correlations with established indicators. This resource enables multidisciplinary analyses of well-being, inequality, and social change at unprecedented resolution, offering insights into the dynamics of human flourishing as reflected in social media discourse across the United States over the past decade.
[11] Direct Semantic Communication Between Large Language Models via Vector Translation
Fu-Chun Yang,Jason Eshraghian
Main category: cs.CL
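TL;DR: Proposes a method that builds a latent semantic bridge between different large language models through vector translation, enabling cross-model semantic transfer and more efficient information exchange in multi-agent systems.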
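Two quantities reported for this entry, cosine alignment between translated and target vectors and injection at 30 percent blending strength, can be written out in NumPy. Reading "blending" as a convex combination of the target model's hidden state and the translated vector is an assumption of this sketch; the paper may define the injection differently.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def inject(target_hidden, translated, strength=0.3):
    """Blend a translated vector into the target model's hidden state.

    Assumes a convex combination; strength=0.3 mirrors the 30 percent
    blending strength mentioned in the abstract.
    """
    return (1.0 - strength) * target_hidden + strength * translated
```

With `strength=0.3` the target model's own state still dominates, which matches the entry's description of the injection as conservative.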
Details
Motivation: In conventional multi-agent systems, models pass information as tokens, which discards most latent semantics, limits the efficiency of information transfer, and adds computational overhead. A more efficient mechanism for semantic transfer is therefore needed.
Method: Designs a dual-encoder translator that learns a mapping between Llama-2-7B and Mistral-7B-Instruct, enabling direct semantic exchange between latent representation spaces, and blends the translated vectors into the target model's generation through partial injection (30% strength).
Result: Achieves an average cosine alignment of 0.538, supporting the feasibility of cross-model semantic transfer. Bidirectional evaluation shows a 2.01:1 transfer asymmetry, indicating that general-purpose models are more transferable than instruction-tuned ones.
Conclusion: Cross-model latent communication is feasible and can improve collaboration in multi-agent systems while preserving computational stability, moving towards collaborative AI systems that share meaning rather than tokens.
Abstract: In multi-agent settings, such as debate, reflection, or tool-calling, large language models (LLMs) pass messages as plain tokens, discarding most latent semantics. This constrains information transfer and adds unnecessary computational overhead. We form a latent bridge via vector translations, which use learned mappings that enable direct semantic exchange between representation spaces. A dual-encoder translator trained between Llama-2-7B and Mistral-7B-Instruct attains an average cosine alignment of 0.538. Injecting the translated vectors at 30 percent blending strength steers the target model's generation without destabilizing logits. Bidirectional evaluation shows a 2.01:1 transfer asymmetry, indicating that general-purpose models yield more transferable representations than instruction-tuned variants. This conservative injection preserves computational stability while demonstrating that cross-model latent communication is feasible, enabling collaborative AI systems that share meaning rather than tokens.
[12] Abductive Inference in Retrieval-Augmented Language Models: Generating and Validating Missing Premises
Shiyin Lin
Main category: cs.CL
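TL;DR: This paper proposes a framework that integrates abductive inference into retrieval-augmented large language models. By detecting insufficient evidence, generating candidate premises, and validating them for consistency and plausibility, it improves answer accuracy and reasoning faithfulness.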
Details
Motivation: Existing retrieval-augmented generation (RAG) systems tend to fail when the retrieved evidence is incomplete, leaving gaps in the reasoning process, so a method is needed to supply the missing premises.
Method: Proposes a framework that integrates abductive inference: it detects insufficient evidence, generates plausible missing premises, and validates them through consistency and plausibility checks.
Result: Experiments on abductive reasoning and multi-hop question-answering benchmarks show that the method improves both answer accuracy and the faithfulness of the reasoning process.
Conclusion: Abductive inference is a promising direction for improving the robustness and explainability of RAG systems.
Abstract: Large Language Models (LLMs) enhanced with retrieval, commonly referred to as Retrieval-Augmented Generation (RAG), have demonstrated strong performance in knowledge-intensive tasks. However, RAG pipelines often fail when retrieved evidence is incomplete, leaving gaps in the reasoning process. In such cases, abductive inference (the process of generating plausible missing premises to explain observations) offers a principled approach to bridge these gaps. In this paper, we propose a framework that integrates abductive inference into retrieval-augmented LLMs. Our method detects insufficient evidence, generates candidate missing premises, and validates them through consistency and plausibility checks. Experimental results on abductive reasoning and multi-hop QA benchmarks show that our approach improves both answer accuracy and reasoning faithfulness. This work highlights abductive inference as a promising direction for enhancing the robustness and explainability of RAG systems.
[13] WST: Weakly Supervised Transducer for Automatic Speech Recognition
Dongji Gao,Chenda Liao,Changliang Liu,Matthew Wiesner,Leibny Paola Garcia,Daniel Povey,Sanjeev Khudanpur,Jian Wu
Main category: cs.CL
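TL;DR: Proposes WST, a weakly supervised transducer that maintains good speech recognition performance with highly erroneous transcripts and outperforms existing CTC-based methods.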
Details
Motivation: RNN-T for end-to-end speech recognition depends on large amounts of high-quality annotated data that are costly to obtain, so a weakly supervised method that tolerates transcription errors is needed.
Method: Designs a flexible training graph that lets WST handle transcript errors robustly without confidence estimation or auxiliary pre-trained models.
Result: WST is validated on synthetic and industrial datasets. It maintains performance even at transcription error rates of up to 70% and outperforms existing methods such as BTC and OTC.
Conclusion: WST is robust and practically useful for ASR in realistic settings.
Abstract: The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.
[14] T-FIX: Text-Based Explanations with Features Interpretable to eXperts
Shreya Havaldar,Helen Jin,Chaehyeon Kim,Anton Xue,Weiqiu You,Marco Gatti,Bhuvnesh Jain,Helen Qu,Daniel A Hashimoto,Amin Madani,Rajat Deo,Sameed Ahmed M. Khatana,Gary E. Weissman,Lyle Ungar,Eric Wong
Main category: cs.CL
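TL;DR: This paper introduces T-FIX, a benchmark for evaluating how well explanations generated by large language models in knowledge-intensive domains agree with the judgment of domain experts.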
Details
Motivation: Existing evaluation methods focus mainly on the plausibility or internal faithfulness of explanations and cannot measure whether their content matches expert intuition, so a new evaluation criterion is needed to reflect how far experts endorse an explanation.
Method: In collaboration with experts from several knowledge-intensive fields, builds the T-FIX benchmark spanning seven domains and designs new metrics to quantify the alignment between LLM explanations and expert judgment.
Result: T-FIX measures how consistent LLM-generated explanations are with expert reasoning on domain content and reveals shortcomings of current models in expert alignment.
Conclusion: Expert alignment is an important criterion for assessing the quality of LLM explanations, and T-FIX provides a scalable and reliable evaluation framework for it.
Abstract: As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users expect not just answers, but also meaningful explanations for those answers. In these settings, users are often domain experts (e.g., doctors, astrophysicists, psychologists) who require explanations that reflect expert-level reasoning. However, current evaluation schemes primarily emphasize plausibility or internal faithfulness of the explanation, which fail to capture whether the content of the explanation truly aligns with expert intuition. We formalize expert alignment as a criterion for evaluating explanations with T-FIX, a benchmark spanning seven knowledge-intensive domains. In collaboration with domain experts, we develop novel metrics to measure the alignment of LLM explanations with expert judgment.
[15] Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering
Xinying Qian,Ying Zhang,Yu Zhao,Baohang Zhou,Xuhui Sui,Xiaojie Yuan
Main category: cs.CL
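TL;DR: This paper proposes PoK, a framework that combines knowledge planning with contrastive temporal retrieval to improve the reasoning accuracy and interpretability of large language models in temporal knowledge graph question answering.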
Details
Motivation: Existing methods fall short in understanding the semantics of complex time constraints, and large language models suffer from hallucination and missing knowledge, which limits their performance in temporal knowledge graph question answering.
Method: Proposes the Plan of Knowledge module, which decomposes a complex temporal question into sub-objectives, and builds a Temporal Knowledge Store (TKS) with a contrastive retrieval mechanism to retrieve semantically and temporally aligned facts. Reasoning combines structured planning with knowledge retrieval.
Result: Experiments on four benchmark datasets show that PoK significantly improves retrieval precision and reasoning accuracy, surpassing state-of-the-art methods by up to 56.0%.
Conclusion: Through structured planning and contrastive retrieval, PoK strengthens the temporal reasoning ability, interpretability, and factual consistency of large language models in temporal knowledge graph question answering.
Abstract: Temporal Knowledge Graph Question Answering (TKGQA) aims to answer time-sensitive questions by leveraging factual information from Temporal Knowledge Graphs (TKGs). While previous studies have employed pre-trained TKG embeddings or graph neural networks to inject temporal knowledge, they fail to fully understand the complex semantic information of time constraints. Recently, Large Language Models (LLMs) have shown remarkable progress, benefiting from their strong semantic understanding and reasoning generalization capabilities. However, their temporal reasoning ability remains limited. LLMs frequently suffer from hallucination and a lack of knowledge. To address these limitations, we propose the Plan of Knowledge framework with a contrastive temporal retriever, which is named PoK. Specifically, the proposed Plan of Knowledge module decomposes a complex temporal question into a sequence of sub-objectives from the pre-defined tools, serving as intermediate guidance for reasoning exploration. In parallel, we construct a Temporal Knowledge Store (TKS) with a contrastive retrieval framework, enabling the model to selectively retrieve semantically and temporally aligned facts from TKGs. By combining structured planning with temporal knowledge retrieval, PoK effectively enhances the interpretability and factual consistency of temporal reasoning. Extensive experiments on four benchmark TKGQA datasets demonstrate that PoK significantly improves the retrieval precision and reasoning accuracy of LLMs, surpassing the performance of the state-of-the-art TKGQA methods by up to 56.0%.
[16] The truth is no diaper: Human and AI-generated associations to emotional words
Špela Vintar,Jan Jona Javoršek
Main category: cs.CL
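TL;DR: Compares human and large language model associations to emotional words. The overlap is moderate, but the models' associations are more predictable, less creative, and tend to amplify emotional intensity.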
Details
Motivation: To investigate whether large language models can form word associations the way humans do, particularly for emotional words, in order to understand their creativity and their ability to simulate cognition.
Method: Compares the associative responses of human participants and large language models to emotional word cues, analysing differences in association patterns, emotional intensity, and creativity.
Result: The overlap between human and LLM associations is moderate. LLM associations are more predictable and less creative, and tend to amplify the emotional load of the input word.
Conclusion: Although large language models show some human-like behaviour in word association, they differ clearly in creativity and emotional regulation, suggesting that their associative process is more mechanical and stable.
Abstract: Human word associations are a well-known method of gaining insight into the internal mental lexicon, but the responses spontaneously offered by human participants to word cues are not always predictable as they may be influenced by personal experience, emotions or individual cognitive styles. The ability to form associative links between seemingly unrelated concepts can be a driving mechanism of creativity. We compare the associative behaviour of humans and large language models. More specifically, we explore associations to emotionally loaded words and try to determine whether large language models generate associations in a similar way to humans. We find that the overlap between humans and LLMs is moderate, but also that the associations of LLMs tend to amplify the underlying emotional load of the stimulus, and that they tend to be more predictable and less creative than human ones.
[17] Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods
Eva Prakash,Maayane Attias,Pierre Chambon,Justin Xu,Steven Truong,Jean-Benoit Delbrouck,Tessa Cook,Curtis Langlotz
Main category: cs.CL
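TL;DR: This study trains a transformer-based de-identification model on large-scale radiology datasets and introduces a new AGE category, significantly improving detection of protected health information (PHI), surpassing existing academic and commercial systems and setting a new benchmark for clinical text processing.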
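The results for this entry are reported as token-level precision, recall, and F1 over PHI categories. A minimal sketch of how such scores can be computed from per-token labels is shown below. The label names, the `"O"` tag for non-PHI tokens, and the convention that a wrong category counts as both a false positive and a false negative are assumptions, not the paper's evaluation code.

```python
def token_prf1(gold, pred, outside="O"):
    """Token-level precision, recall, and F1 for PHI labels.

    gold, pred: equal-length lists of per-token labels; `outside` marks
    tokens that are not PHI (assumed tagging convention).
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if p != outside and p == g)
    fp = sum(1 for g, p in pairs if p != outside and p != g)
    fn = sum(1 for g, p in pairs if g != outside and p != g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For de-identification, recall is usually the more safety-critical number, since a missed PHI token is a privacy leak while a false positive only removes some non-sensitive text.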
Details
Motivation: Improving the accuracy and cross-institutional generalization of automated radiology report de-identification requires larger training data and more advanced model architectures, together with a performance comparison against commercial systems.
Method: Builds on a state-of-the-art transformer model, fine-tuned on two large annotated radiology corpora from Stanford University that cover several imaging types, and adds an AGE category. Model performance is evaluated on a University of Pennsylvania test set, a "hide-in-plain-sight" method is used to generate synthetic PHI for assessing stability, and the model is compared with commercial cloud systems.
Result: The model achieves overall F1 scores of 0.973 and 0.996 on the Penn and Stanford test sets respectively, outperforming or matching the previous state-of-the-art model. Synthetic PHI detection reaches an F1 of 0.959 and is stable across 50 independently de-identified datasets. On synthetic reports the model clearly outperforms all commercial systems (F1: 0.960 vs. 0.632-0.754).
Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms existing academic and commercial systems in PHI detection, with stronger cross-institutional generalization and privacy protection, and sets a new standard for secure clinical text processing.
Abstract: Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a "hide-in-plain-sight" method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.
[18] A Characterization of List Language Identification in the Limit
Moses Charikar,Chirag Pabbaraju,Ambuj Tewari
Main category: cs.CL
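TL;DR: This paper studies language identification in the limit when the learner may output k guesses at each step. It gives an exact characterization of the language collections that are k-list identifiable and proves that this is equivalent to decomposing the collection into k individually identifiable sub-collections. In the statistical setting it also analyzes identification rates: identifiable collections admit an exponential rate, while non-identifiable ones cannot be identified at any rate that goes to zero.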
Details
Motivation: Motivated by recent positive results on language generation, the paper revisits the classical language identification problem and examines what is possible, and how efficiently, when the learner may output k guesses at each step. Method: Building on Angluin's characterization of single-guess identification, the authors give a recursive version of the condition for k-list identification, establish an equivalent characterization via decomposition of the language collection, and then analyze identification rates under an i.i.d. input stream. Result: An exact characterization of k-list identifiable collections is obtained, showing that they decompose into k classically identifiable sub-collections; in the statistical setting, k-list identifiable collections can be identified at an exponential rate, and otherwise cannot be identified at any rate that goes to zero. Conclusion: A collection of languages is k-list identifiable if and only if it can be decomposed into k classically identifiable sub-collections, and in the statistical setting the exponential rate is optimal. Abstract: We study the problem of language identification in the limit, where given a sequence of examples from a target language, the goal of the learner is to output a sequence of guesses for the target language such that all the guesses beyond some finite time are correct. Classical results of Gold showed that language identification in the limit is impossible for essentially any interesting collection of languages. Later, Angluin gave a precise characterization of language collections for which this task is possible. Motivated by recent positive results for the related problem of language generation, we revisit the classic language identification problem in the setting where the learner is given the additional power of producing a list of $k$ guesses at each time step. The goal is to ensure that beyond some finite time, one of the guesses is correct at each time step. We give an exact characterization of collections of languages that can be $k$-list identified in the limit, based on a recursive version of Angluin's characterization (for language identification with a list of size $1$). This further leads to a conceptually appealing characterization: A language collection can be $k$-list identified in the limit if and only if the collection can be decomposed into $k$ collections of languages, each of which can be identified in the limit (with a list of size $1$). We also use our characterization to establish rates for list identification in the statistical setting where the input is drawn as an i.i.d. stream from a distribution supported on some language in the collection. Our results show that if a collection is $k$-list identifiable in the limit, then the collection can be $k$-list identified at an exponential rate, and this is best possible. On the other hand, if a collection is not $k$-list identifiable in the limit, then it cannot be $k$-list identified at any rate that goes to zero.
[19] Batch Prompting Suppresses Overthinking Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models
Wenmo Qiu,Saurabh Srivastava
Main category: cs.CL
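TL;DR: Batch prompting not only improves the inference efficiency of large reasoning models but also regularizes model behavior by suppressing overthinking and reducing hedging language, and it exhibits collective generalization effects across the examples in a batch.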
Details
Motivation: To explore the benefits of batching in large language models beyond the conventional amortization of inference cost, focusing on how it affects model behavior during multi-step reasoning. Method: A comprehensive study across 13 diverse benchmarks analyzes the effect of batching on accuracy, reasoning token usage, overthinking, and hedging language, and examines collective effects among the examples within a batch. Result: Batching markedly improves accuracy and cuts reasoning token usage by a factor of 3 to 5, suppresses overthinking and self-correction, yields more decisive answers, and shows cross-example generalization from easier to harder examples. Conclusion: Batching is not only a throughput optimization but also an effective inference-time regularizer that makes LLM reasoning more efficient and reliable. Abstract: Recent work has explored batch prompting as a strategy to amortize inference cost in large language models (LLMs). In this paper, we show that batching offers an additional, underappreciated benefit: it regularizes model behavior during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a comprehensive study across 13 diverse benchmarks and observe that batching improves accuracy while substantially reducing reasoning token usage, often by 3x-5x. Through detailed behavioral analysis, we find that batching suppresses overthinking, reduces hedging language (e.g., repetitive self-corrections), and encourages more decisive answers. Surprisingly, we also observe emergent collective effects in batched inference: models often generalize patterns from earlier examples to solve harder ones in the same batch. These findings position batching not just as a throughput optimization, but as a powerful inference-time regularizer for more efficient and reliable LLM reasoning.
[20] RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning
Xinyuan Li,Murong Xu,Wenbiao Tao,Hanlun Zhu,Yike Zhao,Jipeng Zhang,Yunshi Lan
Main category: cs.CL
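TL;DR: Proposes RIDE, a framework that uses Item Response Theory (IRT) and reinforcement learning to generate more challenging variants of math problems, enabling a stricter evaluation of the mathematical reasoning ability of large language models.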
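A minimal Python sketch of the IRT idea this entry relies on, assuming the standard two-parameter logistic (2PL) model. The function names and the add-one-smoothed difficulty proxy are illustrative assumptions; this is not RIDE's actual difficulty ranker, which the entry does not specify in detail.

```python
import math
from typing import Sequence


def p_correct(theta: float, difficulty: float, discrimination: float = 1.0) -> float:
    """2PL item response function: probability that a respondent with
    ability `theta` answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def rough_difficulty(responses: Sequence[int]) -> float:
    """Crude difficulty proxy for one item (an assumption, not a fitted IRT
    parameter): the log-odds of failure over all simulated students,
    where 1 = correct and 0 = wrong. Add-one smoothing keeps the value
    finite when every student agrees."""
    right = sum(responses)
    wrong = len(responses) - right
    return math.log((wrong + 1) / (right + 1))


# Items that more simulated students fail receive a higher difficulty value.
easy_item = [1, 1, 1, 0]
hard_item = [0, 0, 0, 1]
ranked = sorted([("easy", easy_item), ("hard", hard_item)],
                key=lambda pair: rough_difficulty(pair[1]))
```

A fitted IRT model would estimate ability and difficulty jointly from the full response matrix; the proxy above only orders items by how often the simulated students fail them.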
Details
Motivation: Existing evaluation methods rely on rule-based perturbations that produce ill-posed questions, make it hard to assess question difficulty systematically, and may overestimate models' true mathematical reasoning ability, so a more reliable adversarial evaluation method is needed. Method: RIDE combines IRT with a difficulty ranker built from 35 LLMs acting as simulated students, and uses reinforcement learning to guide a question-rewriting model that generates new questions across difficulty levels. Result: Applying RIDE to competition-level math benchmarks lowers the performance of 26 advanced LLMs by 21.73% on average, indicating limited robustness in current models' mathematical reasoning. Conclusion: RIDE generates high-quality, more challenging questions, exposes the fragility of existing LLMs in mathematical reasoning, and supports the validity and necessity of this evaluation approach. Abstract: Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. To this end, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade advanced LLM performance, with experiments showing an average 21.73% drop across 26 models, thereby exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.
[21] CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese
Dazhong Chen,Yi-Cheng Lin,Yuchen Huang,Ziwei Gong,Di Jiang,Zeying Xie,Yi R. Fung
Main category: cs.CL
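TL;DR: Proposes CantoASR, a collaborative framework combining ASR with a large audio-language model, which achieves substantial improvements in Cantonese speech recognition through forced alignment, a LoRA-finetuned Whisper, and an instruction-tuned Qwen-Audio.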
Details
Motivation: Cantonese is a low-resource language with scarce annotated data, a complex tonal system (six tones and tone sandhi), and accent variation; existing ASR systems have high error rates and handle tonal and prosodic information poorly. Method: The CantoASR framework (1) uses forced alignment to extract acoustic features, (2) fine-tunes Whisper with LoRA to improve tone discrimination, and (3) instruction-tunes Qwen-Audio for prosody-aware error correction. Result: On spontaneous Cantonese data, CantoASR substantially reduces character error rate (CER) relative to Whisper-Large-V3. Conclusion: Combining explicit acoustic cues (such as tone and prosody) with the contextual reasoning of large audio-language models offers a scalable, effective approach to speech recognition for low-resource tonal languages and dialects. Abstract: Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.
[22] BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation
Fahim Ahmed,Md Mubtasim Ahasan,Jahir Sadik Monon,Muntasir Wahed,M Ashraful Amin,A K M Mahbubur Rahman,Amin Ahsan Ali
Main category: cs.CL
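TL;DR: This paper explores three multi-agent LLM pipelines for improving the performance of small models on Text-to-SQL; experiments show that multi-agent collaboration can significantly raise execution accuracy.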
Details
Motivation: Existing large language models have limited ability to generate SQL when schemas are large and reasoning is complex, and most prior work focuses on complex, impractical pipelines while overlooking the potential of small, efficient models. Method: Three multi-agent LLM pipelines are proposed (a multi-agent discussion pipeline, a Planner-Coder pipeline, and a Coder-Aggregator pipeline) and benchmarked systematically across multiple open-source models. Result: Multi-agent discussion improves small-model execution accuracy by up to 10.6%; the LLM Reasoner-Coder pipeline performs best, raising Gemma 3 27B IT accuracy from 52.4% to 56.4%. Conclusion: Multi-agent collaboration frameworks can effectively improve small-model performance on Text-to-SQL and offer a new direction for efficient, practical natural language interfaces. Abstract: Text-to-SQL systems provide a natural language interface that can enable even laymen to access information stored in databases. However, existing Large Language Models (LLMs) struggle with SQL generation from natural instructions due to large schema sizes and complex reasoning. Prior work often focuses on complex, somewhat impractical pipelines using flagship models, while smaller, efficient models remain overlooked. In this work, we explore three multi-agent LLM pipelines, with systematic performance benchmarking across a range of small to large open-source models: (1) Multi-agent discussion pipeline, where agents iteratively critique and refine SQL queries, and a judge synthesizes the final answer; (2) Planner-Coder pipeline, where a thinking model planner generates stepwise SQL generation plans and a coder synthesizes queries; and (3) Coder-Aggregator pipeline, where multiple coders independently generate SQL queries, and a reasoning agent selects the best query. Experiments on the Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small model performance, with up to 10.6% increase in Execution Accuracy for Qwen2.5-7b-Instruct seen after three rounds of discussion. Among the pipelines, the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest score of 56.4%. Codes are available at https://github.com/treeDweller98/bappa-sql.
[23] Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains
Mohammed Musthafa Rafi,Adarsh Krishnamurthy,Aditya Balu
Main category: cs.CL
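TL;DR: This paper presents the LAAC (LLM as a Communicator) framework, which targets the "communication theater" created by the flood of AI-generated content by using LLMs as intermediaries that capture intent and convey knowledge in support of genuine communication. It systematically evaluates the resulting trust challenges along three dimensions: information fidelity, reproducibility, and response reliability.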
Details
Motivation: As AI-generated content proliferates, senders and recipients rely on LLMs to inflate and compress content, so neither side engages with authentic content; a new communication paradigm is needed to restore authenticity and effectiveness. Method: The LAAC multi-agent architecture captures the sender's intent through structured dialogue, and experiments across multiple communication scenarios evaluate three trust dimensions: information capture fidelity, reproducibility, and query response integrity. Result: The experiments reveal measurable trust gaps that must be addressed before LAAC is deployed in high-stakes communication scenarios, particularly in avoiding hallucination, source conflation, and information distortion. Conclusion: LAAC offers a new paradigm for AI-assisted communication, but deploying it as a trusted intermediary first requires solving core trust problems of information fidelity, consistency, and reliability. Abstract: The proliferation of AI-generated content has created an absurd communication theater where senders use LLMs to inflate simple ideas into verbose content, recipients use LLMs to compress them back into summaries, and as a consequence neither party engages with authentic content. LAAC (LLM as a Communicator) proposes a paradigm shift - positioning LLMs as intelligent communication intermediaries that capture the sender's intent through structured dialogue and facilitate genuine knowledge exchange with recipients. Rather than perpetuating cycles of AI-generated inflation and compression, LAAC enables authentic communication across diverse contexts including academic papers, proposals, professional emails, and cross-platform content generation. However, deploying LLMs as trusted communication intermediaries raises critical questions about information fidelity, consistency, and reliability. This position paper systematically evaluates the trustworthiness requirements for LAAC's deployment across multiple communication domains. We investigate three fundamental dimensions: (1) Information Capture Fidelity - accuracy of intent extraction during sender interviews across different communication types, (2) Reproducibility - consistency of structured knowledge across multiple interaction instances, and (3) Query Response Integrity - reliability of recipient-facing responses without hallucination, source conflation, or fabrication. Through controlled experiments spanning multiple LAAC use cases, we assess these trust dimensions using LAAC's multi-agent architecture. Preliminary findings reveal measurable trust gaps that must be addressed before LAAC can be reliably deployed in high-stakes communication scenarios.
[24] Computational Turing Test Reveals Systematic Differences Between Human and AI Language
Nicolò Pagan,Petter Törnberg,Christopher A. Bail,Anikó Hannák,Christopher Barrie
Main category: cs.CL
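TL;DR: This paper proposes a computational Turing test framework for assessing how closely LLM-generated text approximates human language, and compares nine open-weight LLMs under different calibration strategies. Even after calibration, LLM outputs remain clearly distinguishable from human text, and gains in human-likeness often come at the cost of semantic fidelity.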
Details
Motivation: Existing studies of LLMs simulating human behavior rely on insufficiently validated human-judgment-based evaluation and lack reliable, scalable tools for testing the realism of LLM-generated text, so a more robust validation framework is needed. Method: A computational Turing test framework combines aggregate metrics (such as BERT-based detectability and semantic similarity) with interpretable linguistic features (such as stylistic markers and topical patterns), and nine open-weight LLMs are systematically evaluated under five calibration strategies on their ability to reproduce user interactions on X, Bluesky, and Reddit. Result: LLM outputs remain clearly distinguishable from human text even after calibration, especially in emotional expression; instruction-tuned models underperform base models, larger model size does not improve human-likeness, and there is a trade-off between human-likeness and semantic fidelity. Conclusion: Current LLMs have limitations in simulating human communication and should be used with caution; the paper provides a scalable validation and calibration framework as a basis for future research. Abstract: Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations -- testing whether humans can distinguish AI from human output -- despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies -- including fine-tuning, stylistic prompting, and context retrieval -- benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations -- and offer a cautionary note about their current limitations in capturing human communication.
[25] LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal
Michał Karp,Anna Kubaszewska,Magdalena Król,Robert Król,Aleksander Smywiński-Pohl,Mateusz Szymański,Witold Wydmański
Main category: cs.CL
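TL;DR: This study evaluates whether current large language models (LLMs) can pass the official qualifying examination for Poland's National Appeal Chamber. Although the LLMs perform acceptably on the knowledge test, none reaches the passing threshold on the practical judgment-writing part, and automatic grading by an "LLM-as-a-judge" diverges from the official committee's assessments.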
Details
Motivation: To examine the potential of LLMs in high-stakes professional legal examinations and assess their reliability and limitations on real legal decision-making tasks. Method: LLMs sit the exam as candidates and their answers are graded automatically by other models using the 'LLM-as-a-judge' approach; a hybrid information retrieval and extraction pipeline is built, and several LLMs are tested in closed-book and Retrieval-Augmented Generation (RAG) settings. Result: The LLMs score satisfactorily on the multiple-choice part but none meets the passing threshold on the written judgment, and the automatically generated grades often disagree with the official examiners; the models show hallucinations, incorrect citations of legal provisions, and weak logical argumentation. Conclusion: Current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication, and close collaboration between legal experts and technical teams is needed to improve practical usefulness. Abstract: This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland's National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLMs as actual exam candidates and applying the 'LLM-as-a-judge' approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which includes a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information recovery and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part, and the evaluations of the 'LLM-as-a-judge' often diverged from the judgments of the official examining committee. The authors highlight key limitations: susceptibility to hallucinations, incorrect citation of legal provisions, weaknesses in logical argumentation, and the need for close collaboration between legal experts and technical teams. The findings indicate that, despite rapid technological progress, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.
[26] REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs
Liran Cohen,Yaniv Nemcovesky,Avi Mendelson
Main category: cs.CL
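TL;DR: Proposes REMIND, a method that detects residual memorization of unlearned data by analyzing the model's loss dynamics under small input variations, giving a more sensitive evaluation of machine unlearning.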
Details
Motivation: Existing unlearning evaluations operate only at the level of individual inputs and may miss residual influence in semantically similar examples, which can leak private information. More reliable evaluation is needed to confirm that a model has truly forgotten the target data. Method: Using only query access to the model, REMIND analyzes how the loss changes under input variations in the neighborhood of the target data. It uses the flatness of the loss landscape to judge whether unlearning was effective: unlearned data correspond to flatter loss landscapes, whereas retained or unrelated data show sharper, more volatile patterns. Result: REMIND outperforms existing methods and is robust across models, datasets, and paraphrased inputs, detects residual memorization that conventional methods miss, and requires only query access, making it suitable for practical deployment. Conclusion: REMIND provides a more sensitive, interpretable, and practical framework for evaluating machine unlearning, reveals the neighborhood dynamics of the unlearning process, and offers a new perspective on assessing unlearning in language models. Abstract: Machine unlearning aims to remove the influence of specific training data from a model without requiring full retraining. This capability is crucial for ensuring privacy, safety, and regulatory compliance. Therefore, verifying whether a model has truly forgotten target data is essential for maintaining reliability and trustworthiness. However, existing evaluation methods often assess forgetting at the level of individual inputs. This approach may overlook residual influence present in semantically similar examples. Such influence can compromise privacy and lead to indirect information leakage. We propose REMIND (Residual Memorization In Neighborhood Dynamics), a novel evaluation method aiming to detect the subtle remaining influence of unlearned data and classify whether the data has been effectively forgotten. REMIND analyzes the model's loss over small input variations and reveals patterns unnoticed by single-point evaluations. We show that unlearned data yield flatter, less steep loss landscapes, while retained or unrelated data exhibit sharper, more volatile patterns. REMIND requires only query-based access, outperforms existing methods under similar constraints, and demonstrates robustness across different models, datasets, and paraphrased inputs, making it practical for real-world deployment. By providing a more sensitive and interpretable measure of unlearning effectiveness, REMIND provides a reliable framework to assess unlearning in language models. As a result, REMIND offers a novel perspective on memorization and unlearning.
[27] Reusing Pre-Training Data at Test Time is a Compute Multiplier
Alex Fang,Thomas Voice,Ruoming Pang,Ludwig Schmidt,Tom Gunter
Main category: cs.CL
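TL;DR: The study shows that current pre-training methods do not fully use the information in existing datasets; retrieval-augmented generation and test-time compute substantially improve model performance, indicating that much knowledge is left unextracted during pre-training.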
Details
Motivation: To understand how efficiently current pre-training extracts knowledge from data and to explore how to make better use of the information in existing datasets. Method: Retrieval-augmented generation (RAG) combined with test-time compute is used to measure, at different scales, how retrieving from the original pre-training datasets affects model performance, with a decontamination analysis to verify robustness. Result: Retrieval yields significant accuracy gains on MMLU, Math-500, and SimpleQA; on MMLU, retrieval is equivalent to roughly a 5x compute multiplier, and additional test-time compute adds a further 10 percentage points (LLaMA 3.1 8B). Conclusion: Current pre-training methods do not extract the full value of the data; combining retrieval with test-time compute unlocks remaining potential, leaving significant room for improvement. Abstract: Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains in MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today's pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.
[28] Efficient Topic Extraction via Graph-Based Labeling: A Lightweight Alternative to Deep Models
Salma Mekaoui,Hiba Sofyan,Imane Amaaz,Imane Benchrif,Arsalane Zarghili,Ilham Chaker,Nikola S. Nikolov
Main category: cs.CL
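TL;DR: This paper proposes a lightweight graph-based method for topic labeling that assigns interpretable labels to the word distributions produced by topic models through semantic expansion and relationship analysis, achieving performance close to ChatGPT-3.5 while remaining computationally efficient.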
Details
Motivation: Topic word distributions produced by existing topic modeling methods lack interpretability, and mainstream solutions rely on computationally expensive models that are hard to apply widely. Method: A graph-based method expands topic words with semantically related terms and analyzes the relationships among words through the graph structure to derive meaningful topic labels. Result: Compared with several benchmarks (including ChatGPT-3.5) on two datasets, the method outperforms traditional benchmarks on BERTScore and cosine similarity and is comparable to ChatGPT-3.5 at lower computational cost. Conclusion: The proposed graph method is both efficient and effective for topic labeling, offers a practical route to topic interpretability in low-resource settings, and points to future research directions for improving automation and interpretability. Abstract: Extracting topics from text has become an essential task, especially with the rapid growth of unstructured textual data. Most existing works rely on highly computational methods to address this challenge. In this paper, we argue that probabilistic and statistical approaches, such as topic modeling (TM), can offer effective alternatives that require fewer computational resources. TM is a statistical method that automatically discovers topics in large collections of unlabeled text; however, it produces topics as distributions of representative words, which often lack clear interpretability. Our objective is to perform topic labeling by assigning meaningful labels to these sets of words. To achieve this without relying on computationally expensive models, we propose a graph-based approach that not only enriches topic words with semantically related terms but also explores the relationships among them. By analyzing these connections within the graph, we derive suitable labels that accurately capture each topic's meaning. We present a comparative study between our proposed method and several benchmarks, including ChatGPT-3.5, across two different datasets. Our method achieved consistently better results than traditional benchmarks in terms of BERTScore and cosine similarity and produced results comparable to ChatGPT-3.5, while remaining computationally efficient. Finally, we discuss future directions for topic labeling and highlight potential research avenues for enhancing interpretability and automation.
[29] SSPO: Subsentence-level Policy Optimization
Kun Yang,Zikang Chen,Yanmeng Wang,Zhigen Li
Main category: cs.CL
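TL;DR: This paper proposes SSPO, which introduces a sentence-level importance ratio and a dynamic clipping mechanism based on sentence entropy to balance the strengths of GRPO and GSPO, improving training stability and data utilization in reinforcement learning for large language models and outperforming existing methods on multiple datasets.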
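A minimal Python sketch of what a sentence-level importance ratio could look like. It assumes the ratio for a sentence is the geometric mean of its token-level ratios (the length-normalized form GSPO uses for whole responses); the entry does not give SSPO's exact formula, so the function names, the `sentence_ids` segmentation input, and the fixed clipping bounds are all illustrative assumptions. The entropy-based adjustment of the clipping bounds is left out.

```python
import math
from typing import List, Sequence


def sentence_importance_ratios(logp_new: Sequence[float],
                               logp_old: Sequence[float],
                               sentence_ids: Sequence[int]) -> List[float]:
    """Return one importance ratio per token, shared by all tokens of a sentence.

    Assumed form: exp(mean(logp_new - logp_old)) over the tokens of each
    sentence, so a single outlier token is averaged with its neighbours
    instead of acting alone (token level) or skewing the whole response
    (response level)."""
    sums, counts = {}, {}
    for new, old, sid in zip(logp_new, logp_old, sentence_ids):
        sums[sid] = sums.get(sid, 0.0) + (new - old)
        counts[sid] = counts.get(sid, 0) + 1
    return [math.exp(sums[sid] / counts[sid]) for sid in sentence_ids]


def clipped_objective(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.2) -> float:
    """Standard PPO-CLIP surrogate for one token: min(r*A, clip(r)*A).
    An entropy-aware variant would make eps_low/eps_high depend on the sentence."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Two sentences: tokens 0-1 belong to sentence 0, token 2 to sentence 1.
ratios = sentence_importance_ratios([-1.0, -1.0, -2.0],
                                    [-1.0, -1.0, -1.0],
                                    [0, 0, 1])
```

In this toy input the shifted token only affects the ratio of its own sentence, which is the property the entry attributes to the sentence-level design.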
Details
Motivation: Existing RLVR algorithms such as GRPO and GSPO fall short in training stability or in utilization of sampled data: GRPO's token-level importance ratio is sensitive to outliers and can cause training collapse, while GSPO's single ratio shared across the whole response lets extreme values skew the overall judgment and lowers data utilization. A more balanced method is needed. Method: SSPO (Subsentence-level Policy Optimization) uses a sentence-level importance ratio, which keeps a single anomalous token from dominating training and prevents an entire response from being wrongly discarded; it also uses sentence entropy to adjust the PPO-CLIP clipping bounds dynamically, encouraging exploration for high-entropy tokens and narrowing the clipping range for low-entropy tokens. Result: SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and is best on three datasets, supporting its advantages in data utilization and training stability. Conclusion: With a sentence-level importance ratio and dynamic clipping, SSPO balances the strengths and weaknesses of GRPO and GSPO, improves the reasoning ability and training efficiency of LLMs under reinforcement learning, and is an effective improvement within the RLVR framework. Abstract: As a significant part of post-training of Large Language Models (LLMs), Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved LLMs' reasoning skills. However, some RLVR algorithms, such as GRPO (Group Relative Policy Optimization) and GSPO (Group Sequence Policy Optimization), are observed to suffer from unstable policy updates and low usage of sampling data, respectively. The importance ratio of GRPO is calculated at the token level, which focuses more on optimizing a single token. This will be easily affected by outliers, leading to model training collapse. GSPO proposed the calculation of the response level importance ratio, which solves the problem of high variance and training noise accumulation in the calculation of the GRPO importance ratio. However, since all the response tokens share a common importance ratio, extreme values can easily raise or lower the overall mean, leading to the entire response being mistakenly discarded, resulting in a decrease in the utilization of sampled data. This paper introduces SSPO, which applies a sentence-level importance ratio, striking a balance between GRPO and GSPO. SSPO not only avoids training collapse and high variance, but also prevents the whole response tokens from being abandoned by the clipping mechanism. Furthermore, we apply sentence entropy to PPO-CLIP to steadily adjust the clipping bounds, encouraging high-entropy tokens to explore and narrowing the clipping range of low-entropy tokens. In particular, SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and achieves state-of-the-art performance on three datasets. These results highlight SSPO's effectiveness in leveraging generated data by keeping the strengths of GSPO while avoiding its shortcomings.
[30] Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning
Mohammad Amin Ghanizadeh,Mohammad Javad Dousti
Main category: cs.CL
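TL;DR: Proposes a data selection method for fine-tuning machine translation models, based on a learnability score and a batch selection strategy, that substantially improves data efficiency and translation performance.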
Details
Motivation: To improve the training efficiency and translation quality of machine translation models and address the low efficiency and poor generalization of conventional data selection methods. Method: The approach leverages the synergy between a learner model and a pre-trained reference model, defines a learnability score to assess the training value of each data point, and selects data with a batch selection strategy that accounts for interdependencies among data points. Result: Experiments on English-to-Persian and several other language pairs show that, relative to random selection, the method improves computational efficiency by a factor of 24 when using cached embeddings, improves data efficiency by up to fivefold, and significantly improves translation performance. Conclusion: The proposed data selection method effectively improves data utilization, computational efficiency, and model generalization when fine-tuning machine translation models. Abstract: Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to enhance overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy which considers interdependencies among data points, optimizing the efficiency of the training process while maintaining a focus on data relevance. Experiments on English to Persian and several other language pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that our method can achieve up to a fivefold improvement in data efficiency compared to an iid baseline. Experimental results indicate that our approach improves computational efficiency by a factor of 24 when utilizing cached embeddings, as it requires fewer training data points. Additionally, it enhances generalization, resulting in superior translation performance compared to a random selection method.
[31] If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLMs
Lars Bungum,Charles Yijia Huang,Abeer Kashar
Main category: cs.CL
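TL;DR: This study examines the temporal reasoning ability of large language models (LLMs) in a 1940 setting, using questions from a Norwegian trivia book posed in English and Norwegian, with answers graded by an LLM and verified by human spot checks. English prompts work better than Norwegian ones, and larger models perform better.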
Details
Motivation: To explore how well LLMs reason accurately within a historical time frame, especially when they must simulate a past state of knowledge. Method: Questions from a 1940 Norwegian trivia book are posed in both English and Norwegian to several LLMs (DeepSeek-R1, Gemma3, Qwen3, Llama3.1, and the largest LLM built specifically for Norwegian); answer accuracy is graded with the LLM-as-judge approach, supplemented by spot checks from a native speaker. Result: English prompts consistently outperform Norwegian prompts; larger models perform better; the large Norwegian-specific model does not surpass general-purpose large models on the native-language task. Conclusion: Model scale helps on temporal reasoning tasks, while the effect of prompt language (English better than Norwegian) may reflect training data bias or the dominance of English in multilingual models. Abstract: In this study, we experiment with the ability of LLMs to do temporal reasoning. Using a Norwegian book from 1940 containing trivia questions, we prompt the LLMs to answer the questions as if it were 1940. We also pose the questions in both English and Norwegian. Correct answers are often presented as sentences, and grading is done by means of LLM-as-judge, with sampled checks by a native speaker. Prompting in English consistently gave better results than in Norwegian, an unexpected result. In contrast, using larger LLMs improved results. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families, and also the largest available LLM especially crafted for Norwegian.
[32] Probabilistic Textual Time Series Depression Detection
Fabian Schmidt,Seyedehmoniba Ravan,Vladimir Vlassov
Main category: cs.CL
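TL;DR: Proposes PTTSD, a textual time series framework for predicting depression severity that incorporates uncertainty modeling to deliver accurate, interpretable PHQ-8 score predictions, achieving state-of-the-art performance among text-only systems on multiple datasets.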
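A minimal Python sketch of the Gaussian negative log-likelihood that a probabilistic output head of this kind is typically trained with, plus the prediction interval it implies. The function names and the scalar (non-batched) form are illustrative assumptions, not the PTTSD implementation, and the Student-t variant is not shown.

```python
import math
from typing import Tuple


def gaussian_nll(y: float, mu: float, sigma: float) -> float:
    """Negative log-likelihood of target y under N(mu, sigma^2):
    0.5 * log(2*pi*sigma^2) + (y - mu)^2 / (2*sigma^2)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)


def prediction_interval(mu: float, sigma: float, z: float = 1.96) -> Tuple[float, float]:
    """Symmetric interval mu +/- z*sigma (z = 1.96 gives roughly 95% under a Gaussian)."""
    return mu - z * sigma, mu + z * sigma


# A confident but wrong prediction is penalised more than an uncertain one:
overconfident = gaussian_nll(y=5.0, mu=0.0, sigma=1.0)
hedged = gaussian_nll(y=5.0, mu=0.0, sigma=3.0)
```

Because the loss rewards a larger sigma when the mean is far from the target, a model trained this way learns to widen its intervals on hard cases, which is what makes calibration analysis meaningful.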
Details
Motivation: Existing depression severity models usually lack uncertainty estimates and temporal modeling, which limits their trustworthiness and usefulness for clinical decision support. Method: PTTSD combines bidirectional LSTMs, self-attention, and residual connections in sequence-to-sequence and sequence-to-one variants, with Gaussian or Student-t output heads trained via negative log-likelihood, to produce time series predictions with uncertainty estimates. Result: On E-DAIC and DAIC-WOZ, PTTSD reaches state-of-the-art performance among text-only systems (e.g., MAE = 3.85 on E-DAIC and 3.55 on DAIC) and produces well-calibrated prediction intervals; ablations confirm the value of attention and probabilistic modeling, and the comparison with MentalBERT supports the method's generality. Conclusion: By adding uncertainty modeling and temporal analysis, PTTSD improves the accuracy and interpretability of depression severity prediction and shows potential for clinical use, particularly in providing reliable confidence estimates. Abstract: Accurate and interpretable predictions of depression severity are essential for clinical decision support, yet existing models often lack uncertainty estimates and temporal modeling. We propose PTTSD, a Probabilistic Textual Time Series Depression Detection framework that predicts PHQ-8 scores from utterance-level clinical interviews while modeling uncertainty over time. PTTSD includes sequence-to-sequence and sequence-to-one variants, both combining bidirectional LSTMs, self-attention, and residual connections with Gaussian or Student-t output heads trained via negative log-likelihood. Evaluated on E-DAIC and DAIC-WOZ, PTTSD achieves state-of-the-art performance among text-only systems (e.g., MAE = 3.85 on E-DAIC, 3.55 on DAIC) and produces well-calibrated prediction intervals. Ablations confirm the value of attention and probabilistic modeling, while comparisons with MentalBERT establish generality. A three-part calibration analysis and qualitative case studies further highlight the interpretability and clinical relevance of uncertainty-aware forecasting.
[33] ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Surapon Nonesung,Teetouch Jaknamon,Sirinya Chaiophat,Natapong Nitarach,Chanakan Wittayasakpan,Warit Sirichotedumrong,Adisai Na-Thalang,Kunat Pipatanakul
Main category: cs.CL
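TL;DR: ThaiOCRBench is the first comprehensive benchmark for Thai text-rich visual understanding tasks, with 2,808 human-annotated samples across 13 task categories, for evaluating multimodal models in low-resource, script-complex settings.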
Details
Motivation: Existing vision-language model benchmarks focus mainly on high-resource languages, and Thai is underrepresented in tasks such as document structure understanding, so a dedicated benchmark is needed. Method: The authors build ThaiOCRBench, a diverse, human-annotated dataset, systematically evaluate a range of state-of-the-art vision-language models (both proprietary and open-source) in a zero-shot setting, and carry out a detailed error analysis. Result: Proprietary models (e.g., Gemini 2.5 Pro) clearly outperform open-source models; open-source models are weakest on fine-grained text recognition and handwritten content extraction; key challenges include language bias, structural mismatch, and hallucinated content. Conclusion: ThaiOCRBench provides a standardized framework for evaluating vision-language models on low-resource languages such as Thai and offers actionable insights for improving Thai document understanding. Abstract: We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and offers actionable insights for improving Thai-language document understanding.
[34] RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables
Nikhil Abhyankar,Purvi Chaurasia,Sanchit Kabra,Ananya Srivastava,Vivek Gupta,Chandan K. Reddy
Main category: cs.CL
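TL;DR: RUST-BENCH is a new benchmark for evaluating LLM reasoning over real, complex tabular data, covering scale, heterogeneity, domain specificity, and reasoning complexity.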
Details
Motivation: Existing tabular reasoning benchmarks mostly test small, uniform tables, so they neither reflect the complexity of real-world data nor give a complete picture of LLM reasoning ability.
Method: The authors build RUST-BENCH, a benchmark of 7966 questions over 2031 real-world tables from two domains (NSF grant records and NBA statistics), and evaluate models on large-scale, heterogeneous, domain-specific, multi-hop reasoning tasks.
Result: Experiments show that current LLMs handle heterogeneous schemas and complex multi-hop reasoning poorly, exposing the limits of current architectures and prompting strategies.
Conclusion: RUST-BENCH offers a challenging new testbed for advancing tabular reasoning research.
Abstract: Existing tabular reasoning benchmarks mostly test models on small, uniform tables, underrepresenting the complexity of real-world data and giving an incomplete view of Large Language Models' (LLMs) reasoning abilities. Real tables are long, heterogeneous, and domain-specific, mixing structured fields with free text and requiring multi-hop reasoning across thousands of tokens. To address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from 2031 real-world tables spanning two domains: i) RB-Science (NSF grant records) and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates LLMs jointly across scale, heterogeneity, domain specificity, and reasoning complexity. Experiments with open-source and proprietary models show that LLMs struggle with heterogeneous schemas and complex multi-hop inference, revealing persistent weaknesses in current architectures and prompting strategies. RUST-BENCH establishes a challenging new testbed for advancing tabular reasoning research.
[35] OUNLP at TSAR 2025 Shared Task: Multi-Round Text Simplifier via Code Generation
Cuong Huynh,Jie Cao
Main category: cs.CL
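TL;DR: This paper proposes multi-round text simplification methods generated via GPT-4o and finds that the gap between the source and target CEFR levels strongly affects simplification performance.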
Details
Motivation: To examine how the gap between the source text and the target readability level affects prompt-driven text simplification, and to improve the simplification strategy accordingly.
Method: Two multi-round simplification methods are proposed, the rule-based MRS-Rule and MRS-Joint, which combines rules with an LLM; both are generated via GPT-4o.
Result: In the TSAR-2025 shared task the system ranked 7th among 20 teams; follow-up experiments show that starting from LLM-simplified text further improves performance.
Conclusion: The CEFR level gap is a key factor in prompt-based simplification performance, and multi-round strategies, MRS-Joint in particular, have room for further optimization.
Abstract: This paper describes the OUNLP system submitted to the TSAR-2025 Shared Task (Alva-Manchego et al., 2025), designed for readability-controlled text simplification using LLM-prompting-based generation. Analyzing prompt-based text simplification methods, we found that simplification performance is strongly related to the gap between the source CEFR (Arase et al., 2022) level and the target CEFR level. Inspired by this finding, we propose two multi-round simplification methods and generate them via GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based LLM simplification (MRS-Joint). Our submitted systems ranked 7th out of 20 teams. Later improvements with MRS-Joint show that taking the LLM-simplified candidates as the starting point could further boost multi-round simplification performance.
[36] Decoding Emergent Big Five Traits in Large Language Models: Temperature-Dependent Expression and Architectural Clustering
Christos-Nikolaos Zacharopoulos,Revekka Kyriakoglou
Main category: cs.CL
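TL;DR: Using the BFI-2 framework, this study systematically evaluates Big Five trait expression in six LLMs at different sampling temperatures. Neuroticism and Extraversion are sensitive to temperature, and model architecture may lead to stable clusters of trait profiles.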
Details
Motivation: As LLMs spread through human-centered applications, understanding their personality-like behavior is essential for responsible development and deployment.
Method: The Big Five Inventory-2 (BFI-2) framework is used to assess the personality traits of six LLMs at different sampling temperatures, and hierarchical clustering is used to analyze trait patterns across models.
Result: Significant differences appear in four of the five dimensions, with Neuroticism and Extraversion affected by temperature; hierarchical clustering shows that the models form distinguishable groups, suggesting that architecture may influence the stability of trait profiles.
Conclusion: LLMs show measurable personality-like patterns that depend on temperature and architecture, and the study offers a new perspective on model tuning, model selection, and the ethical governance of AI.
Abstract: As Large Language Models (LLMs) become integral to human-centered applications, understanding their personality-like behaviors is increasingly important for responsible development and deployment. This paper systematically evaluates six LLMs, applying the Big Five Inventory-2 (BFI-2) framework, to assess trait expressions under varying sampling temperatures. We find significant differences across four of the five personality dimensions, with Neuroticism and Extraversion susceptible to temperature adjustments. Further, hierarchical clustering reveals distinct model clusters, suggesting that architectural features may predispose certain models toward stable trait profiles. Taken together, these results offer new insights into the emergence of personality-like patterns in LLMs and provide a new perspective on model tuning, selection, and the ethical governance of AI systems. We share the data and code for this analysis here: https://osf.io/bsvzc/?view_only=6672219bede24b4e875097426dc3fac1
[37] RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG
Joshua Gao,Quoc Huy Pham,Subin Varghese,Silwal Saurav,Vedhus Hoskere
Main category: cs.CL
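TL;DR: This paper presents RAGalyst, an automated, human-aligned agentic framework for evaluating domain-specific retrieval-augmented generation (RAG) systems. By generating high-quality synthetic QA data and optimizing LLM-as-a-Judge metrics, it supports high-fidelity evaluation in domains such as military operations, cybersecurity, and bridge engineering.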
Details
Motivation: Existing RAG evaluation methods struggle to capture domain nuances in specialized, safety-critical fields and lack validated agreement with human judgment, so a more reliable and generalizable evaluation framework is needed.
Method: The RAGalyst framework includes an agentic pipeline that generates synthetic QA data from source documents, with a filtering step to ensure data fidelity. It also optimizes two LLM-as-a-Judge metrics, Answer Correctness and Answerability, so that they correlate strongly with human annotations.
Result: Applying the framework in three different domains shows that RAG performance is highly context-dependent and that no single model or configuration is always best; the study also identifies the main causes of low Answer Correctness.
Conclusion: RAGalyst offers a systematic, reproducible evaluation method that helps practitioners uncover domain-specific trade-offs, make better design decisions, and improve the reliability and effectiveness of RAG systems.
Abstract: Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances, while other works use LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics, Answer Correctness and Answerability, using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent. No single embedding model, LLM, or hyperparameter configuration proves universally optimal. Additionally, we provide an analysis of the most common reasons for low Answer Correctness in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems. RAGalyst is available on our GitHub.
[38] Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways
Paloma Rabaey,Jong Hak Moon,Jung-Oh Lee,Min Gwan Kim,Hangyul Yoon,Thomas Demeester,Edward Choi
Main category: cs.CL
TL;DR: Proposes a two-part framework for handling explicit and implicit uncertainty in radiology reports and releases the enriched Lunguage++ dataset.
Details
Motivation: Radiology reports contain both explicit and implicit uncertainty, which affects the accuracy of automated analysis and calls for finer-grained modeling.
Method: Explicit uncertainty is quantified with an expert-validated, LLM-based reference ranking of hedging phrases, and implicit uncertainty is modeled with an expansion framework built on expert-defined diagnostic pathways.
Result: The authors build the Lunguage++ dataset, which supports fine-grained, uncertainty-aware image classification and diagnostic reasoning.
Conclusion: The framework captures both kinds of uncertainty in radiology reports and provides a more reliable structured resource for clinical decision-making and automated analysis.
Abstract: Radiology reports are invaluable for clinical decision-making and hold great potential for automated analysis when structured into machine-readable formats. These reports often contain uncertainty, which we categorize into two distinct types: (i) Explicit uncertainty reflects doubt about the presence or absence of findings, conveyed through hedging phrases. These vary in meaning depending on the context, making rule-based systems insufficient to quantify the level of uncertainty for specific findings; (ii) Implicit uncertainty arises when radiologists omit parts of their reasoning, recording only key findings or diagnoses. Here, it is often unclear whether omitted findings are truly absent or simply unmentioned for brevity. We address these challenges with a two-part framework. We quantify explicit uncertainty by creating an expert-validated, LLM-based reference ranking of common hedging phrases, and mapping each finding to a probability value based on this reference. In addition, we model implicit uncertainty through an expansion framework that systematically adds characteristic sub-findings derived from expert-defined diagnostic pathways for 14 common diagnoses. Using these methods, we release Lunguage++, an expanded, uncertainty-aware version of the Lunguage benchmark of fine-grained structured radiology reports. This enriched resource enables uncertainty-aware image classification, faithful diagnostic reasoning, and new investigations into the clinical impact of diagnostic uncertainty.
[39] Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics
Amir Zur,Atticus Geiger,Ekdeep Singh Lubana,Eric Bigelow
Main category: cs.CL
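TL;DR: The study finds that a language model's hidden activations reflect its uncertainty during reasoning and can be used to predict and control which path the model takes during generation.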
Details
Motivation: Uncertainty during text generation is hard to quantify; this work asks whether a model's internal representations contain the alternative reasoning paths it could take.
Method: The authors analyze hidden activations during chain-of-thought reasoning, test how they relate to model uncertainty, and try to steer model behavior by controlling the activations.
Result: There is a clear correlation between how uncertain the model is at a given token and how easily it can be steered through its activations; hidden activations also predict the model's future outcome distribution, which indicates that the model implicitly represents the space of possible reasoning paths.
Conclusion: Hidden activations reflect current uncertainty and also encode multiple reasoning paths; activation interventions are most effective before the model has committed to a final answer.
Abstract: When a language model generates text, the selection of individual tokens might lead it down very different reasoning paths, making uncertainty difficult to quantify. In this work, we consider whether reasoning language models represent the alternate paths that they could take during generation. To test this hypothesis, we use hidden activations to control and predict a language model's uncertainty during chain-of-thought reasoning. In our experiments, we find a clear correlation between how uncertain a model is at different tokens, and how easily the model can be steered by controlling its activations. This suggests that activation interventions are most effective when there are alternate paths available to the model -- in other words, when it has not yet committed to a particular final answer. We also find that hidden activations can predict a model's future outcome distribution, demonstrating that models implicitly represent the space of possible paths.
[40] IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection
Kaveh Eskandari Miandoab,Katharine Kowalyshyn,Kabir Pamnani,Anesu Gavhera,Vasanth Sarathy,Matthias Scheutz
Main category: cs.CL
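TL;DR: IntelliProof is an interactive system that uses large language models (LLMs) to analyze argumentative essays. It structures an essay as an argumentation graph, emphasizes the user experience, and provides visualization, classification justifications, and quantitative coherence measures.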
Details
Motivation: Existing automated essay scoring systems lack in-depth analysis of argument structure and a good interactive experience, which makes it hard to assess the logical coherence and argumentative quality of an essay.
Method: An LLM classifies and scores support or attack relations between claims, and the system builds an argumentation graph of claim nodes, evidence properties, and relation edges, which it presents through a visual interface.
Result: The system visualizes the argumentation graph, explains its relation classifications, and gives a quantitative measure of essay coherence, supporting rapid assessment of argumentative quality while retaining human oversight.
Conclusion: IntelliProof combines LLM capabilities with human oversight and makes it more efficient to understand and analyze the structure and argumentative quality of an essay.
Abstract: We present IntelliProof, an interactive system for analyzing argumentative essays through LLMs. IntelliProof structures an essay as an argumentation graph, where claims are represented as nodes, supporting evidence is attached as node properties, and edges encode supporting or attacking relations. Unlike existing automated essay scoring systems, IntelliProof emphasizes the user experience: each relation is initially classified and scored by an LLM, then visualized for enhanced understanding. The system provides justifications for classifications and produces quantitative measures for essay coherence. It enables rapid exploration of argumentative quality while retaining human oversight. In addition, IntelliProof provides a set of tools for a better understanding of an argumentative essay and its corresponding graph in natural language, bridging the gap between the structural semantics of argumentative essays and the user's understanding of a given text. A live demo and the system are available here to try: https://intelliproof.vercel.app
[41] From Model to Breach: Towards Actionable LLM-Generated Vulnerabilities Reporting
Cyril Vallez,Alexander Sternfeld,Andrei Kucharavy,Ljiljana Dolamic
Main category: cs.CL
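TL;DR: This paper studies the security vulnerabilities that LLM-based coding assistants introduce into generated code. It shows that current mainstream open-weight models remain exposed to early, well-known vulnerabilities and proposes a new risk metric, Prompt Exposure (PE), together with a Model Exposure (ME) score, to assess and help mitigate the severity and prevalence of LLM-generated vulnerabilities.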
Details
Motivation: As LLMs become widely used in software development, vulnerabilities in the code they generate pose a growing cybersecurity threat. The effect of existing security benchmarks and improvement methods on models in actual use is still unclear, so effective evaluation and mitigation mechanisms are needed.
Method: The paper proposes the Prompt Exposure (PE) metric, which accounts for vulnerability severity, generation probability, and the formulation of the prompt that induces the vulnerability, and then defines the Model Exposure (ME) score to measure a model's overall risk of generating vulnerabilities.
Result: The latest open-weight LLMs still produce early, well-known vulnerability types in realistic usage, suggesting that the safety-functionality trade-off has prevented effective patching; PE and ME quantify a model's security risk.
Conclusion: Current LLM coding assistants carry persistent security risks, and new metrics such as PE and ME are needed to drive more effective mitigation that balances safety and functionality.
Abstract: As the role of Large Language Model (LLM)-based coding assistants in software development becomes more critical, so does the role of the bugs they generate in the overall cybersecurity landscape. While a number of LLM code security benchmarks have been proposed alongside approaches to improve the security of generated code, it remains unclear to what extent they have impacted widely used coding LLMs. Here, we show that even the latest open-weight models are vulnerable in the earliest reported vulnerability scenarios in a realistic use setting, suggesting that the safety-functionality trade-off has until now prevented effective patching of vulnerabilities. To help address this issue, we introduce a new severity metric, Prompt Exposure (PE), that reflects the risk posed by an LLM-generated vulnerability, accounting for vulnerability severity, generation chance, and the formulation of the prompt that induces vulnerable code generation. To encourage the mitigation of the most serious and prevalent vulnerabilities, we use PE to define the Model Exposure (ME) score, which indicates the severity and prevalence of vulnerabilities a model generates.
[42] BanglaMedQA and BanglaMMedBench: Evaluating Retrieval-Augmented Generation Strategies for Bangla Biomedical Question Answering
Sadia Sultana,Saiyma Sittul Muna,Mosammat Zannatul Samarukh,Ajwad Abrar,Tareque Mohmud Chowdhury
Main category: cs.CL
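TL;DR: This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical multiple-choice question datasets, and evaluates several retrieval-augmented generation (RAG) strategies. An Agentic RAG approach that combines textbook and web retrieval reaches 89.54% accuracy with openai/gpt-oss-120b, the best of the configurations tested.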
Details
Motivation: In low-resource languages, progress on biomedical question answering is limited, which leads to unequal access to reliable medical knowledge. Addressing this requires high-quality medical QA datasets and effective models for non-English languages such as Bangla.
Method: Several RAG strategies are proposed and compared (Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG), combining OCR-extracted text from Bangla medical textbooks with web retrieval and using generative reasoning to improve factual accuracy; the Agentic RAG dynamically chooses between retrieval and reasoning paths.
Result: Agentic RAG achieves the highest accuracy, 89.54%, with the openai/gpt-oss-120b model, outperforming the other configurations and producing better rationales.
Conclusion: RAG-based methods improve the accuracy and reliability of Bangla medical QA, and the datasets and methods lay a foundation for multilingual medical AI research.
Abstract: Developing accurate biomedical Question Answering (QA) systems in low-resource languages remains a major challenge, limiting equitable access to reliable medical knowledge. This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical artificial intelligence (AI). The study applies and benchmarks several Retrieval-Augmented Generation (RAG) strategies, including Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining textbook-based and web retrieval with generative reasoning to improve factual accuracy. A key novelty lies in integrating a Bangla medical textbook corpus through Optical Character Recognition (OCR) and implementing an Agentic RAG pipeline that dynamically selects between retrieval and reasoning strategies. Experimental results show that the Agentic RAG achieved the highest accuracy (89.54%) with openai/gpt-oss-120b, outperforming other configurations and demonstrating superior rationale quality. These findings highlight the potential of RAG-based methods to enhance the reliability and accessibility of Bangla medical QA, establishing a foundation for future research in multilingual medical artificial intelligence.
[43] When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection
Alamgir Munir Qazi,John P. McCrae,Jamal Abdul Nasir
Main category: cs.CL
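TL;DR: This paper presents DeReC, a lightweight fact verification framework that combines general-purpose text embeddings, dense retrieval, and classification, and that outperforms LLM-based explanation-generating methods in both accuracy and efficiency.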
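To make the retrieval step summarized above concrete, here is a minimal sketch of dense evidence retrieval by cosine similarity. It is an illustration under stated assumptions, not DeReC's implementation: the 3-dimensional vectors stand in for real text embeddings, the names `cosine` and `retrieve` are hypothetical, and the downstream classifier is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(claim_vec, evidence, k=2):
    """Return the ids of the k evidence items most similar to the claim."""
    ranked = sorted(evidence,
                    key=lambda eid: cosine(claim_vec, evidence[eid]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings (hypothetical values, not produced by any real encoder).
evidence = {
    "e1": [0.9, 0.1, 0.0],
    "e2": [0.0, 1.0, 0.0],
    "e3": [0.7, 0.3, 0.1],
}
claim = [1.0, 0.2, 0.0]
top = retrieve(claim, evidence, k=2)
print(top)
```

In a full pipeline the retrieved passages would be passed, together with the claim, to a separate classifier; no autoregressive generation is involved at any point, which is where the efficiency argument comes from.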
Details
Motivation: Existing LLM-based fact verification methods are computationally expensive and prone to hallucination, which makes them hard to deploy in practice, so more efficient and reliable methods are needed.
Method: The DeReC framework uses dense retrieval to fetch relevant evidence from a knowledge base and a specialized classifier to make the judgment, avoiding explanations generated by an autoregressive LLM.
Result: On RAWFC, DeReC reaches an F1 score of 65.58%, above the previous best method L-Defense (61.20%); runtime is about 95% lower than the LLM approach on RAWFC (from roughly 454 minutes to roughly 23 minutes) and 92% lower on LIAR-RAW.
Conclusion: Carefully engineered retrieval-based systems can match or exceed LLM performance on specialized tasks while being more practical to deploy.
Abstract: The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
[44] Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning
Mohammad Atif Quamar,Mohammad Areeb
Main category: cs.CL
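TL;DR: LEASH is a training-free decoding algorithm that adaptively stops rationale generation by monitoring the slope of token-level entropy and the improvement in the top-logit margin. It cuts token usage by 30-35% and latency by 27%, at the cost of a 10 percentage point accuracy drop relative to CoT.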
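As a rough illustration of the stopping rule summarized above, the sketch below tracks per-token entropy and top-logit margin and stops once both have plateaued. The window size, the thresholds, and the function names (`entropy`, `top_margin`, `should_stop`) are assumptions made for this example; the entry does not give LEASH's exact criteria.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_margin(logits):
    """Gap between the largest and second-largest logit."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second

def should_stop(entropies, margins, window=3, eps_slope=0.01, eps_gain=0.01):
    """Stop when, over the last `window` steps, the entropy slope is near
    zero and the margin has stopped improving (both thresholds assumed)."""
    if len(entropies) < window or len(margins) < window:
        return False
    slope = (entropies[-1] - entropies[-window]) / (window - 1)
    gain = margins[-1] - margins[-window]
    return abs(slope) < eps_slope and gain < eps_gain

# Hypothetical per-step traces recorded while decoding a rationale.
entropies = [2.0, 1.0, 0.5, 0.5, 0.5]
margins = [0.1, 1.0, 2.0, 2.0, 2.0]
print(should_stop(entropies[:3], margins[:3]))  # still changing
print(should_stop(entropies, margins))          # both signals flat
```

Requiring both signals to flatten is the more conservative choice: entropy alone can plateau while the model is still separating its top candidates.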
Details
Motivation: Conventional chain-of-thought (CoT) prompting produces redundant reasoning steps when LLMs tackle complex reasoning, which wastes compute, so a more efficient way to terminate rationale generation is needed.
Method: LEASH, a logit-entropy adaptive stopping heuristic, monitors the slope of token-level entropy and the improvement in the top-logit margin and terminates rationale generation once both signals plateau.
Result: On GSM8K and AQuA-RAT with four instruction-tuned models, LEASH reduces average token generation by 30-35% and latency by 27%, with a 10 percentage point accuracy drop relative to CoT.
Conclusion: LEASH is a model-agnostic alternative that needs no additional training and improves reasoning efficiency substantially where the accuracy loss is acceptable.
Abstract: Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates the generation once both signals plateau, indicating the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30-35% and latency by 27%, while incurring a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
cs.CV [Back]
[45] LoRA-Edge: Tensor-Train-Assisted LoRA for Practical CNN Fine-Tuning on Edge Devices
Hyunseok Kwak,Kyeongwon Lee,Jae-Jin Lee,Woojoo Lee
Main category: cs.CV
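TL;DR: LoRA-Edge is a parameter-efficient fine-tuning method for edge devices. It builds on Low-Rank Adaptation (LoRA) with tensor decomposition and enables efficient on-device fine-tuning of convolutional neural networks while updating very few parameters.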
Details
Motivation: Full fine-tuning on edge devices is infeasible under memory, compute, and energy limits, yet models must adapt to domain shift in applications such as human activity recognition, so an efficient, structure-aligned fine-tuning scheme is needed.
Method: LoRA-Edge applies Tensor-Train Singular Value Decomposition (TT-SVD) to pre-trained convolutional layers, selectively updates only the output-side core with zero-initialization, and finally fuses the update back into the original convolution kernels, leaving inference cost unchanged.
Result: Across several HAR datasets and CNN backbones, LoRA-Edge comes within 4.7% of full fine-tuning while updating at most 1.49% of the parameters, converges 1.4-3.8x faster, and outperforms existing PEFT baselines.
Conclusion: LoRA-Edge delivers structure-aligned, parameter-efficient CNN fine-tuning at the edge and makes continual adaptation on resource-constrained devices practical.
Abstract: On-device fine-tuning of CNNs is essential to withstand domain shift in edge applications such as Human Activity Recognition (HAR), yet full fine-tuning is infeasible under strict memory, compute, and energy budgets. We present LoRA-Edge, a parameter-efficient fine-tuning (PEFT) method that builds on Low-Rank Adaptation (LoRA) with tensor-train assistance. LoRA-Edge (i) applies Tensor-Train Singular Value Decomposition (TT-SVD) to pre-trained convolutional layers, (ii) selectively updates only the output-side core with zero-initialization to keep the auxiliary path inactive at the start, and (iii) fuses the update back into dense kernels, leaving inference cost unchanged. This design preserves convolutional structure and reduces the number of trainable parameters by up to two orders of magnitude compared to full fine-tuning. Across diverse HAR datasets and CNN backbones, LoRA-Edge achieves accuracy within 4.7% of full fine-tuning while updating at most 1.49% of parameters, consistently outperforming prior parameter-efficient baselines under similar budgets. On a Jetson Orin Nano, TT-SVD initialization and selective-core training yield 1.4-3.8x faster convergence to target F1. LoRA-Edge thus makes structure-aligned, parameter-efficient on-device CNN adaptation practical for edge platforms.
[46] SILVI: Simple Interface for Labeling Video Interactions
Ozan Kanbertay,Richard Vogg,Elif Karakoc,Peter M. Kappeler,Claudia Fichtel,Alexander S. Ecker
Main category: cs.CV
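TL;DR: SILVI is an open-source video annotation tool that can label both animal behaviors and interactions between individuals, filling a gap in existing tools at the intersection of behavioral ecology and computer vision.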
Details
Motivation: Existing open-source annotation tools cannot support individual localization and interaction labeling at the same time, which limits fine-grained analysis of social and individualized animal behavior.
Method: The authors develop SILVI, an open-source annotation tool that integrates behavior labeling with individual localization, supports annotating behaviors and interactions directly in video, and produces structured output for training and validating computer vision models.
Result: SILVI supports efficient annotation of animal behaviors and interactions and the extraction of dynamic scene graphs, and it applies to animal behavior research as well as to annotating human interactions in video more broadly.
Conclusion: SILVI closes a tooling gap between behavioral ecology and computer vision, supports the development of automated fine-grained behavioral analysis, and has broad potential use.
Abstract: Computer vision methods are increasingly used for the automated analysis of large volumes of video data collected through camera traps, drones, or direct observations of animals in the wild. While recent advances have focused primarily on detecting individual actions, much less work has addressed the detection and annotation of interactions -- a crucial aspect for understanding social and individualized animal behavior. Existing open-source annotation tools support either behavioral labeling without localization of individuals, or localization without the capacity to capture interactions. To bridge this gap, we present SILVI, an open-source labeling software that integrates both functionalities. SILVI enables researchers to annotate behaviors and interactions directly within video data, generating structured outputs suitable for training and validating computer vision models. By linking behavioral ecology with computer vision, SILVI facilitates the development of automated approaches for fine-grained behavioral analyses. Although developed primarily in the context of animal behavior, SILVI could be useful more broadly to annotate human interactions in other videos that require extracting dynamic scene graphs. The software, along with documentation and download instructions, is available at: https://gitlab.gwdg.de/kanbertay/interaction-labelling-app.
[47] Noise Injection: Improving Out-of-Distribution Generalization for Limited Size Datasets
Duong Mai,Lawrence Hall
Main category: cs.CV
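TL;DR: This study examines injecting basic noise types (Gaussian, speckle, and others) during training to improve how well deep learning models generalize across data distributions. For COVID-19 detection from chest X-rays, it substantially narrows the performance gap between in-distribution and out-of-distribution data.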
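The four noise types named in this entry (Gaussian, speckle, Poisson, salt and pepper) can be sketched as simple per-pixel augmentations. This is a minimal illustration that assumes a flattened grayscale image with intensities in [0, 1]; the parameter values (`sigma`, `peak`, `amount`) and function names are placeholders chosen for the example, not the authors' settings.

```python
import math
import random

def _clip(p):
    """Keep a pixel intensity inside [0, 1]."""
    return min(1.0, max(0.0, p))

def gaussian_noise(img, sigma=0.05, rng=random):
    """Additive noise: p + N(0, sigma^2)."""
    return [_clip(p + rng.gauss(0.0, sigma)) for p in img]

def speckle_noise(img, sigma=0.05, rng=random):
    """Multiplicative noise: p + p * N(0, sigma^2)."""
    return [_clip(p + p * rng.gauss(0.0, sigma)) for p in img]

def _poisson_sample(lam, rng):
    """Knuth's algorithm; adequate for the small rates used here."""
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        prod *= rng.random()
        if prod <= limit:
            return k - 1

def poisson_noise(img, peak=30, rng=random):
    """Shot noise: resample each pixel as a Poisson count with mean p * peak."""
    return [_clip(_poisson_sample(p * peak, rng) / peak) for p in img]

def salt_and_pepper_noise(img, amount=0.05, rng=random):
    """Set a random fraction of pixels to 0 (pepper) or 1 (salt)."""
    out = []
    for p in img:
        if rng.random() < amount:
            out.append(1.0 if rng.random() < 0.5 else 0.0)
        else:
            out.append(p)
    return out

rng = random.Random(0)  # fixed seed so the example is repeatable
image = [0.0, 0.25, 0.5, 0.75, 1.0]
noisy = {
    "gaussian": gaussian_noise(image, rng=rng),
    "speckle": speckle_noise(image, rng=rng),
    "poisson": poisson_noise(image, rng=rng),
    "salt_and_pepper": salt_and_pepper_noise(image, amount=0.4, rng=rng),
}
```

During training, one of these functions would typically be applied at random to each input batch, so that the model cannot rely on source-specific pixel statistics.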
Details
Motivation: Deep learning models for image recognition generalize poorly to data from different devices or populations. In COVID-19 detection from CXRs in particular, models tend to rely on source-specific artifacts in the training data instead of genuine biomarkers, so performance drops on out-of-distribution data from new clinical sources.
Method: Several basic noise types (Gaussian, speckle, Poisson, and salt-and-pepper) are injected during training to make the model more robust to distribution shift.
Result: The method reduces the performance gap between in-distribution and out-of-distribution data from 0.10-0.20 to 0.01-0.06 (averaged over ten random seeds across AUC, F1, accuracy, recall, and specificity).
Conclusion: Noise injection is a simple, effective strategy for improving cross-distribution generalization in medical imaging and helps mitigate the performance drop caused by distribution differences.
Abstract: Deep learned (DL) models for image recognition have been shown to fail to generalize to data from different devices, populations, etc. COVID-19 detection from Chest X-rays (CXRs), in particular, has been shown to fail to generalize to out-of-distribution (OOD) data from new clinical sources not covered in the training set. This occurs because models learn to exploit shortcuts (source-specific artifacts that do not translate to new distributions) rather than reasonable biomarkers to maximize performance on in-distribution (ID) data. To make models more robust to distribution shifts, our study investigates the use of fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) during training. Our empirical results demonstrate that this technique can significantly reduce the performance gap between ID and OOD evaluation from 0.10-0.20 to 0.01-0.06, based on results averaged over ten random seeds across key metrics such as AUC, F1, accuracy, recall and specificity. Our source code is publicly available at https://github.com/Duongmai127/Noisy-ood
[48] Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures
Florence Klitzner,Blanca Inigo,Benjamin D. Killeen,Lalithkumar Seenivasan,Michelle Song,Axel Krieger,Mathias Unberath
Main category: cs.CV
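TL;DR: This study examines the opportunities and challenges of imitation learning policies for bi-plane X-ray-guided cannula insertion. The authors build a highly realistic in silico simulation sandbox and train imitation learning policies on visual information alone, reaching a 68.5% first-attempt success rate with robust behavior on complex anatomy and varied initializations, although entry point precision remains limited.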
Details
Motivation: Because multi-view X-ray interpretation is complex, it is unclear whether imitation-learning-based robot control policies apply to X-ray-guided spine procedures; this paper explores the feasibility and potential of the approach for such procedures.
Method: The authors develop a highly realistic in silico simulation sandbox, build a dataset of correct trajectories with the corresponding bi-planar X-ray sequences, and train imitation learning policies on visual input for planning and open-loop control.
Result: The policy succeeds on the first attempt in 68.5% of cases, keeps safe intra-pedicular trajectories across vertebral levels, generalizes to complex anatomy such as fractures, and produces plausible trajectories on real X-ray images, although entry point precision is still insufficient.
Conclusion: Imitation learning shows promise for X-ray-guided spine procedures, especially for lightweight, CT-free intra-operative navigation, but closed-loop feedback and stronger priors are needed to improve precision and practicality.
Abstract: Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation. This is because interpretation of multi-view X-rays is complex. We examine opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy generalized to complex anatomy, including fractures, and remained robust to varied initializations. Rollouts on real bi-planar X-rays further suggest that the model can produce plausible trajectories, despite training exclusively in simulation. While these preliminary results are promising, we also identify limitations, especially in entry point precision. Full closed-loop control will require additional considerations around how to provide sufficiently frequent feedback. With more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.
[49] Desert Waste Detection and Classification Using Data-Based and Model-Based Enhanced YOLOv12 DL Model
Abdulmumin Sa'ad,Sulaimon Oyeniyi Adebayo,Abdul Jabbar Siddiqui
Main category: cs.CV
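TL;DR: Proposes a real-time waste detection framework based on a lightweight YOLOv12 with self-adversarial training, suited to drones that must detect many kinds of waste efficiently and accurately in harsh environments such as deserts.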
Details
Motivation: Traditional waste collection is inefficient and hazardous in remote or harsh environments, and existing research concentrates on urban settings and recyclables while overlooking organic and hazardous waste and terrains such as deserts.
Method: A pruned, lightweight YOLOv12 model is combined with Self-Adversarial Training (SAT) and specialized data augmentation, and is trained and validated on the DroneTrashNet dataset.
Result: Precision, recall, and mAP all improve significantly, with low latency and a compact model size suited to resource-constrained drones; the model outperforms other lightweight YOLO variants.
Conclusion: Combining data-centric and model-centric improvements makes real-time waste detection in desert environments more robust and practical.
Abstract: The global waste crisis is escalating, with solid waste generation expected to increase by 70% by 2050. Traditional waste collection methods, particularly in remote or harsh environments like deserts, are labor-intensive, inefficient, and often hazardous. Recent advances in computer vision and deep learning have opened the door to automated waste detection systems, yet most research focuses on urban environments and recyclable materials, overlooking organic and hazardous waste and underexplored terrains such as deserts. In this work, we propose an enhanced real-time object detection framework based on a pruned, lightweight version of YOLOv12 integrated with Self-Adversarial Training (SAT) and specialized data augmentation strategies. Using the DroneTrashNet dataset, we demonstrate significant improvements in precision, recall, and mean average precision (mAP), while achieving low latency and compact model size suitable for deployment on resource-constrained aerial drones. Benchmarking our model against state-of-the-art lightweight YOLO variants further highlights its optimal balance of accuracy and efficiency. Our results validate the effectiveness of combining data-centric and model-centric enhancements for robust, real-time waste detection in desert environments.
[50] Improving Diagnostic Performance on Small and Imbalanced Datasets Using Class-Based Input Image Composition
Hlali Azzeddine,Majid Ben Yakhlef,Soulaiman El Hazzat
Main category: cs.CV
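TL;DR: This paper proposes Class-Based Image Composition, which fuses images of the same class into composite images (CoImg) to improve deep learning performance on small, class-imbalanced data, and reports markedly higher accuracy and robustness on an OCT retinal image dataset.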
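The composition step can be pictured with a small sketch that groups images by class and stacks three same-class images into one composite. Several things here are assumptions made for illustration: the "3x1 layout" mentioned in the abstract is read as three images stacked vertically, images are plain lists of pixel rows, and leftover images that do not fill a group of three are dropped. How the authors balance the classes is not described in this entry and is not modeled.

```python
from collections import defaultdict

def compose_3x1(images):
    """Stack three equal-width images (lists of pixel rows) vertically."""
    if len(images) != 3:
        raise ValueError("expected exactly three images")
    widths = {len(row) for img in images for row in img}
    if len(widths) != 1:
        raise ValueError("all images must share the same width")
    return [list(row) for img in images for row in img]

def build_composites(samples):
    """Turn (label, image) pairs into (label, composite) pairs, using only
    images of the same class inside each composite."""
    by_class = defaultdict(list)
    for label, img in samples:
        by_class[label].append(img)
    composites = []
    for label, imgs in by_class.items():
        # Step through full groups of three; incomplete groups are skipped.
        for i in range(0, len(imgs) - 2, 3):
            composites.append((label, compose_3x1(imgs[i:i + 3])))
    return composites

def tile(value):
    """Hypothetical 2x2 image filled with one intensity."""
    return [[value, value], [value, value]]

samples = [("A", tile(1)), ("A", tile(2)), ("B", tile(9)),
           ("A", tile(3)), ("A", tile(4)), ("B", tile(8))]
composites = build_composites(samples)
print(len(composites), composites[0][0], len(composites[0][1]))
```

Each composite exposes the model to three views of one class at once, which is the mechanism the abstract credits for higher intra-class variance per training sample.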
Details
Motivation: Small samples, class imbalance, and poor image quality lead to high false prediction rates in deep learning models, so training samples need higher information density and more intra-class diversity.
Method: The proposed Class-Based Image Composition (CoImg) fuses several images of the same class into a 3x1-layout composite image; the authors build the class-balanced Co-OCTDL dataset and evaluate it with a VGG16 model.
Result: On the OCTDL dataset, the model trained with CoImg reaches 99.6% accuracy, an F1-score of 0.995, and an AUC of 0.9996, clearly better than the baseline trained on the original dataset, with a much lower false prediction rate.
Conclusion: The method improves diagnostic performance under small-sample and class-imbalanced conditions and suits weak datasets such as those common in medical imaging.
Abstract: Small, imbalanced datasets and poor input image quality can lead to high false prediction rates with deep learning models. This paper introduces Class-Based Image Composition, an approach that allows us to reformulate training inputs through a fusion of multiple images of the same class into combined visual composites, named Composite Input Images (CoImg). This enhances intra-class variance, improves the information density per training sample, and increases the ability of the model to distinguish between subtle disease patterns. Our method was evaluated on the Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods (OCTDL) (Kulyabin et al., 2024), which contains 2,064 high-resolution optical coherence tomography (OCT) scans of the human retina, representing seven distinct diseases with a significant class imbalance. We constructed a perfectly class-balanced version of this dataset, named Co-OCTDL, where each scan is represented as a 3x1 layout composite image. To assess the effectiveness of this new representation, we conducted a comparative analysis between the original dataset and its variant using a VGG16 model. A fair comparison was ensured by utilizing the identical model architecture and hyperparameters for all experiments. The proposed approach markedly improved diagnostic results. The enhanced dataset achieved near-perfect accuracy (99.6%) with F1-score (0.995) and AUC (0.9996), compared to a baseline model trained on the raw dataset. The false prediction rate was also significantly lower, which demonstrates that the method can produce high-quality predictions even for weak datasets affected by class imbalance or small sample size.
[51] I Detect What I Don't Know: Incremental Anomaly Learning with Stochastic Weight Averaging-Gaussian for Oracle-Free Medical Imaging
Nand Kumar Yadav,Rodrigue Rizk,William CW Chen,KC Santosh
Main category: cs.CV
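TL;DR: Proposes a label-free, oracle-free framework that incrementally expands a set of normal samples for unknown anomaly detection in medical images, using lightweight adapter updates and uncertainty-gated admission to detect anomalies efficiently and accurately.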
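The admission rule described in this entry (a k-NN distance gate plus an uncertainty gate) can be sketched as follows. This is a simplified illustration under assumptions: embeddings are short lists of floats, the z-score is calibrated on scores from the verified seed set and applied one-sidedly, and the SWAG-based epistemic uncertainty is passed in as a precomputed number, since estimating it is outside the scope of the example. The names and thresholds (`knn_score`, `admit`, `z_max`, `u_max`) are hypothetical.

```python
import math
import statistics

def knn_score(x, coreset, k=3):
    """Anomaly score: mean Euclidean distance to the k nearest coreset items."""
    dists = sorted(math.dist(x, c) for c in coreset)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)

def admit(x, coreset, seed_scores, uncertainty, z_max=2.0, u_max=0.1, k=3):
    """Admit x into the normal memory only if both gates pass:
    (1) its k-NN score is within z_max standard deviations of the seed scores,
    (2) its (externally estimated) epistemic uncertainty is at most u_max."""
    mu = statistics.mean(seed_scores)
    sd = statistics.stdev(seed_scores)
    if sd == 0:
        return False  # no spread to calibrate against; stay conservative
    z = (knn_score(x, coreset, k) - mu) / sd
    return z <= z_max and uncertainty <= u_max

# Hypothetical 2-d embeddings of verified normal samples.
coreset = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seed_scores = [0.9, 1.0, 1.1]  # k-NN scores observed on the seed set (assumed)

print(admit([0.5, 0.5], coreset, seed_scores, uncertainty=0.05))  # near, confident
print(admit([5.0, 5.0], coreset, seed_scores, uncertainty=0.05))  # far from coreset
print(admit([0.5, 0.5], coreset, seed_scores, uncertainty=0.50))  # too uncertain
```

Requiring both gates is what guards against drift: a sample that merely looks close in embedding space is still rejected when the model is unsure about it.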
Details
Motivation: 由于异常样本标注稀缺且专家监督成本高,医学图像中未知异常检测面临挑战,现有方法依赖生成模型或重放缓冲区,存在计算开销大或漂移风险。 Method: 基于冻结的预训练视觉骨干网络,添加小型卷积适配器进行快速领域自适应;维护一个紧凑的coreset存储特征嵌入,采用k近邻进行异常评分;通过z-score距离阈值和SWAG估计的认知不确定性双重门控机制控制正常样本库的增量扩展。 Result: 在COVID-CXR上ROC-AUC从0.9489提升至0.9982(F1: 0.8048→0.9746);Pneumonia CXR上从0.6834升至0.8968;Brain MRI ND-5上ROC-AUC从0.6041增至0.7269,PR-AUC从0.7539升至0.8211。 Conclusion: 该框架在无监督、无oracle条件下有效提升异常检测性能,具有低计算开销、防止记忆漂移的优点,适用于真实场景下标签稀缺的医学影像应用。 Abstract: Unknown anomaly detection in medical imaging remains a fundamental challenge due to the scarcity of labeled anomalies and the high cost of expert supervision. We introduce an unsupervised, oracle-free framework that incrementally expands a trusted set of normal samples without any anomaly labels. Starting from a small, verified seed of normal images, our method alternates between lightweight adapter updates and uncertainty-gated sample admission. A frozen pretrained vision backbone is augmented with tiny convolutional adapters, ensuring rapid domain adaptation with negligible computational overhead. Extracted embeddings are stored in a compact coreset enabling efficient k-nearest neighbor anomaly (k-NN) scoring. Safety during incremental expansion is enforced by dual probabilistic gates, a sample is admitted into the normal memory only if its distance to the existing coreset lies within a calibrated z-score threshold, and its SWAG-based epistemic uncertainty remains below a seed-calibrated bound. This mechanism prevents drift and false inclusions without relying on generative reconstruction or replay buffers. Empirically, our system steadily refines the notion of normality as unlabeled data arrive, producing substantial gains over baselines. On COVID-CXR, ROC-AUC improves from 0.9489 to 0.9982 (F1: 0.8048 to 0.9746); on Pneumonia CXR, ROC-AUC rises from 0.6834 to 0.8968; and on Brain MRI ND-5, ROC-AUC increases from 0.6041 to 0.7269 and PR-AUC from 0.7539 to 0.8211. 
These results highlight the effectiveness and efficiency of the proposed framework for real-world, label-scarce medical imaging applications.
[52] Adaptive Temporal Refinement: Continuous Depth Allocation and Distance Regression for Efficient Action Localization
Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
Main category: cs.CV
TL;DR: Proposes two complementary methods, Boundary Distance Regression (BDR) and Adaptive Temporal Refinement (ATR), to improve the accuracy and computational efficiency of temporal action localization. BDR improves boundary detection through signed-distance regression, while ATR allocates computation adaptively through continuous depth selection.
Details
Motivation: Existing temporal action localization methods apply uniform computation to all boundaries and ignore the large variation in boundary difficulty, which limits both efficiency and accuracy. Method: Boundary Distance Regression (BDR) replaces classification with signed-distance regression for optimal localization; Adaptive Temporal Refinement (ATR) allocates computation dynamically through a differentiable continuous depth selection mechanism, with knowledge distillation to reduce training cost. Result: BDR yields 1.8% to 3.1% mAP@0.7 gains across architectures; on THUMOS14, ATR gains 2.9% with 18% less compute, and up to 4.2% on short actions; lightweight students retain 99% of performance through distillation. Conclusion: BDR and ATR improve boundary precision and computational efficiency in temporal action localization, apply broadly across models, and are validated on multiple benchmarks. Abstract: Temporal action localization requires precise boundary detection; however, current methods apply uniform computation despite significant variations in difficulty across boundaries. We present two complementary contributions. First, Boundary Distance Regression (BDR) provides information-theoretically optimal localization through signed-distance regression rather than classification, achieving 43% sharper boundary peaks. BDR retrofits to existing methods with approximately 50 lines of code, yielding consistent 1.8 to 3.1% mAP@0.7 improvements across diverse architectures. Second, Adaptive Temporal Refinement (ATR) allocates computation via continuous depth selection $\tau \in [0,1]$, enabling end-to-end differentiable optimization without reinforcement learning. On THUMOS14, ATR achieves 56.5% mAP@0.7 at 162G FLOPs, compared to 53.6% at 198G for uniform processing, providing a 2.9% improvement with 18% less compute. Gains scale with boundary heterogeneity, showing 4.2% improvement on short actions. Training cost is mitigated via knowledge distillation, with lightweight students retaining 99% performance at baseline cost. Results are validated across four benchmarks with rigorous statistical testing.
[53] Improving Multi-View Reconstruction via Texture-Guided Gaussian-Mesh Joint Optimization
Zhejia Cai,Puhua Jiang,Shiwei Mao,Hongkun Cao,Ruqi Huang
Main category: cs.CV
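TL;DR: Proposes a Gaussian-mesh joint optimization framework that optimizes geometry and appearance together, achieving high-quality 3D reconstruction through Gaussian-guided differentiable rendering and supporting editing tasks such as relighting and shape deformation.
Details
Motivation: Existing methods usually decouple geometry and appearance optimization, which limits downstream editing and makes it hard to achieve geometric accuracy and photorealistic rendering at once. Method: A new framework jointly optimizes mesh vertex positions, faces, and vertex colors through Gaussian-guided differentiable rendering, using photometric consistency from the input images and geometric regularization from normal and depth maps. Result: Produces high-quality 3D reconstructions that balance geometric accuracy and rendering realism, and demonstrates effectiveness on editing tasks such as relighting and shape deformation. Conclusion: Unified optimization of geometry and appearance improves the quality and editability of multi-view 3D reconstruction, with applications in AR/VR and digital content creation. Abstract: Reconstructing real-world objects from multi-view images is essential for applications in 3D editing, AR/VR, and digital content creation. Existing methods typically prioritize either geometric accuracy (Multi-View Stereo) or photorealistic rendering (Novel View Synthesis), often decoupling geometry and appearance optimization, which hinders downstream editing tasks. This paper advocates a unified treatment of geometry and appearance optimization for seamless Gaussian-mesh joint optimization. More specifically, we propose a novel framework that simultaneously optimizes mesh geometry (vertex positions and faces) and vertex colors via Gaussian-guided mesh differentiable rendering, leveraging photometric consistency from input images and geometric regularization from normal and depth maps. The obtained high-quality 3D reconstruction can be further exploited in downstream editing tasks, such as relighting and shape deformation. The code will be publicly available upon acceptance.
[54] A Linear Fractional Transformation Model and Calibration Method for Light Field Camera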
Zhong Chen,Changfeng Chen
Main category: cs.CV
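TL;DR: Proposes an intrinsic calibration method for light field cameras based on a linear fractional transformation (LFT) parameter α that decouples the main lens from the micro lens array, combining an analytical solution with nonlinear optimization to improve calibration accuracy and simulation speed.
Details
Motivation: Accurate intrinsic calibration of light field cameras is essential for 3D reconstruction, but existing methods struggle to decouple the effects of the main lens and the micro lens array, which limits accuracy. Method: Introduces the LFT parameter α to decouple the main lens and the micro lens array; initializes with a least-squares analytical solution, then applies nonlinear refinement; also presents a method for extracting features from raw images. Result: Validated on physical and simulated data; calibration accuracy is high, and simulation of light field images becomes markedly faster, which benefits data-driven deep learning methods. Conclusion: The method decouples the optical components of a light field camera, achieves accurate intrinsic calibration, and speeds up light field image simulation. Abstract: Accurate calibration of internal parameters is a crucial yet challenging prerequisite for 3D reconstruction using light field cameras. In this paper, we propose a linear fractional transformation (LFT) parameter $\alpha$ to decouple the main lens and micro lens array (MLA). The proposed method includes an analytical solution based on least squares, followed by nonlinear refinement. The method for detecting features from the raw images is also introduced. Experimental results on both physical and simulated data have verified the performance of the proposed method. Based on the proposed model, the simulation of raw light field images becomes faster, which is crucial for data-driven deep learning methods. The corresponding code can be obtained from the author's website.
[55] Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images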
Sam Bahrami,Dylan Campbell
Main category: cs.CV
TL;DR: Proposes Room Envelopes, a synthetic dataset for supervising monocular geometry estimators to predict both visible surfaces and structural layout surfaces, giving an understanding of a scene's extent and of the shape and location of its objects.
Details
Motivation: Existing scene reconstruction methods cannot recover occluded surfaces, and structural scene elements (walls, floors, ceilings) are important but understudied; these elements are typically planar and regular, hence relatively easy to predict and suited to low-cost methods. Method: Builds the synthetic Room Envelopes dataset of RGB images, each with two pointmaps: one for the visible surface and one for the true structural layout once fittings and fixtures are removed, providing direct supervision for feed-forward monocular geometry estimation. Result: The dataset supports direct supervised learning, so a model can predict both the visible surface and the structural layout surface, improving its understanding of overall scene structure. Conclusion: Room Envelopes supports research on complete scene structure prediction, particularly monocular geometry estimation that produces complete indoor layouts. Abstract: Modern scene reconstruction methods are able to accurately recover 3D surfaces that are visible in one or more images. However, this leads to incomplete reconstructions, missing all occluded surfaces. While much progress has been made on reconstructing entire objects given partial observations using generative models, the structural elements of a scene, like the walls, floors and ceilings, have received less attention. We argue that these scene elements should be relatively easy to predict, since they are typically planar, repetitive and simple, and so less costly approaches may be suitable. In this work, we present a synthetic dataset -- Room Envelopes -- that facilitates progress on this task by providing a set of RGB images and two associated pointmaps for each image: one capturing the visible surface and one capturing the first surface once fittings and fixtures are removed, that is, the structural layout. As we show, this enables direct supervision for feed-forward monocular geometry estimators that predict both the first visible surface and the first layout surface. This confers an understanding of the scene's extent, as well as the shape and location of its objects.
[56] Simple 3D Pose Features Support Human and Machine Social Scene Understanding
Wenshuo Qin,Leyla Isik
Main category: cs.CV
TL;DR: Finds that human judgments of social interaction rely on explicit 3D pose information, especially the position and direction of faces, which most current AI vision models lack; adding 3D social pose features markedly improves the social interaction recognition of existing AI models.
Details
Motivation: To understand how humans rapidly extract social interaction information from visual input, and to explain why current AI vision systems perform poorly on this task. Method: Combines state-of-the-art pose and depth estimation algorithms to extract 3D joint positions of people in video clips, derives a compact set of 3D social pose features (face position and direction only), and compares them with existing AI vision models on predicting human social judgments. Result: 3D joint positions outperform most existing AI models; the compact face-only 3D social pose features match the predictive power of the full joint set and significantly improve off-the-shelf AI models; the degree to which a model represents 3D social pose features correlates with its ability to match human judgments. Conclusion: Human social scene understanding relies on explicit 3D pose representations, especially structured spatial cues such as face direction, which current AI models lack, leaving them weak at social interaction recognition. Abstract: Humans can quickly and effortlessly extract a variety of information about others' social interactions from visual input, ranging from visuospatial cues like whether two people are facing each other to higher-level information. Yet, the computations supporting these abilities remain poorly understood, and social interaction recognition continues to challenge even the most advanced AI vision systems. Here, we hypothesized that humans rely on 3D visuospatial pose information to make social interaction judgments, which is absent in most AI vision models. To test this, we combined state-of-the-art pose and depth estimation algorithms to extract 3D joint positions of people in short video clips depicting everyday human actions and compared their ability to predict human social interaction judgments with current AI vision models. Strikingly, 3D joint positions outperformed most current AI vision models, revealing that key social information is available in explicit body position but not in the learned features of most vision models, including even the layer-wise embeddings of the pose models used to extract joint positions. To uncover the critical pose features humans use to make social judgments, we derived a compact set of 3D social pose features describing only the 3D position and direction of faces in the videos. We found that these minimal descriptors matched the predictive strength of the full set of 3D joints and significantly improved the performance of off-the-shelf AI vision models when combined with their embeddings.
Moreover, the degree to which 3D social pose features were represented in each off-the-shelf AI vision model predicted the model's ability to match human social judgments. Together, our findings provide strong evidence that human social scene understanding relies on explicit representations of 3D pose and can be supported by simple, structured visuospatial primitives.
[57] CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation
Yuwen Tao,Kanglei Zhou,Xin Tan,Yuan Xie
Main category: cs.CV
TL;DR: Proposes CaRF, a fully differentiable framework for language-guided 3D region segmentation in 3D Gaussian space; a camera-aware mechanism and a multi-view alignment training strategy markedly improve cross-view consistency and performance.
Details
Motivation: Existing methods rely on 2D rendered pseudo supervision and view-specific feature learning, fall short on cross-view consistency, and struggle with accurate language-driven 3D region localization. Method: Proposes Camera Aware Referring Field (CaRF), which includes Gaussian Field Camera Encoding (GFCE) to model view-dependent variation and In Training Paired View Supervision (ITPVS) to align per-Gaussian logits across views, optimizing cross-view consistency. Result: On the Ref LERF, LERF OVS, and 3D OVS benchmarks, mIoU improves on average by 16.8%, 4.3%, and 2.0% respectively, significantly outperforming existing methods. Conclusion: CaRF delivers more reliable, cross-view-consistent 3D scene understanding, opening possibilities for embodied AI, AR/VR interaction, and autonomous perception. Abstract: Referring 3D Gaussian Splatting Segmentation (R3DGS) aims to interpret free-form language expressions and localize the corresponding 3D regions in Gaussian fields. While recent advances have introduced cross-modal alignment between language and 3D geometry, existing pipelines still struggle with cross-view consistency due to their reliance on 2D rendered pseudo supervision and view-specific feature learning. In this work, we present Camera Aware Referring Field (CaRF), a fully differentiable framework that operates directly in the 3D Gaussian space and achieves multi-view consistency. Specifically, CaRF introduces Gaussian Field Camera Encoding (GFCE), which incorporates camera geometry into Gaussian-text interactions to explicitly model view-dependent variations and enhance geometric reasoning. Building on this, In Training Paired View Supervision (ITPVS) is proposed to align per-Gaussian logits across calibrated views during training, effectively mitigating single-view overfitting and exposing inter-view discrepancies for optimization. Extensive experiments on three representative benchmarks demonstrate that CaRF achieves average improvements of 16.8%, 4.3%, and 2.0% in mIoU over state-of-the-art methods on the Ref LERF, LERF OVS, and 3D OVS datasets, respectively. Moreover, this work promotes more reliable and view-consistent 3D scene understanding, with potential benefits for embodied AI, AR/VR interaction, and autonomous perception.
[58] PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection
Peiyao Wang,Weining Wang,Qi Li
Main category: cs.CV
TL;DR: Proposes PhysCorr, a framework for modeling, evaluating, and optimizing physical consistency in video generation; its PhysicsRM reward model and PhyDPO optimization method markedly improve the physical realism of generated videos.
Details
Motivation: Existing text-to-video models fall short on physical plausibility, with unrealistic object motion and incoherent interactions, which limits their use in embodied AI, robotics, and simulation. Method: Proposes PhysicsRM, a dual-dimensional reward model that quantifies intra-object stability and inter-object interactions; builds on it PhyDPO, a direct preference optimization pipeline that uses contrastive feedback and physics-aware reweighting to guide generation. Result: Experiments on multiple benchmarks show that PhysCorr significantly improves physical realism while preserving visual quality and semantic alignment. Conclusion: PhysCorr offers an effective route to physically plausible, trustworthy video generation and advances video generation in settings that require physical accuracy. Abstract: Recent advances in text-to-video generation have achieved impressive perceptual quality, yet generated content often violates fundamental principles of physical plausibility, manifesting as implausible object dynamics, incoherent interactions, and unrealistic motion patterns. Such failures hinder the deployment of video generation models in embodied AI, robotics, and simulation-intensive domains. To bridge this gap, we propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. On this foundation, we develop PhyDPO, a novel direct preference optimization pipeline that leverages contrastive feedback and physics-aware reweighting to guide generation toward physically coherent outputs. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones. Extensive experiments across multiple benchmarks demonstrate that PhysCorr achieves significant improvements in physical realism while preserving visual fidelity and semantic alignment. This work takes a critical step toward physically grounded and trustworthy video generation.
[59] GNN-MoE: Context-Aware Patch Routing using GNNs for Parameter-Efficient Domain Generalization
Mahmoud Soliman,Omar Abdelaziz,Ahmed Radwan,Anand,Mohamed Shehata
Main category: cs.CV
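TL;DR: Proposes GNN-MoE, which uses a graph neural network router on inter-patch graphs to dynamically assign patches to specialized experts within a Mixture-of-Experts framework built on Kronecker adapters, enabling parameter-efficient fine-tuning and strong performance on domain generalization.
Details
Motivation: Standard fine-tuning of pretrained Vision Transformers for domain generalization is costly and can harm generalization; the goal is better parameter efficiency and robustness. Method: Designs a GNN-based router (GCN, GAT, SAGE) that builds inter-patch graphs and uses their relationships for context-aware dynamic routing, assigning image patches to specialized experts, with lightweight Kronecker adapters for parameter-efficient fine-tuning. Result: Achieves state-of-the-art or competitive performance on multiple domain generalization benchmarks with high parameter efficiency, supporting the value of graph structure and contextual routing. Conclusion: GNN-driven context-aware routing improves the adaptability and efficiency of Vision Transformers across domains and suggests a new direction for lightweight domain generalization. Abstract: Domain generalization (DG) seeks robust Vision Transformer (ViT) performance on unseen domains. Efficiently adapting pretrained ViTs for DG is challenging; standard fine-tuning is costly and can impair generalization. We propose GNN-MoE, enhancing Parameter-Efficient Fine-Tuning (PEFT) for DG with a Mixture-of-Experts (MoE) framework using efficient Kronecker adapters. Instead of token-based routing, a novel Graph Neural Network (GNN) router (GCN, GAT, SAGE) operates on inter-patch graphs to dynamically assign patches to specialized experts. This context-aware GNN routing leverages inter-patch relationships for better adaptation to domain shifts. GNN-MoE achieves state-of-the-art or competitive DG benchmark performance with high parameter efficiency, highlighting the utility of graph-based contextual routing for robust, lightweight DG.
[60] MedDChest: A Content-Aware Multimodal Foundational Vision Model for Thoracic Imaging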
Mahmoud Soliman,Islam Osman,Mohamed S. Shehata,Rasika Rajapakshe
Main category: cs.CV
TL;DR: Proposes and validates MedDChest, a model designed for thoracic imaging, which markedly improves medical imaging performance through large-scale in-domain pretraining and content-aware data augmentation.
Details
Motivation: Existing vision models for medical imaging are limited by the domain gap that comes from pretraining on natural images. Method: Pretrains a ViT from scratch on more than 1.2 million chest X-ray and CT images and proposes Guided Random Resized Crops, a content-aware data augmentation strategy. Result: MedDChest significantly outperforms ImageNet-pretrained models across diverse downstream diagnostic tasks. Conclusion: Large-scale in-domain pretraining combined with domain-specific augmentation improves medical image analysis, and MedDChest provides a better feature-extraction starting point for thoracic imaging tasks. Abstract: The performance of vision models in medical imaging is often hindered by the prevailing paradigm of fine-tuning backbones pre-trained on out-of-domain natural images. To address this fundamental domain gap, we propose MedDChest, a new foundational Vision Transformer (ViT) model optimized specifically for thoracic imaging. We pre-trained MedDChest from scratch on a massive, curated, multimodal dataset of over 1.2 million images, encompassing different modalities including Chest X-ray and Computed Tomography (CT) compiled from 10 public sources. A core technical contribution of our work is Guided Random Resized Crops, a novel content-aware data augmentation strategy that biases sampling towards anatomically relevant regions, overcoming the inefficiency of standard cropping techniques on medical scans. We validate our model's effectiveness by fine-tuning it on a diverse set of downstream diagnostic tasks. Comprehensive experiments empirically demonstrate that MedDChest significantly outperforms strong, publicly available ImageNet-pretrained models. By establishing the superiority of large-scale, in-domain pre-training combined with domain-specific data augmentation, MedDChest provides a powerful and robust feature extractor that serves as a significantly better starting point for a wide array of thoracic diagnostic tasks. The model weights will be made publicly available to foster future research and applications.
[61] Near-Lossless 3D Voxel Representation Free from Iso-surface
Yihao Luo,Xianglong He,Chuanyu Pan,Yiwen Chen,Jiaqi Wu,Yangguang Li,Wanli Ouyang,Yuanming Hu,Guang Yang,ChoonHwai Yap
Main category: cs.CV
TL;DR: Proposes Faithful Contouring, a sparse voxelized representation that supports high resolutions without water-tightening or iso-surface extraction and achieves near-lossless geometric fidelity.
Details
Motivation: Existing iso-surface-based voxelized representations of 3D meshes depend on water-tightening or rendering optimization, which degrades geometric fidelity and makes complex geometry and topology hard to represent accurately. Method: Proposes Faithful Contouring, a sparse voxelized representation that requires neither converting meshes to field functions nor iso-surface extraction; designs a dual-mode autoencoder for scalable, detail-preserving shape reconstruction. Result: Direct representation reaches distance errors at the 10^{-5} level; mesh reconstruction shows a 93% reduction in Chamfer Distance and a 35% improvement in F-score. Conclusion: Faithful Contouring outperforms existing methods in accuracy and efficiency, better preserves sharp features and internal structures, and suits 3D reconstruction and generation tasks. Abstract: Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing representations based on iso-surfaces rely heavily on water-tightening or rendering optimization, which inevitably compromises geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the iso-surface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the $10^{-5}$ level; for mesh reconstruction, it yields a 93% reduction in Chamfer Distance and a 35% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.
[62] A Hybrid Deep Learning Model for Robust Biometric Authentication from Low-Frame-Rate PPG Signals
Arfina Rahman,Mahesh Banavar
Main category: cs.CV
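TL;DR: Proposes a lightweight biometric authentication framework based on PPG signals extracted from low-frame-rate fingertip videos; combining the continuous wavelet transform with the hybrid deep learning model CVT-ConvMixer-LSTM, it reaches 98% authentication accuracy on the CFIHSR dataset, with good noise robustness and cross-subject stability, and suits mobile and embedded security applications.
Details
Motivation: PPG signals offer non-invasive acquisition and inherent liveness detection for biometric authentication, but they are affected by motion artifacts, illumination changes, and inter-subject physiological variability, so feature extraction and classification must be made more robust. Method: Uses the CFIHSR dataset (46 subjects, 14 Hz sampling rate); the raw PPG signals undergo baseline drift removal, PCA-based motion artifact suppression, bandpass filtering, Fourier-based resampling, and amplitude normalization; each one-dimensional PPG segment is converted into a two-dimensional time-frequency scalogram with the continuous wavelet transform (CWT), and a hybrid model, CVT-ConvMixer-LSTM, combines spatial features from CVT and ConvMixer with temporal features from an LSTM for authentication. Result: Experiments on 46 subjects show 98% authentication accuracy, supporting the model's robustness to noise and inter-subject variability. Conclusion: The lightweight framework, which combines time-frequency analysis with a hybrid deep model, improves PPG-based biometric authentication, offers scalability and liveness detection, and suits practical mobile and embedded use. Abstract: Photoplethysmography (PPG) signals, which measure changes in blood volume in the skin using light, have recently gained attention in biometric authentication because of their non-invasive acquisition, inherent liveness detection, and suitability for low-cost wearable devices. However, PPG signal quality is challenged by motion artifacts, illumination changes, and inter-subject physiological variability, making robust feature extraction and classification crucial. This study proposes a lightweight and cost-effective biometric authentication framework based on PPG signals extracted from low-frame-rate fingertip videos. The CFIHSR dataset, comprising PPG recordings from 46 subjects at a sampling rate of 14 Hz, is employed for evaluation. The raw PPG signals undergo a standard preprocessing pipeline involving baseline drift removal, motion artifact suppression using Principal Component Analysis (PCA), bandpass filtering, Fourier-based resampling, and amplitude normalization. To generate robust representations, each one-dimensional PPG segment is converted into a two-dimensional time-frequency scalogram via the Continuous Wavelet Transform (CWT), effectively capturing transient cardiovascular dynamics. We developed a hybrid deep learning model, termed CVT-ConvMixer-LSTM, by combining spatial features from the Convolutional Vision Transformer (CVT) and ConvMixer branches with temporal features from a Long Short-Term Memory network (LSTM).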
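The step that turns a one-dimensional PPG segment into a two-dimensional scalogram can be illustrated with a small NumPy-only continuous wavelet transform. This is a sketch under stated assumptions: the abstract does not name the wavelet, so the complex Morlet wavelet, the `w0 = 6` setting, the frequency grid, and the function name `morlet_scalogram` are all illustrative choices, and the input is a synthetic sine wave sampled at the dataset's 14 Hz rather than real PPG data.

```python
import numpy as np


def morlet_scalogram(signal, fs, freqs, w0=6.0):
    """Return |CWT| with shape (len(freqs), len(signal)) using a complex Morlet wavelet."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    out = np.empty((len(freqs), n))
    for i, f in enumerate(freqs):
        s = w0 / (2 * np.pi * f)             # scale (seconds) whose centre frequency is f
        half = int(np.ceil(4 * s * fs))      # truncate the wavelet at +/- 4 scales
        t = np.arange(-half, half + 1) / fs
        wavelet = np.exp(1j * w0 * t / s) * np.exp(-0.5 * (t / s) ** 2)
        wavelet /= np.sqrt(s * fs)           # energy-style normalisation across scales
        # Correlation with the wavelet, written as convolution with its reversed conjugate
        full = np.convolve(signal, np.conj(wavelet)[::-1], mode="full")
        out[i] = np.abs(full[half:half + n])  # centre crop back to the signal length
    return out


fs = 14.0                                    # sampling rate quoted for CFIHSR
t = np.arange(0, 20, 1 / fs)
ppg = np.sin(2 * np.pi * 1.2 * t)            # synthetic 72 bpm pulse, not real data
freqs = np.linspace(0.5, 4.0, 36)            # assumed band of interest, in Hz
scalogram = morlet_scalogram(ppg, fs, freqs)
peak_freq = freqs[scalogram.mean(axis=1).argmax()]
print(scalogram.shape, peak_freq)
```

The resulting array can be treated as a single-channel image, which is what lets image-style branches such as a vision transformer or ConvMixer consume a physiological time series.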
The experimental results on 46 subjects demonstrate an authentication accuracy of 98%, validating the robustness of the model to noise and variability between subjects. Due to its efficiency, scalability, and inherent liveness detection capability, the proposed system is well-suited for real-world mobile and embedded biometric security applications.
[63] Unveiling Deep Semantic Uncertainty Perception for Language-Anchored Multi-modal Vision-Brain Alignment
Zehui Feng,Chenqi Zhang,Mingru Wang,Minuo Wei,Shiwei Cheng,Cuntai Guan,Ting Han
Main category: cs.CV
TL;DR: Proposes Bratrix, the first end-to-end framework for language-anchored vision-brain alignment; by decoupling visual stimuli and adding an uncertainty perception module, it markedly improves retrieval, reconstruction, and captioning on EEG, MEG, and fMRI tasks.
Details
Motivation: Existing methods align neural activity directly with visual embeddings but lack semantic interpretability and robustness, and cope poorly with subject variability and noisy signals. Method: Bratrix decouples visual stimuli into hierarchical visual and linguistic semantic components, projects visual and brain signals into a shared latent space using language-anchored semantic matrices and an uncertainty perception module, and trains in two stages: single-modality pretraining followed by multimodal fine-tuning. Result: On EEG, MEG, and fMRI benchmarks, Bratrix outperforms existing methods on retrieval, reconstruction, and captioning, with a 14.3% gain on the 200-way EEG retrieval task. Conclusion: Language-anchored multimodal alignment and uncertainty modeling improve the accuracy and robustness of decoding visual semantics from neural signals. Abstract: Unveiling visual semantics from neural signals such as EEG, MEG, and fMRI remains a fundamental challenge due to subject variability and the entangled nature of visual features. Existing approaches primarily align neural activity directly with visual embeddings, but visual-only representations often fail to capture latent semantic dimensions, limiting interpretability and deep robustness. To address these limitations, we propose Bratrix, the first end-to-end framework to achieve multimodal Language-Anchored Vision-Brain alignment. Bratrix decouples visual stimuli into hierarchical visual and linguistic semantic components, and projects both visual and brain representations into a shared latent space, enabling the formation of aligned visual-language and brain-language embeddings. To emulate human-like perceptual reliability and handle noisy neural signals, Bratrix incorporates a novel uncertainty perception module that applies uncertainty-aware weighting during alignment. By leveraging learnable language-anchored semantic matrices to enhance cross-modal correlations and employing a two-stage training strategy of single-modality pretraining followed by multimodal fine-tuning, Bratrix-M improves alignment precision. Extensive experiments on EEG, MEG, and fMRI benchmarks demonstrate that Bratrix improves retrieval, reconstruction, and captioning performance compared to state-of-the-art methods, notably surpassing them by 14.3% on the 200-way EEG retrieval task. Code and model are available.
[64] Adversarial and Score-Based CT Denoising: CycleGAN vs Noise2Score
Abu Hanif Muhammad Syarubany
Main category: cs.CV
TL;DR: Studies two training-data-efficient paradigms for CT image denoising in the unpaired and self-supervised regimes: a CycleGAN-based residual translator and a Noise2Score (N2S) score-matching denoiser. CycleGAN gives the best final image quality, while Noise2Score denoises strongly when clean paired data are unavailable.
Details
Motivation: To explore efficient CT denoising methods when paired training data are unavailable, in order to improve medical image quality. Method: Compares a CycleGAN-based residual translator and a Noise2Score score-matching denoiser under a common evaluation protocol, tuning CycleGAN through a configuration sweep and a longer training schedule. Result: CycleGAN improves the input from 34.66 dB / 0.9234 SSIM to 38.913 dB / 0.971 SSIM and scores 1.9343 on the unseen Kaggle set; Noise2Score improves very noisy inputs substantially and is robust despite slightly lower PSNR / SSIM. Conclusion: CycleGAN provides the best image quality, and Noise2Score is a strong alternative when paired data are unavailable; both have practical value. Abstract: We study CT image denoising in the unpaired and self-supervised regimes by evaluating two strong, training-data-efficient paradigms: a CycleGAN-based residual translator and a Noise2Score (N2S) score-matching denoiser. Under a common evaluation protocol, a configuration sweep identifies a simple standard U-Net backbone within CycleGAN (lambda_cycle = 30, lambda_iden = 2, ngf = ndf = 64) as the most reliable setting; we then train it to convergence with a longer schedule. The selected CycleGAN improves the noisy input from 34.66 dB / 0.9234 SSIM to 38.913 dB / 0.971 SSIM and attains an estimated score of 1.9441 and an unseen-set (Kaggle leaderboard) score of 1.9343. Noise2Score, while slightly behind in absolute PSNR / SSIM, achieves large gains over very noisy inputs, highlighting its utility when clean pairs are unavailable. Overall, CycleGAN offers the strongest final image quality, whereas Noise2Score provides a robust pair-free alternative with competitive performance. Source code is available at https://github.com/hanifsyarubany/CT-Scan-Image-Denoising-using-CycleGAN-and-Noise2Score.
[65] When Swin Transformer Meets KANs: An Improved Transformer Architecture for Medical Image Segmentation
Nishchal Sapkota,Haoyan Shi,Yejia Zhang,Xianshi Ma,Bofang Zheng,Danny Z. Chen
Main category: cs.CV
TL;DR: Proposes UKAST, a medical image segmentation network that combines the Swin Transformer with rational-function-based Kolmogorov-Arnold Networks (KANs), reducing computation while improving data efficiency and segmentation performance; it reaches state-of-the-art results on several 2D and 3D medical image datasets, especially when annotated data are scarce.
Details
Motivation: Medical image segmentation faces complex anatomy and limited annotated data; CNNs struggle to model long-range dependencies, while Transformers capture global context but are data-hungry and computationally expensive, so a more efficient, data-friendly model is needed. Method: Proposes UKAST, a U-Net-like architecture that integrates rational-function-based KANs into Swin Transformer encoders, using Group Rational KANs (GR-KANs) in place of vanilla spline-based KANs to improve expressiveness and data efficiency and reduce FLOPs with only a slight increase in parameters. Result: UKAST reaches state-of-the-art performance on four 2D and 3D medical image segmentation benchmarks, surpassing CNN and Transformer baselines, and is more robust and accurate when data are scarce. Conclusion: KAN-enhanced Transformers improve the efficiency and performance of medical image segmentation and ease the Vision Transformer's dependence on large datasets. Abstract: Medical image segmentation is critical for accurate diagnostics and treatment planning, but remains challenging due to complex anatomical structures and limited annotated training data. CNN-based segmentation methods excel at local feature extraction, but struggle with modeling long-range dependencies. Transformers, on the other hand, capture global context more effectively, but are inherently data-hungry and computationally expensive. In this work, we introduce UKAST, a U-Net-like architecture that integrates rational-function-based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders. By leveraging rational base functions and Group Rational KANs (GR-KANs) from the Kolmogorov-Arnold Transformer (KAT), our architecture addresses the inefficiencies of vanilla spline-based KANs, yielding a more expressive and data-efficient framework with reduced FLOPs and only a very small increase in parameter count compared to SwinUNETR. UKAST achieves state-of-the-art performance on four diverse 2D and 3D medical image segmentation benchmarks, consistently surpassing both CNN- and Transformer-based baselines. Notably, it attains superior accuracy in data-scarce settings, alleviating the data-hungry limitations of standard Vision Transformers. These results show the potential of KAN-enhanced Transformers to advance data-efficient medical image segmentation. Code is available at: https://github.com/nsapkota417/UKAST
[66] SpatialLock: Precise Spatial Control in Text-to-Image Synthesis
Biao Liu,Yuanzhi Liang
Main category: cs.CV
TL;DR: Proposes SpatialLock, a framework for precise control of object positions in text-to-image generation; by combining perception signals and grounding information, it achieves IoU scores above 0.9 on multiple datasets, the current best performance.
Details
Motivation: Existing text-to-image methods underuse positional information and struggle to control the spatial layout of objects precisely. Method: SpatialLock has two components, Position-Engaged Injection (PoI) and Position-Guided Learning (PoG): PoI integrates spatial information directly through an attention layer, and PoG uses perception-based supervision to refine object localization. Result: Achieves IoU scores above 0.9 on multiple datasets, improving object localization accuracy and the visual quality of generated images. Conclusion: SpatialLock addresses inaccurate object placement in text-to-image generation and reaches state-of-the-art precision by jointly controlling the generation of spatial locations. Abstract: Text-to-Image (T2I) synthesis has made significant advancements in recent years, driving applications such as generating datasets automatically. However, precise control over object localization in generated images remains a challenge. Existing methods fail to fully utilize positional information, leading to an inadequate understanding of object spatial layouts. To address this issue, we propose SpatialLock, a novel framework that leverages perception signals and grounding information to jointly control the generation of spatial locations. SpatialLock incorporates two components: Position-Engaged Injection (PoI) and Position-Guided Learning (PoG). PoI directly integrates spatial information through an attention layer, encouraging the model to learn the grounding information effectively. PoG employs perception-based supervision to further refine object localization. Together, these components enable the model to generate objects with precise spatial arrangements and improve the visual quality of the generated images. Experiments show that SpatialLock sets a new state-of-the-art for precise object positioning, achieving IoU scores above 0.9 across multiple datasets.
[67] Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration
Yunghee Lee,Byeonghyun Pak,Junwha Hong,Hoseong Kim
Main category: cs.CV
TL;DR: Proposes Tortoise and Hare Guidance (THG), a training-free strategy for accelerating diffusion sampling that reduces computation through a multirate ODE system and improves efficiency while preserving generation quality.
Details
Motivation: Diffusion models need many function evaluations during sampling, which makes inference slow, and existing acceleration methods struggle to combine high-fidelity generation with low computational cost. Method: Reformulates the classifier-free guidance (CFG) ODE as a multirate system of ODEs and analyzes how differently the noise estimate and the additional guidance term respond to numerical error; THG then updates the noise estimate on the fine-grained timestep grid (the tortoise equation) and the additional guidance term on a coarse grid (the hare equation), and adds an error-bound-aware timestep sampler and a guidance-scale scheduler. Result: THG cuts the number of function evaluations (NFE) by up to 30% with almost no loss in quality (ΔImageReward ≤ 0.032) and outperforms state-of-the-art training-free CFG accelerators at the same compute budget. Conclusion: Multirate formulations offer an effective way to accelerate diffusion solvers and enable high-quality real-time image synthesis without retraining. Abstract: In this paper, we propose Tortoise and Hare Guidance (THG), a training-free strategy that accelerates diffusion sampling while maintaining high-fidelity generation. We demonstrate that the noise estimate and the additional guidance term exhibit markedly different sensitivity to numerical error by reformulating the classifier-free guidance (CFG) ODE as a multirate system of ODEs. Our error-bound analysis shows that the additional guidance branch is more robust to approximation, revealing substantial redundancy that conventional solvers fail to exploit. Building on this insight, THG significantly reduces the computation of the additional guidance: the noise estimate is integrated with the tortoise equation on the original, fine-grained timestep grid, while the additional guidance is integrated with the hare equation only on a coarse grid. We also introduce (i) an error-bound-aware timestep sampler that adaptively selects step sizes and (ii) a guidance-scale scheduler that stabilizes large extrapolation spans. THG reduces the number of function evaluations (NFE) by up to 30% with virtually no loss in generation fidelity ($\Delta$ImageReward $\leq$ 0.032) and outperforms state-of-the-art CFG-based training-free accelerators under identical computation budgets. Our findings highlight the potential of multirate formulations for diffusion solvers, paving the way for real-time high-quality image synthesis without any model retraining. The source code is available at https://github.com/yhlee-add/THG.
[68] Text to Sketch Generation with Multi-Styles
Tengjie Li,Shikui Tu,Lei Xu
Main category: cs.CV
TL;DR: Proposes M3S, a training-free diffusion-based framework that gives precise control of sketch style through text prompts and reference sketches, reducing content leakage and supporting controllable multi-style generation.
Details
Motivation: Existing sketch generation methods lack fine control over sketch style and are prone to content leakage during style transfer. Method: Built on diffusion models, the framework incorporates reference features as auxiliary information with linear smoothing and a style-content guidance mechanism, and fuses multiple styles through a joint AdaIN module. Result: Experiments show accurate style alignment, high generation quality, and flexible style control compared with existing methods, especially when the reference and target sketches differ greatly in structure. Conclusion: M3S offers an efficient, flexible, training-free approach to styled sketch generation that improves style control and output quality. Abstract: Recent advances in vision-language models have facilitated progress in sketch generation. However, existing specialized methods primarily focus on generic synthesis and lack mechanisms for precise control over sketch styles. In this work, we propose a training-free framework based on diffusion models that enables explicit style guidance via textual prompts and referenced style sketches. Unlike previous style transfer methods that overwrite key and value matrices in self-attention, we incorporate the reference features as auxiliary information with linear smoothing and leverage a style-content guidance mechanism. This design effectively reduces content leakage from reference sketches and enhances synthesis quality, especially in cases with low structural similarity between reference and target sketches. Furthermore, we extend our framework to support controllable multi-style generation by integrating features from multiple reference sketches, coordinated via a joint AdaIN module. Extensive experiments demonstrate that our approach achieves high-quality sketch generation with accurate style alignment and improved flexibility in style control. The official implementation of M3S is available at https://github.com/CMACH508/M3S.
[69] Automated Tennis Player and Ball Tracking with Court Keypoints Detection (Hawk Eye System)
Venkata Manikanta Desu,Syed Fawaz Ali
Main category: cs.CV
TL;DR: Presents an automated, deep-learning-based tennis match analysis pipeline that integrates multiple models for player and ball detection and tracking plus court keypoint detection, producing annotated video and detailed performance metrics.
Details
Motivation: To provide a real-time, accurate, and comprehensive tennis match analysis tool that helps coaches, broadcasters, and players better understand match dynamics. Method: Uses YOLOv8 for player detection, a custom-trained YOLOv5 for ball tracking, and a ResNet50 architecture for court keypoint detection, and combines the outputs of these models for analytics. Result: The system is robust across court conditions and match scenarios and extracts metrics such as player movement patterns, ball speed, shot accuracy, and reaction time. Conclusion: The framework delivers end-to-end automated tennis match analysis with practical value for training optimization, broadcasting, and tactical analysis. Abstract: This study presents a complete pipeline for automated tennis match analysis. Our framework integrates multiple deep learning models to detect and track players and the tennis ball in real time, while also identifying court keypoints for spatial reference. Using YOLOv8 for player detection, a custom-trained YOLOv5 model for ball tracking, and a ResNet50-based architecture for court keypoint detection, our system provides detailed analytics including player movement patterns, ball speed, shot accuracy, and player reaction times. The experimental results demonstrate robust performance in varying court conditions and match scenarios. The model outputs an annotated video along with detailed performance metrics, enabling coaches, broadcasters, and players to gain actionable insights into the dynamics of the game.
[70] DMSORT: An efficient parallel maritime multi-object tracking architecture for unmanned vessel platforms
Shengyu Tang,Zeyuan Lu,Jiazhi Dong,Changdong Yu,Xiaoyu Wang,Yaohui Lyu,Weihao Xia
Main category: cs.CV
TL;DR: Proposes DMSORT, an efficient dual-branch maritime multi-object tracking method that combines a detection and re-identification branch with a camera motion estimation branch for robust tracking in complex sea conditions.
Details
Motivation: Complex maritime environments cause camera motion and visual degradation that severely hurt multi-object tracking, so maritime MOT needs greater robustness and accuracy. Method: A dual-branch parallel tracking framework: one branch uses RCDN for multi-level feature detection and Li-TAE for appearance feature extraction; the other estimates platform motion through a projective transformation and compensates for it within the Kalman filter; a clustering-optimized feature fusion module then merges motion and appearance cues. Result: Achieves state-of-the-art performance on the Singapore Maritime Dataset with the fastest runtime among ReID-based MOT frameworks, strong robustness to jitter and occlusion, and high identity consistency. Conclusion: DMSORT addresses the challenges of a moving camera at sea, balances efficiency with robustness, and suits practical maritime surveillance and navigation safety applications. Abstract: Accurate perception of the marine environment through robust multi-object tracking (MOT) is essential for ensuring safe vessel navigation and effective maritime surveillance. However, the complicated maritime environment often causes camera motion and subsequent visual degradation, posing significant challenges to MOT. To address this challenge, we propose an efficient Dual-branch Maritime SORT (DMSORT) method for maritime MOT. The core of the framework is a parallel tracker with affine compensation, which incorporates an object detection and re-identification (ReID) branch, along with a dedicated branch for dynamic camera motion estimation. Specifically, a Reversible Columnar Detection Network (RCDN) is integrated into the detection module to leverage multi-level visual features for robust object detection. Furthermore, a lightweight Transformer-based appearance extractor (Li-TAE) is designed to capture global contextual information and generate robust appearance features. Another branch decouples platform-induced and target-intrinsic motion by constructing a projective transformation, applying platform-motion compensation within the Kalman filter, and thereby stabilizing true object trajectories. Finally, a clustering-optimized feature fusion module effectively combines motion and appearance cues to ensure identity consistency under noise, occlusion, and drift. Extensive evaluations on the Singapore Maritime Dataset demonstrate that DMSORT achieves state-of-the-art performance. Notably, DMSORT attains the fastest runtime among existing ReID-based MOT frameworks while maintaining high identity consistency and robustness to jitter and occlusion.
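The platform-motion compensation described in the DMSORT abstract can be illustrated with a minimal sketch. It assumes a 3x3 projective matrix `H` that maps pixel coordinates from the previous frame to the current one, and a simplified track state `(cx, cy, vx, vy)`. The names `warp_point` and `compensate_track` are hypothetical, the covariance update of a full Kalman filter is omitted, and nothing here is taken from the authors' code.

```python
def warp_point(H, x, y):
    """Apply a 3x3 projective transformation H to the pixel (x, y)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def compensate_track(H, state):
    """Move a predicted track state into the current camera frame.

    state is (cx, cy, vx, vy): box centre and per-frame velocity in pixels.
    The velocity is re-derived by warping the end point of the velocity
    vector, so rotation and scale in H also act on it.
    Simplification (assumption): the state covariance is left untouched.
    """
    cx, cy, vx, vy = state
    new_cx, new_cy = warp_point(H, cx, cy)
    end_x, end_y = warp_point(H, cx + vx, cy + vy)
    return new_cx, new_cy, end_x - new_cx, end_y - new_cy


# Hypothetical example: the camera shifted the image by (+5, -3) pixels.
H_shift = [[1.0, 0.0, 5.0],
           [0.0, 1.0, -3.0],
           [0.0, 0.0, 1.0]]
print(compensate_track(H_shift, (10.0, 20.0, 2.0, 1.0)))
```

With a pure translation the centre moves by the camera shift while the velocity is unchanged, which is the intended decoupling of platform-induced from target-intrinsic motion.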
Code is available at: https://github.com/BiscuitsLzy/DMSORT-An-efficient-parallel-maritime-multi-object-tracking-architecture-.
[71] Learning from Online Videos at Inference Time for Computer-Use Agents
Yujian Liu,Ze Wang,Hao Chen,Ximeng Sun,Xiaodong Yu,Jialian Wu,Jiang Liu,Emad Barsoum,Zicheng Liu,Shiyu Chang
Main category: cs.CV
TL;DR: Proposes a framework that lets computer-use agents learn from online videos at inference time: it retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects the best trajectory as in-context guidance, substantially improving agent performance on complex tasks.
Details
Motivation: Computer-use agents still lag behind humans on tasks that require domain-specific procedural knowledge, whereas humans can learn quickly by watching video tutorials. Enabling agents to use online videos effectively at inference time is therefore an important problem. Method: A framework with three parts: video retrieval and filtering, conversion of videos into structured demonstration trajectories, and dynamic selection of trajectories as in-context guidance. A vision-language model (VLM) infers UI actions and segments each video into action subsequences with textual objectives; at inference time a two-stage mechanism selects the most relevant trajectory. Result: On two widely used benchmarks the method consistently outperforms strong base agents and variants that use only textual tutorials or transcripts, confirming the importance of trajectory segmentation and selection, action filtering, and visual information. Conclusion: Online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time, opening a new route for agents to learn on the fly. Abstract: Computer-use agents can operate computers and automate laborious tasks, but despite recent rapid progress, they still lag behind human users, especially when tasks require domain-specific procedural knowledge about particular applications, platforms, and multi-step workflows. Humans can bridge this gap by watching video tutorials: we search, skim, and selectively imitate short segments that match our current subgoal. In this paper, we study how to enable computer-use agents to learn from online videos at inference time effectively. We propose a framework that retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance during execution. In particular, using a VLM, we infer UI actions, segment videos into short subsequences of actions, and assign each subsequence a textual objective. At inference time, a two-stage selection mechanism dynamically chooses a single trajectory to add in context at each step, focusing the agent on the most helpful local guidance for its next decision. Experiments on two widely used benchmarks show that our framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts. Analyses highlight the importance of trajectory segmentation and selection, action filtering, and visual information, suggesting that abundant online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time.
Our code is available at https://github.com/UCSB-NLP-Chang/video_demo.
[72] Seeing Straight: Document Orientation Detection for Efficient OCR
Suranjan Goswami,Abhinav Ravi,Raja Kolla,Ali Faraz,Shaharukh Khan,Akash,Chandra Khatri,Shubham Agarwal
Main category: cs.CV
TL;DR: Introduces OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotation, and builds an efficient rotation classification pipeline based on Phi-3.5-Vision that markedly improves OCR performance in realistic settings.
Details
Motivation: Document images are often rotated because users hold the camera in the wrong orientation, which degrades OCR and other downstream tasks. An accurate, lightweight, and robust rotation correction method and a matching benchmark are therefore needed. Method: Builds the rotated test sets ORB-En and ORB-Indic, covering English and 11 Indic languages, and designs a lightweight four-class rotation detector with dynamic cropping on top of the Phi-3.5-Vision vision encoder, fine-tuned as a standalone model. Result: The method reaches 96% and 92% rotation classification accuracy on ORB-En and ORB-Indic respectively, and in practice improves closed-source OCR models by up to 14% and open-weights models by up to 4x. Conclusion: The ORB benchmark and the rotation classifier can evaluate and improve the robustness of OCR systems to rotation and are suitable for real deployment. Abstract: Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in real-world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR), where misalignment commonly arises due to user errors, particularly incorrect base orientations of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 Indic mid- to low-resource languages. We also present a fast, robust and lightweight rotation classification pipeline built on the vision encoder of the Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for the 4-class rotation task in a standalone fashion. Our method achieves 96% and 92% accuracy in identifying rotations on ORB-En and ORB-Indic, respectively. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance for closed-source models (up to 14%) and open-weights models (up to 4x) in the simulated real-world setting.
[73] Systematic Evaluation of Preprocessing Techniques for Accurate Image Registration in Digital Pathology
Fatemehzahra Darzi,Rodrigo Escobar Diaz Guerrero,Thomas Bocklitz
Main category: cs.CV
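TL;DR: Examines how different color transformation techniques affect registration between hematoxylin and eosin (H&E) stained images and non-linear multimodal images. CycleGAN color transformation markedly reduces registration error and improves cross-modal alignment in digital pathology.
Details
Motivation: In digital pathology, images from different stains or imaging modalities must be registered precisely to support applications such as biomarker analysis and tissue reconstruction, but modality differences make registration hard. The study therefore asks how color transformation during preprocessing affects registration performance. Method: Using 20 tissue sample pairs, the study compares the color transformation methods CycleGAN, Macenko, Reinhard, and Vahadane, combined with inversion, contrast adjustment, intensity normalization, and denoising. Registration uses VALIS, with rigid registration followed by non-rigid registration in two steps on low- and high-resolution images. Performance is measured with the relative Target Registration Error (rTRE), reported as the median of median rTRE (MMrTRE) and the average of median rTRE (AMrTRE), and complemented by a point-based evaluation on ten manually selected key points. Result: In both scenarios, using original or inverted multimodal images, CycleGAN color transformation gives the lowest registration errors while the other methods give higher errors; the custom key-point evaluation confirms this. Conclusion: Applying color transformation before registration, CycleGAN in particular, improves alignment between pathology images from different modalities and supports more reliable multimodal integration and analysis. Abstract: Image registration refers to the process of spatially aligning two or more images by mapping them into a common coordinate system, so that corresponding anatomical or tissue structures are matched across images. In digital pathology, registration enables direct comparison and integration of information from different stains or imaging modalities, supporting applications such as biomarker analysis and tissue reconstruction. Accurate registration of images from different modalities is an essential step in digital pathology. In this study, we investigated how various color transformation techniques affect image registration between hematoxylin and eosin (H&E) stained images and non-linear multimodal images. We used a dataset of 20 tissue sample pairs, with each pair undergoing several preprocessing steps, including different color transformations (CycleGAN, Macenko, Reinhard, Vahadane), inversion, contrast adjustment, intensity normalization, and denoising. All images were registered using the VALIS registration method, which first applies rigid registration and then performs non-rigid registration in two steps on both low and high-resolution images. Registration performance was evaluated using the relative Target Registration Error (rTRE). We reported the median of median rTRE values (MMrTRE) and the average of median rTRE values (AMrTRE) for each method. In addition, we performed a custom point-based evaluation using ten manually selected key points.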
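The two aggregate metrics named above can be made concrete with a short sketch. It assumes the common definition of rTRE as the Euclidean landmark error divided by the image diagonal, which the abstract does not spell out; the function names and the landmark coordinates are hypothetical and do not come from the study.

```python
import math
from statistics import mean, median


def rtre(warped, target, width, height):
    """Relative TRE per landmark: Euclidean error / image diagonal.

    Assumption: normalisation by the diagonal of the target image.
    """
    diagonal = math.hypot(width, height)
    return [math.dist(p, q) / diagonal for p, q in zip(warped, target)]


def aggregate(per_pair_rtre):
    """Return (MMrTRE, AMrTRE) from one list of rTRE values per image pair."""
    medians = [median(values) for values in per_pair_rtre]
    return median(medians), mean(medians)


# Hypothetical landmarks for three image pairs of size 300 x 400 (diagonal 500).
pairs = [
    rtre([(0, 0), (10, 0), (0, 0)], [(3, 4), (10, 0), (0, 10)], 300, 400),
    rtre([(0, 0)], [(30, 40)], 300, 400),
    rtre([(0, 0)], [(12, 16)], 300, 400),
]
mm_rtre, am_rtre = aggregate(pairs)
print(mm_rtre, am_rtre)
```

Reporting both values is informative because the median of medians is robust to a single badly registered pair, while the average of medians is pulled up by it.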
Registration was done separately for two scenarios, using either the original or inverted multimodal images. In both scenarios, CycleGAN color transformation achieved the lowest registration errors, while the other methods showed higher errors. These findings show that applying color transformation before registration improves alignment between images from different modalities and supports more reliable analysis in digital pathology.
[74] Covariance Descriptors Meet General Vision Encoders: Riemannian Deep Learning for Medical Image Classification
Josef Mayr,Anna Reithmeir,Maxime Di Folco,Julia A. Schnabel
Main category: cs.CV
TL;DR: Studies covariance descriptors built from pre-trained vision encoders for medical image classification and finds that SPDNet combined with DINOv2 features outperforms existing methods.
Details
Motivation: Covariance descriptors perform well in general computer vision but remain underexplored in medical imaging; the paper evaluates them for both conventional and learning-based medical image classification. Method: Extracts features from pre-trained general vision encoders (DINOv2 and MedSAM), builds covariance descriptors from them, compares these with handcrafted descriptors, and evaluates SPDNet classification on 11 datasets from the MedMNIST benchmark. Result: Covariance descriptors built from GVE features consistently outperform those built from handcrafted features; DINOv2 combined with SPDNet performs best and surpasses existing methods. Conclusion: Combining strong pre-trained vision encoders with covariance descriptors has considerable potential for improving medical image analysis. Abstract: Covariance descriptors capture second-order statistics of image features. They have shown strong performance in general computer vision tasks, but remain underexplored in medical imaging. We investigate their effectiveness for both conventional and learning-based medical image classification, with a particular focus on SPDNet, a classification network specifically designed for symmetric positive definite (SPD) matrices. We propose constructing covariance descriptors from features extracted by pre-trained general vision encoders (GVEs) and compare them with handcrafted descriptors. Two GVEs, DINOv2 and MedSAM, are evaluated across eleven binary and multi-class datasets from the MedMNIST benchmark. Our results show that covariance descriptors derived from GVE features consistently outperform those derived from handcrafted features. Moreover, SPDNet yields superior performance to state-of-the-art methods when combined with DINOv2 features. Our findings highlight the potential of combining covariance descriptors with powerful pretrained vision encoders for medical image analysis.
[75] AStF: Motion Style Transfer via Adaptive Statistics Fusor
Hanmo Chen,Chenghao Xu,Jiexi Yan,Cheng Deng
Main category: cs.CV
TL;DR: Proposes the Adaptive Statistics Fusor (AStF), which adds skewness and kurtosis to the statistics used for motion style transfer and outperforms existing methods.
Details
Motivation: Conventional methods rely only on mean and variance, which cannot fully capture the complex dynamic patterns and spatiotemporal coherence of motion data, so more complete statistical modeling is needed. Method: Proposes AStF, which consists of a Style Disentanglement Module (SDM) and High-Order Multi-Statistics Attention (HOS-Attn), trained together with a Motion Consistency Regularization (MCR) discriminator. Result: Experiments show AStF outperforms current state-of-the-art methods in motion style transfer and better models the spatiotemporal statistical patterns of dynamic styles. Conclusion: Adding skewness and kurtosis as higher-order statistics improves the quality of motion style transfer, and AStF offers a more complete and effective solution for the task. Abstract: Human motion style transfer allows characters to appear less rigid and more realistic in a specific style. Traditional arbitrary image style transfer typically operates on the mean and variance of features, which has proved effective. Similar methods have been adapted for motion style transfer. However, due to the fundamental differences between images and motion, relying on mean and variance is insufficient to fully capture the complex dynamic patterns and spatiotemporal coherence properties of motion data. Building upon this, our key insight is to bring two more coefficients, skewness and kurtosis, into the analysis of motion style. Specifically, we propose a novel Adaptive Statistics Fusor (AStF), which consists of a Style Disentanglement Module (SDM) and High-Order Multi-Statistics Attention (HOS-Attn). We trained our AStF in conjunction with a Motion Consistency Regularization (MCR) discriminator. Experimental results show that, by providing a more comprehensive model of the spatiotemporal statistical patterns inherent in dynamic styles, our proposed AStF outperforms state-of-the-art methods in motion style transfer. Our code and model are available at https://github.com/CHMimilanlan/AStF.
[76] MedSapiens: Taking a Pose to Rethink Medical Imaging Landmark Detection
Marawan Elbatel,Anbang Wang,Keyuan Liu,Kaouther Mouheb,Enrique Almar-Munoz,Lizhuo Lin,Yanqi Yang,Karim Lekadir,Xiaomeng Li
Main category: cs.CV
TL;DR: Proposes MedSapiens, which adapts the human-centric foundation model Sapiens to anatomical landmark detection in medical images and sets a new state of the art after multi-dataset pretraining.
Details
Motivation: Anatomical landmark detection has traditionally relied on domain-specific models, while large-scale pre-trained vision models offer new opportunities. The authors explore the potential of a human pose estimation foundation model in medical imaging. Method: Performs multi-dataset pretraining on top of the Sapiens model to adapt it to anatomical landmark detection in medical images. Result: In average success detection rate (SDR), MedSapiens improves on generalist models by up to 5.26% and on specialist models by up to 21.81%; in few-shot settings it improves on the existing state of the art by 2.69%. Conclusion: Human-centric foundation models provide strong priors for anatomical landmark detection, and MedSapiens exploits this potential with strong performance and generalization. Abstract: This paper does not introduce a novel architecture; instead, it revisits a fundamental yet overlooked baseline: adapting human-centric foundation models for anatomical landmark detection in medical imaging. While landmark detection has traditionally relied on domain-specific models, the emergence of large-scale pre-trained vision models presents new opportunities. In this study, we investigate the adaptation of Sapiens, a human-centric foundation model designed for pose estimation, to medical imaging through multi-dataset pretraining, establishing a new state of the art across multiple datasets. Our proposed model, MedSapiens, demonstrates that human-centric foundation models, inherently optimized for spatial pose localization, provide strong priors for anatomical landmark detection, yet this potential has remained largely untapped. We benchmark MedSapiens against existing state-of-the-art models, achieving up to 5.26% improvement over generalist models and up to 21.81% improvement over specialist models in the average success detection rate (SDR). To further assess MedSapiens' adaptability to novel downstream tasks with few annotations, we evaluate its performance in limited-data settings, achieving 2.69% improvement over the few-shot state of the art in SDR. Code and model weights are available at https://github.com/xmed-lab/MedSapiens.
[77] Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery
Claudio Giusti,Luca Guarnera,Sebastiano Battiato
Main category: cs.CV
TL;DR: Proposes Proto-LeakNet, an interpretable attribution framework for AI-generated images and deepfakes that exploits signal-leak traces of diffusion models in latent space to classify known generators and recognize unseen ones in an open-set setting.
Details
Motivation: As synthetic images and deepfakes grow more sophisticated, verifying image authenticity becomes more important. Existing methods struggle with unseen generators and lack interpretability, so a robust, interpretable attribution method that generalizes to unknown generators is needed. Method: Proto-LeakNet operates in the latent space of diffusion models and re-simulates part of the forward diffusion process to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, and a feature-weighted prototype head builds an interpretable embedding space. Closed-set classification is combined with density-based open-set evaluation, so unseen generators can be analyzed without retraining. Result: Trained only on closed-set data, the method reaches a Macro AUC of 98.13%, remains robust under post-processing, surpasses state-of-the-art methods, and separates known from unseen generators well in the embedding space. Conclusion: By modeling signal-leak bias in latent space, Proto-LeakNet achieves reliable and interpretable detection of AI-generated images and deepfakes, providing a useful tool for digital content forensics. Abstract: The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Operating in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability between known and unseen generators. These results demonstrate that modeling signal-leak bias in latent space enables reliable and interpretable AI-image and deepfake forensics. The code for the whole work will be available upon submission.
[78] DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification
Yujie Yang,Shuang Li,Jun Ye,Neng Dong,Fan Li,Huafeng Li
Main category: cs.CV
TL;DR: Proposes the DinoGRL framework, which uses DINOv2 to learn gait features for cross-modal person re-identification and combines appearance and gait information through the SASGL and PBMGE modules for better video-level representations.
Details
Motivation: Existing methods focus on modality-invariant visual features and overlook gait features, which are both modality-invariant and rich in temporal dynamics. This limits their ability to model the spatiotemporal consistency needed for cross-modal video matching. Method: Proposes the DinoGRL framework. It includes a Semantic-Aware Silhouette and Gait Learning (SASGL) model that uses semantic priors from DINOv2 to enhance silhouette representations, and a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module that fuses gait and appearance features at multiple spatial granularities. Result: Experiments on the HITSZ-VCM and BUPT datasets show the method significantly outperforms existing state-of-the-art methods, confirming its effectiveness for cross-modal video person re-identification. Conclusion: DinoGRL combines DINOv2 priors with gait features, improves the robustness and discriminability of cross-modal video person re-identification, and makes fuller use of temporal dynamics. Abstract: Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features.
Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.
[79] FastGS: Training 3D Gaussian Splatting in 100 Seconds
Shiwei Ren,Tianci Wen,Yongchun Fang,Biao Lu
Main category: cs.CV
TL;DR: Proposes FastGS, a 3D Gaussian splatting acceleration framework based on multi-view consistency whose densification and pruning strategy greatly speeds up training while preserving rendering quality.
Details
Motivation: Existing 3D Gaussian splatting acceleration methods cannot properly control the number of Gaussians during training, which leads to redundant computation and excessive training time. Method: Designs a Gaussian importance measure based on multi-view consistency and a densification and pruning strategy that needs no budgeting mechanism, dynamically optimizing the distribution of Gaussians. Result: Training is 3.32x faster than DashGaussian on Mip-NeRF 360 and 15.45x faster than vanilla 3DGS on Deep Blending, with 2-7x acceleration across a range of tasks. Conclusion: FastGS is an efficient, general acceleration framework that markedly shortens training time in many scenarios while keeping rendering quality high. Abstract: The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.32$\times$ training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45$\times$ acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-7$\times$ training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping. The project page is available at https://fastgs.github.io/
[80] Vision Foundation Models in Agriculture: Toward Domain-Specific Adaptation for Weed Herbicide Trials Assessment
Leire Benito-Del-Valle,Artzai Picón,Daniel Mugica,Manuel Ramos,Eva Portillo,Javier Romero,Carlos Javier Jimenez,Ramón Navarra-Mestre
Main category: cs.CV
TL;DR: Adapts a general-purpose vision foundation model to species identification and damage classification in herbicide field trials through self-supervised learning on a large agricultural dataset, markedly improving performance across scenarios.
Details
Motivation: General-purpose vision models are limited in agriculture, where fine-grained differences between species and damage types matter, so domain-specific adaptation is needed to improve accuracy and generalization. Method: Uses self-supervised learning on a large, curated agricultural dataset to give a general-purpose vision foundation model domain-specific pretraining, so that it learns rich, transferable representations for herbicide trial images. Result: The domain-specific model clearly outperforms the general-purpose model in species identification (F1 from 0.91 to 0.94) and damage classification (from 0.26 to 0.33). Gains are larger under new environmental conditions and persist under domain shift such as drone imagery. Segmentation accuracy is also higher with few annotations, and the model beats the general-purpose one with only 20% of the labeled data. Conclusion: Domain-specific foundation models generalize better and use annotations more efficiently, which can substantially reduce manual labeling and provides a scalable, automated solution for herbicide trial analysis. Abstract: Herbicide field trials require accurate identification of plant species and assessment of herbicide-induced damage across diverse environments. While general-purpose vision foundation models have shown promising results in complex visual domains, their performance can be limited in agriculture, where fine-grained distinctions between species and damage types are critical. In this work, we adapt a general-purpose vision foundation model to herbicide trial characterization. Trained using a self-supervised learning approach on a large, curated agricultural dataset, the model learns rich and transferable representations optimized for herbicide trial images. Our domain-specific model significantly outperforms the best general-purpose foundation model in both species identification (F1 score improvement from 0.91 to 0.94) and damage classification (from 0.26 to 0.33). Under unseen conditions (new locations and other times), it achieves even greater gains (species identification from 0.56 to 0.66; damage classification from 0.17 to 0.27). In domain-shift scenarios, such as drone imagery, it maintains strong performance (species classification from 0.49 to 0.60). Additionally, we show that domain-specific pretraining enhances segmentation accuracy, particularly in low-annotation regimes. An annotation-efficiency analysis reveals that, under unseen conditions, the domain-specific model achieves 5.4% higher F1 score than the general-purpose model, while using 80% fewer labeled samples.
These results demonstrate the generalization capabilities of domain-specific foundation models and their potential to significantly reduce manual annotation efforts, offering a scalable and automated solution for herbicide trial analysis.
[81] Deep learning-based object detection of offshore platforms on Sentinel-1 Imagery and the impact of synthetic training data
Robin Spanier,Thorsten Hoeser,Claudia Kuenzer
Main category: cs.CV
TL;DR: Combines synthetic and real satellite imagery with YOLOv10 to improve offshore infrastructure detection, showing that synthetic data helps mitigate sample imbalance and supports detection transfer across regions.
Details
Motivation: Samples of offshore infrastructure are scarce across types, shapes, and sizes, especially for rare classes, so detection models rarely get comprehensive, balanced training data. Better generalization and better-balanced data are needed. Method: Trains YOLOv10 object detection models on a combination of synthetic and real Sentinel-1 satellite imagery and runs a region-holdout evaluation on three unseen regions (Gulf of Mexico, North Sea, Persian Gulf) to test geographic transferability. Result: The F1 score rises from 0.85 to 0.90. In total 3,529 offshore platforms are detected: 411 in the North Sea, 1,519 in the Gulf of Mexico, and 1,593 in the Persian Gulf, which supports the model's ability to generalize. Conclusion: Synthetic data can reduce class imbalance and improve model performance, offering a practical route toward scalable, global monitoring of offshore infrastructure. Abstract: The recent and ongoing expansion of marine infrastructure, including offshore wind farms, oil and gas platforms, artificial islands, and aquaculture facilities, highlights the need for effective monitoring systems. The development of robust models for offshore infrastructure detection relies on comprehensive, balanced datasets, but falls short when samples are scarce, particularly for underrepresented object classes, shapes, and sizes. By training deep learning-based YOLOv10 object detection models with a combination of synthetic and real Sentinel-1 satellite imagery acquired in the fourth quarter of 2023 from four regions (Caspian Sea, South China Sea, Gulf of Guinea, and Coast of Brazil), this study investigates the use of synthetic training data to enhance model performance. We evaluated this approach by applying the model to detect offshore platforms in three unseen regions (Gulf of Mexico, North Sea, Persian Gulf) and thereby assess geographic transferability. This region-holdout evaluation demonstrated that the model generalises beyond the training areas. In total, 3,529 offshore platforms were detected, including 411 in the North Sea, 1,519 in the Gulf of Mexico, and 1,593 in the Persian Gulf. The model achieved an F1 score of 0.85, which improved to 0.90 upon incorporating synthetic data. We analysed how synthetic data enhances the representation of unbalanced classes and overall model performance, taking a first step toward globally transferable detection of offshore infrastructure.
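For readers less familiar with the F1 score quoted above, the sketch below shows how it follows from detection counts. The true-positive, false-positive, and false-negative counts are purely hypothetical illustrations; the abstract reports only the resulting scores (0.85 and 0.90), not the underlying counts.

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall.

    tp: detections matched to a real platform
    fp: detections with no matching platform
    fn: real platforms that were missed
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the scale of the metric.
print(f1_score(tp=85, fp=15, fn=15))  # one way to obtain an F1 near 0.85
print(f1_score(tp=90, fp=10, fn=10))  # one way to obtain an F1 near 0.90
```

Because F1 combines precision and recall, a gain like the one reported can come from fewer false alarms, fewer missed platforms, or both.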
This study underscores the importance of balanced datasets and highlights synthetic data generation as an effective strategy to address common challenges in remote sensing, demonstrating the potential of deep learning for scalable, global offshore infrastructure monitoring.
[82] RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation
Xiangjun Zhang,Litong Gong,Yinglin Zheng,Yansong Liu,Wentao Jiang,Mingyi Xu,Biao Wang,Tiezheng Ge,Ming Zeng
Main category: cs.CV
TL;DR: Proposes RISE-T2V, a framework that integrates prompt rephrasing and semantic feature extraction to improve how text-to-video models understand user intent and the quality of what they generate.
Details
Motivation: Existing T2V models depend on pre-trained text encoders that understand short prompts poorly and cannot rephrase prompts online to align better with user intent, which limits scalability and usability. Method: Proposes the RISE-T2V framework with a Rephrasing Adapter module that merges prompt rephrasing and semantic extraction into one step. The hidden states produced by the LLM during next-token prediction serve as the condition for video generation, giving implicit prompt rephrasing. The method can be applied to various LLMs and video diffusion models. Result: Experiments show RISE-T2V markedly improves the generation quality of video diffusion models with different architectures and yields high-quality videos that match user intent better, with good generality and extensibility. Conclusion: By unifying prompt rephrasing and semantic extraction, RISE-T2V strengthens semantic understanding and alignment with user intent in T2V models, providing a more flexible and capable framework for text-to-video generation. Abstract: Most text-to-video (T2V) diffusion models depend on pre-trained text encoders for semantic alignment, yet they often fail to maintain video quality when provided with concise prompts rather than well-designed ones. The primary issue lies in their limited understanding of textual semantics. Moreover, these text encoders cannot rephrase prompts online to better align with user intentions, which limits both the scalability and usability of the models. To address these challenges, we introduce RISE-T2V, which uniquely integrates the processes of prompt rephrasing and semantic feature extraction into a single and seamless step instead of two separate steps. RISE-T2V is universal and can be applied to various pre-trained LLMs and video diffusion models (VDMs), significantly enhancing their capabilities for T2V tasks. We propose an innovative module called the Rephrasing Adapter, enabling diffusion models to utilize text hidden states during the next token prediction of the LLM as a condition for video generation. By employing a Rephrasing Adapter, the video generation model can implicitly rephrase basic prompts into more comprehensive representations that better match the user's intent. Furthermore, we leverage the powerful capabilities of LLMs to enable video generation models to accomplish a broader range of T2V tasks. Extensive experiments demonstrate that RISE-T2V is a versatile framework applicable to different video diffusion model architectures, significantly enhancing the ability of T2V models to generate high-quality videos that align with user intent.
Visual results are available on the webpage at https://rise-t2v.github.io.
[83] Submanifold Sparse Convolutional Networks for Automated 3D Segmentation of Kidneys and Kidney Tumours in Computed Tomography
Saúl Alonso-Monsalve,Leigh H. Whitehead,Adam Aurisano,Lorena Escudero Sanchez
Main category: cs.CV
TL;DR: Proposes a two-stage method based on voxel sparsification and submanifold sparse convolutional networks for automated tumour segmentation in high-resolution 3D medical images, keeping accuracy high while sharply reducing compute requirements.
Details
Motivation: Conventional convolutional neural networks face heavy computation and high GPU memory use on high-resolution 3D medical images, which limits their clinical use. Method: A two-stage strategy: voxel sparsification first, then segmentation with submanifold sparse convolutional networks, which allows efficient processing of high-resolution 3D inputs. Result: On the KiTS23 renal cancer CT dataset the method is competitive with the challenge winners, with Dice coefficients of 95.8% for kidneys + masses, 85.7% for tumours + cysts, and 80.3% for tumours alone; inference time falls by up to 60% and VRAM use by up to 75%. Conclusion: The method keeps accuracy high while greatly reducing compute requirements and shows good prospects for clinical use. Abstract: The accurate delineation of tumours in radiological images such as Computed Tomography is a very specialised and time-consuming task, and currently a bottleneck preventing quantitative analyses from being performed routinely in the clinical setting. For this reason, developing methods for the automated segmentation of tumours in medical imaging is of the utmost importance and has driven significant efforts in recent years. However, the large number of voxels in 3D scans makes them impractical to process directly, so applying traditional convolutional neural networks usually requires downsampling the images or working on patches. To overcome this problem, in this paper we propose a new two-stage methodology that combines voxel sparsification with submanifold sparse convolutional networks. This method allows segmentations to be performed with high-resolution inputs and a native 3D model architecture, obtaining state-of-the-art accuracies while significantly reducing the computational resources needed in terms of GPU memory and time. We studied the deployment of this methodology in the context of Computed Tomography images of renal cancer patients from the KiTS23 challenge, and our method achieved results competitive with the challenge winners, with Dice similarity coefficients of 95.8% for kidneys + masses, 85.7% for tumours + cysts, and 80.3% for tumours alone.
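The Dice similarity coefficient reported above has a simple form when a segmentation is stored sparsely, as a set of occupied voxel coordinates. The sketch below is only an illustration of the metric: the function name and the toy masks are hypothetical, and returning 1.0 for two empty masks is a convention chosen here, not something stated in the abstract.

```python
def dice_coefficient(predicted, reference):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two sets of voxel coordinates."""
    if not predicted and not reference:
        return 1.0  # assumed convention: two empty masks agree perfectly
    overlap = len(predicted & reference)
    return 2.0 * overlap / (len(predicted) + len(reference))


# Toy masks given as sets of (z, y, x) voxel indices.
predicted_mask = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
reference_mask = {(0, 0, 0), (0, 0, 1), (1, 1, 1)}
print(dice_coefficient(predicted_mask, reference_mask))
```

Storing only occupied voxels mirrors the idea behind sparsification: the cost of the computation scales with the number of foreground voxels rather than with the full volume.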
Crucially, our method also offers significant computational improvements, achieving up to a 60% reduction in inference time and up to a 75% reduction in VRAM usage compared to an equivalent dense architecture, across both CPU and various GPU cards tested.
[84] Comparative Study of CNN Architectures for Binary Classification of Horses and Motorcycles in the VOC 2008 Dataset
Muhammad Annas Shaikh,Hamza Zaman,Arbaz Asif
Main category: cs.CV
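TL;DR: Evaluates nine convolutional neural network architectures for binary classification of horses and motorcycles on the VOC 2008 dataset, with a focus on class imbalance. Using minority-class augmentation, the study compares modern architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer. ConvNeXt-Tiny performs best, with an Average Precision (AP) of 95.53% for horses and 89.12% for motorcycles, and data augmentation clearly improves minority-class detection, especially for deeper architectures.
Details
Motivation: VOC 2008 has a severe class imbalance that hurts binary classification, so a systematic evaluation of CNN architectures on this task is needed, along with a study of how well data augmentation mitigates the problem. Method: Runs binary classification experiments with nine CNN architectures, handles class imbalance with minority-class augmentation, and compares ResNet-50, ConvNeXt-Tiny, DenseNet-121, Vision Transformer, and others across multiple performance metrics. Result: ConvNeXt-Tiny performs best, with AP of 95.53% for horses and 89.12% for motorcycles. Data augmentation clearly improves minority-class detection, particularly for deeper models, and performance varies substantially between architectures. Conclusion: For imbalanced binary classification, choosing a suitable architecture such as ConvNeXt-Tiny and combining it with data augmentation can clearly improve detection performance; the study offers practical guidance for model selection and tuning in similar tasks. Abstract: This paper presents a comprehensive evaluation of nine convolutional neural network architectures for binary classification of horses and motorcycles in the VOC 2008 dataset. We address the significant class imbalance problem by implementing minority-class augmentation techniques. Our experiments compare modern architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer across multiple performance metrics. Results demonstrate substantial performance variations, with ConvNeXt-Tiny achieving the highest Average Precision (AP) of 95.53% for horse detection and 89.12% for motorcycle detection. We observe that data augmentation significantly improves minority class detection, particularly benefiting deeper architectures. This study provides insights into architecture selection for imbalanced binary classification tasks and quantifies the impact of data augmentation strategies in mitigating class imbalance issues in object detection.
[85] Evaluating the Impact of Weather-Induced Sensor Occlusion on BEVFusion for 3D Object Detection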
Sanjay Kumar,Tim Brophy,Eoin Martino Grua,Ganesh Sistu,Valentina Donzella,Ciaran Eising
Main category: cs.CV
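TL;DR: This paper studies how occlusion of camera and LiDAR sensors affects 3D object detection in a bird's-eye-view (BEV) fusion architecture. It finds that the model relies more heavily on LiDAR: LiDAR performance drops sharply under heavy occlusion, whereas camera occlusion has little effect on the fused result.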
Details
Motivation: Although BEV fusion architectures perform well in multimodal perception, the effect of environmental occlusion (such as fog, haze, or physical obstructions) on 3D detection accuracy has not been studied sufficiently. Method: The BEVFusion architecture is evaluated on the nuScenes dataset for 3D detection under varying degrees of camera and LiDAR occlusion, using mAP and NDS as metrics. Result: With the camera alone, moderate occlusion reduces mAP by 41.3%; LiDAR alone drops sharply only under heavy occlusion, by 47.3%, with a severe impact on long-range detection; in the fused setting, camera occlusion reduces mAP by 4.1%, while LiDAR occlusion reduces it by 26.8%. Conclusion: Current fusion models rely more heavily on LiDAR; further research is needed on occlusion-aware evaluation methods and robust sensor fusion techniques that cope with partial sensor failure or environmental degradation. Abstract: Accurate 3D object detection is essential for automated vehicles to navigate safely in complex real-world environments. Bird's Eye View (BEV) representations, which project multi-sensor data into a top-down spatial format, have emerged as a powerful approach for robust perception. Although BEV-based fusion architectures have demonstrated strong performance through multimodal integration, the effects of sensor occlusions, caused by environmental conditions such as fog, haze, or physical obstructions, on 3D detection accuracy remain underexplored. In this work, we investigate the impact of occlusions on both camera and Light Detection and Ranging (LiDAR) outputs using the BEVFusion architecture, evaluated on the nuScenes dataset. Detection performance is measured using mean Average Precision (mAP) and the nuScenes Detection Score (NDS). Our results show that moderate camera occlusions lead to a 41.3% drop in mAP (from 35.6% to 20.9%) when detection is based only on the camera. On the other hand, LiDAR sharply drops in performance only under heavy occlusion, with mAP falling by 47.3% (from 64.7% to 34.1%), with a severe impact on long-range detection. In fused settings, the effect depends on which sensor is occluded: occluding the camera leads to a minor 4.1% drop (from 68.5% to 65.7%), while occluding LiDAR results in a larger 26.8% drop (to 50.1%), revealing the model's stronger reliance on LiDAR for the task of 3D object detection.
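The percentage drops quoted in this abstract are relative changes in mAP, not differences in percentage points. The short Python sketch below recomputes them from the before/after values given in the text. The function name `relative_drop` is illustrative, and the 68.5% baseline for the LiDAR-occluded fusion case is an assumption (the abstract only states the value after occlusion, "to 50.1%"). Because the quoted mAP values are themselves rounded, a recomputed figure may differ from the reported one by about 0.1.

```python
def relative_drop(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, expressed in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# (setting, mAP before occlusion, mAP after occlusion), as quoted in the abstract.
# The 68.5 baseline in the last row is assumed to be the unoccluded fusion mAP.
reported = [
    ("camera only, moderate occlusion", 35.6, 20.9),
    ("LiDAR only, heavy occlusion", 64.7, 34.1),
    ("fusion, camera occluded", 68.5, 65.7),
    ("fusion, LiDAR occluded", 68.5, 50.1),
]

for setting, before, after in reported:
    print(f"{setting}: {relative_drop(before, after):.1f}% relative drop")
```

Reading the numbers this way makes the comparison explicit: in the fused setting, losing LiDAR costs several times more relative mAP than losing the camera.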
Our results highlight the need for future research into occlusion-aware evaluation methods and improved sensor fusion techniques that can maintain detection accuracy in the presence of partial sensor failure or degradation due to adverse environmental conditions.
[86] A MATLAB tutorial on deep feature extraction combined with chemometrics for analytical applications
Puneet Mishra,Martijntje Vollebregt,Yizhou Ma,Maria Font-i-Furnols
Main category: cs.CV
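TL;DR: This tutorial gives step-by-step guidance to help analytical chemistry researchers use existing open-source deep learning models to extract spatial information from imaging data and combine it with other data, such as spectral information, to strengthen their data analysis.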
Details
Motivation: In analytical chemistry, imaging techniques are widely used, but traditional chemometric methods struggle to extract and analyze spatial information efficiently. Although deep learning has advanced image processing, its use in this field remains limited for lack of systematic implementation guidance. Method: Rather than training deep learning models, the work uses existing open-source deep learning models and demonstrates, through MATLAB code tutorials, how to extract deep features from several imaging modalities and integrate multi-source data. Result: MATLAB code examples that run on real datasets are provided, showing the concrete steps for extracting spatial information from different imaging data and supporting readers in reproducing and applying them on their own data. Conclusion: The tutorial lowers the barrier to applying deep learning in analytical chemistry and gives researchers a practical, usable tool, promoting wider adoption of deep learning in the field. Abstract: Background: In analytical chemistry, spatial information about materials is commonly captured through imaging techniques, such as traditional color cameras or with advanced hyperspectral cameras and microscopes. However, efficiently extracting and analyzing this spatial information for exploratory and predictive purposes remains a challenge, especially when using traditional chemometric methods. Recent advances in deep learning and artificial intelligence have significantly enhanced image processing capabilities, enabling the extraction of multiscale deep features that are otherwise challenging to capture with conventional image processing techniques. Despite the wide availability of open-source deep learning models, adoption in analytical chemistry remains limited because of the absence of structured, step-by-step guidance for implementing these models. Results: This tutorial aims to bridge this gap by providing a step-by-step guide for applying deep learning approaches to extract spatial information from imaging data and integrating it with other data sources, such as spectral information. Importantly, the focus of this work is not on training deep learning models for image processing but on using existing open source models to extract deep features from imaging data. Significance: The tutorial provides MATLAB code demonstrations, showcasing the processing of imaging data from various imaging modalities commonly encountered in analytical chemistry.
Readers must run the tutorial steps on their own datasets using the codes presented in this tutorial.
[87] Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Jingqi Tong,Yurong Mou,Hangcheng Li,Mingzhe Li,Yongzhuo Yang,Ming Zhang,Qiguang Chen,Tianyi Liang,Xiaomeng Hu,Yining Zheng,Xinchi Chen,Jun Zhao,Xuanjing Huang,Xipeng Qiu
Main category: cs.CV
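TL;DR: This paper proposes "Thinking with Video", a new paradigm that uses video generation models (such as Sora-2) to unify visual and textual reasoning, and validates its strong performance on vision and text tasks with the VideoThinkBench benchmark, showing the potential of video generation models as unified multimodal reasoning models.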
Details
Motivation: The existing "Thinking with Text" and "Thinking with Images" paradigms struggle to capture dynamic processes and separate the text and image modalities, limiting unified multimodal understanding and generation. Method: The "Thinking with Video" paradigm uses video generation models (such as Sora-2) for cross-modal reasoning, and the VideoThinkBench benchmark is built with two task categories, vision-centric and text-centric. Result: On vision tasks Sora-2 matches or even surpasses SOTA VLMs, doing better on tasks such as Eyeballing Games; on text tasks it reaches 92% accuracy on MATH and 75.53% on MMMU, and performance can be improved further with self-consistency and in-context learning. Conclusion: Video generation models have the potential to become unified multimodal understanding and generation models, and "Thinking with Video" is a promising unified multimodal reasoning paradigm. Abstract: The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision as distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance.
In summary, our findings demonstrate that the video generation model is a potential unified multimodal understanding and generation model, positioning "thinking with video" as a unified multimodal reasoning paradigm.
[88] Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA
Itbaan Safwan,Muhammad Annas Shaikh,Muhammad Haaris,Ramail Khan,Muhammad Atif Tahir
Main category: cs.CV
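TL;DR: This paper proposes a multi-task framework based on a LoRA-tuned Florence-2 model for the MediaEval Medico 2025 challenge, performing visual question answering, explanation generation, and visual grounding simultaneously.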
Details
Motivation: To improve the accuracy and interpretability of medical visual question answering systems by integrating visual understanding, reasoning, and localization through multi-task learning. Method: A Florence-2 model is fine-tuned with LoRA on three curated datasets: Kvasir-VQA-x1 (question answering), a synthetically enriched explanation dataset (medical reasoning), and a text-to-region pair dataset (aligning visual features with segmentation masks), enabling joint multi-task training. Result: Experiments show that the method substantially outperforms single-task baselines in both answer accuracy and visual localization. Conclusion: The proposed multi-task framework improves medical VQA performance and demonstrates the potential of grounded multi-task learning for medical image understanding. Abstract: We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.
[89] BoRe-Depth: Self-supervised Monocular Depth Estimation with Boundary Refinement for Embedded Systems
Chang Liu,Juan Li,Sheng Zhang,Chang Liu,Jie Li,Xu Zhang
Main category: cs.CV
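TL;DR: This paper proposes BoRe-Depth, a novel monocular depth estimation model with only 8.7M parameters that is suited to embedded systems; it improves depth estimation performance while significantly improving object boundary quality.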
Details
Motivation: Existing monocular depth estimation methods suffer from poor depth estimation performance and blurred object boundaries on embedded systems, limiting 3D perception in unmanned systems. Method: The BoRe-Depth model includes an Enhanced Feature Adaptive Fusion Module (EFAF) to improve boundary detail representation and integrates semantic knowledge into the encoder to strengthen object recognition and boundary perception. Result: The model runs at 50.7 FPS on NVIDIA Jetson Orin and significantly outperforms previous lightweight models on multiple challenging datasets, with clearly improved boundary quality. Conclusion: BoRe-Depth stays lightweight while delivering high-performance depth estimation and sharp boundary predictions, making it suitable for deployment on resource-constrained embedded systems. Abstract: Depth estimation is one of the key technologies for realizing 3D perception in unmanned systems. Monocular depth estimation has been widely researched because of its low-cost advantage, but the existing methods face the challenges of poor depth estimation performance and blurred object boundaries on embedded systems. In this paper, we propose a novel monocular depth estimation model, BoRe-Depth, which contains only 8.7M parameters. It can accurately estimate depth maps on embedded systems and significantly improves boundary quality. Firstly, we design an Enhanced Feature Adaptive Fusion Module (EFAF) which adaptively fuses depth features to enhance boundary detail representation. Secondly, we integrate semantic knowledge into the encoder to improve the object recognition and boundary perception capabilities. Finally, BoRe-Depth is deployed on NVIDIA Jetson Orin, and runs efficiently at 50.7 FPS. We demonstrate that the proposed model significantly outperforms previous lightweight models on multiple challenging datasets, and we provide detailed ablation studies for the proposed methods. The code is available at https://github.com/liangxiansheng093/BoRe-Depth.
[90] DORAEMON: A Unified Library for Visual Object Modeling and Representation Learning at Scale
Ke Du,Yimin Peng,Chao Gao,Fan Zhou,Siqiao Xue
Main category: cs.CV
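TL;DR: DORAEMON is an open-source PyTorch library that unifies visual object modeling and representation learning across scales. It supports classification, retrieval, and metric learning, provides more than 1000 pretrained backbones, and offers one-command export to ONNX or HuggingFace, bridging research and deployment.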
Details
Motivation: To unify and simplify visual object modeling and representation learning pipelines across tasks and scales, and to speed the transfer from research to application. Method: A YAML-driven workflow integrates multiple tasks (classification, retrieval, metric learning), exposes a large number of pretrained models through a timm-compatible interface, and supports modular losses, augmentations, and distributed-training utilities. Result: Reference results are matched or exceeded on ImageNet-1K, MS-Celeb-1M, and Stanford Online Products, and models can be exported to ONNX or HuggingFace with one command. Conclusion: DORAEMON provides a scalable foundation for visual recognition and representation learning, enabling rapid experimentation and efficient transfer of research results to real-world applications. Abstract: DORAEMON is an open-source PyTorch library that unifies visual object modeling and representation learning across diverse scales. A single YAML-driven workflow covers classification, retrieval and metric learning; more than 1000 pretrained backbones are exposed through a timm-compatible interface, together with modular losses, augmentations and distributed-training utilities. Reproducible recipes match or exceed reference results on ImageNet-1K, MS-Celeb-1M and Stanford Online Products, while one-command export to ONNX or HuggingFace bridges research and deployment. By consolidating datasets, models, and training techniques into one platform, DORAEMON offers a scalable foundation for rapid experimentation in visual recognition and representation learning, enabling efficient transfer of research advances to real-world applications. The repository is available at https://github.com/wuji3/DORAEMON.
[91] HideAndSeg: an AI-based tool with automated prompting for octopus segmentation in natural habitats
Alan de Aguiar,Michaella Pereira Andrade,Charles Morphy D. Santos,João Paulo Gois
Main category: cs.CV
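TL;DR: This paper proposes HideAndSeg, a semi-automatic AI tool for segmenting octopus videos in natural environments. It combines SAM2 with a custom-trained YOLOv11 detector and introduces two unsupervised evaluation metrics, achieving robust segmentation while reducing manual intervention.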
Details
Motivation: Octopuses are hard to analyze in their natural environments because of their camouflage, rapid skin color changes, non-rigid body deformations, and frequent occlusions, compounded by variable underwater lighting and turbidity; the lack of large-scale annotated datasets also limits automated analysis. Method: HideAndSeg first uses user-provided point coordinates to generate initial segmentation masks with SAM2, and these masks train a YOLOv11 detector; the system then switches to fully automatic mode, in which YOLO bounding boxes serve as prompts for SAM2 to continue segmentation, and two unsupervised metrics, temporal consistency DICE_t and new component count NC_t, are used to evaluate and refine segmentation quality. Result: Compared with manual prompting, HideAndSeg markedly reduces segmentation noise and can re-identify and accurately segment the octopus after complete occlusion, showing good robustness and continuity in real natural scenes. Conclusion: The method greatly reduces reliance on manual annotation and provides an efficient, practical video analysis tool for behavioral studies of wild cephalopods. Abstract: Analyzing octopuses in their natural habitats is challenging due to their camouflage capability, rapid changes in skin texture and color, non-rigid body deformations, and frequent occlusions, all of which are compounded by variable underwater lighting and turbidity. Addressing the lack of large-scale annotated datasets, this paper introduces HideAndSeg, a novel, minimally supervised AI-based tool for segmenting videos of octopuses. It establishes a quantitative baseline for this task. HideAndSeg integrates SAM2 with a custom-trained YOLOv11 object detector. First, the user provides point coordinates to generate the initial segmentation masks with SAM2. These masks serve as training data for the YOLO model. After that, our approach fully automates the pipeline by providing a bounding box prompt to SAM2, eliminating the need for further manual intervention. We introduce two unsupervised metrics - temporal consistency $DICE_t$ and new component count $NC_t$ - to quantitatively evaluate segmentation quality and guide mask refinement in the absence of ground-truth data, i.e., real-world information that serves to train, validate, and test AI models. Results show that HideAndSeg achieves satisfactory performance, reducing segmentation noise compared to the manually prompted approach. Our method can re-identify and segment the octopus even after periods of complete occlusion in natural environments, a scenario in which the manually prompted model fails.
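The abstract names the two unsupervised metrics but does not give their formulas. The Python sketch below shows one plausible reading, stated here as an assumption rather than the authors' definition: $DICE_t$ as the Dice overlap between the masks of consecutive frames, and $NC_t$ as the number of connected components in the current mask that do not overlap the previous mask. Masks are represented as sets of (row, column) pixel coordinates, and the helper names (`dice`, `components`, `new_component_count`) are hypothetical.

```python
from collections import deque
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def dice(prev: Set[Pixel], curr: Set[Pixel]) -> float:
    """Dice overlap between two masks; defined as 1.0 when both are empty."""
    if not prev and not curr:
        return 1.0
    return 2 * len(prev & curr) / (len(prev) + len(curr))


def components(mask: Set[Pixel]) -> List[Set[Pixel]]:
    """Split a mask into 4-connected components with a breadth-first search."""
    remaining = set(mask)
    found = []
    while remaining:
        seed = remaining.pop()
        comp = {seed}
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    comp.add(nb)
                    queue.append(nb)
        found.append(comp)
    return found


def new_component_count(prev: Set[Pixel], curr: Set[Pixel]) -> int:
    """Count components of `curr` that share no pixel with `prev` (assumed NC_t)."""
    return sum(1 for comp in components(curr) if not (comp & prev))


# Toy frames: the object shifts one column right and a stray blob appears at (5, 5).
frame_prev = {(0, 0), (0, 1), (1, 0), (1, 1)}
frame_curr = {(0, 1), (1, 1), (0, 2), (1, 2), (5, 5)}
print(dice(frame_prev, frame_curr), new_component_count(frame_prev, frame_curr))
```

Under this reading, a sudden fall in the Dice value or a jump in the component count between adjacent frames would flag likely segmentation noise without any ground-truth masks.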
By reducing the need for manual analysis in real-world scenarios, this work provides a practical tool that paves the way for more efficient behavioral studies of wild cephalopods.
[92] Solving Convex Partition Visual Jigsaw Puzzles
Yaniv Ohayon,Ofir Itzhak Shahar,Ohad Ben-Shahar
Main category: cs.CV
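TL;DR: This paper proposes an automatic method for solving convex-partition polygonal jigsaw puzzles. It combines geometric and pictorial compatibility in a greedy solver and releases the first benchmark dataset for this kind of puzzle.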
Details
Motivation: Existing research focuses mainly on solving square jigsaw puzzles, which limits practical use, so solvers need to be extended to a broader range of puzzle types. Method: A greedy algorithm that uses geometric and pictorial compatibility features is designed to solve convex-partition puzzles. Result: Convex-partition puzzles are solved effectively, the first benchmark dataset is established, and several performance measures are reported. Conclusion: The method significantly expands the types of puzzles that can be solved computationally and provides new ideas and foundational resources for automated solving of polygonal puzzles. Abstract: Jigsaw puzzle solving requires the rearrangement of unordered pieces into their original pose in order to reconstruct a coherent whole, often an image, and is known to be an intractable problem. While the possible impact of automatic puzzle solvers can be disruptive in various application domains, most of the literature has focused on developing solvers for square jigsaw puzzles, severely limiting their practical use. In this work, we significantly expand the types of puzzles handled computationally, focusing on what is known as Convex Partitions, a major subset of polygonal puzzles whose pieces are convex. We utilize both geometrical and pictorial compatibilities, introduce a greedy solver, and report several performance measures next to the first benchmark dataset of such puzzles.
[93] V-Thinker: Interactive Thinking with Images
Runqi Qiao,Qiuna Tan,Minghan Yang,Guanting Dong,Peiqing Yang,Shiqiang Lang,Enhui Wan,Xiaowan Wang,Yida Xu,Lan Yang,Chong Sun,Chen Li,Honggang Zhang
Main category: cs.CV
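TL;DR: This paper proposes V-Thinker, a general-purpose multimodal reasoning assistant that achieves image-interactive reasoning through end-to-end reinforcement learning, and builds a Data Evolution Flywheel and a Visual Progressive Training Curriculum to improve the model on diverse, high-quality, and difficult tasks.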
Details
Motivation: Existing large multimodal models are limited in integrating image interaction with long-horizon reasoning by narrow visual tool spaces and task-specific workflow designs, and a more general solution is needed. Method: V-Thinker comprises a Data Evolution Flywheel (which automatically synthesizes, evolves, and verifies interactive reasoning datasets) and a Visual Progressive Training Curriculum (which first aligns perception through point-level supervision and then integrates interactive reasoning through two-stage reinforcement learning). Result: On the expert-verified VTBench benchmark, V-Thinker significantly outperforms strong multimodal baselines in both general and interactive reasoning scenarios. Conclusion: V-Thinker advances the "Thinking with Images" paradigm and provides an effective framework for image-interactive reasoning applications. Abstract: Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions (diversity, quality, and difficulty); and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
[94] Landslide Hazard Mapping with Geospatial Foundation Models: Geographical Generalizability, Data Scarcity, and Band Adaptability
Wenwen Li,Sizhe Wang,Hyunho Lee,Chenyan Lu,Sujit Roy,Rahul Ramachandran,Chia-Yu Hsu
Main category: cs.CV
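TL;DR: This study proposes a three-axis analytical framework (sensor, label, domain) based on the geospatial foundation model Prithvi-EO-2.0 for landslide mapping. The model significantly outperforms conventional models and shows strong robustness and generalization under challenges such as cross-region transfer and few-shot settings.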
Details
Motivation: Conventional deep learning models perform poorly across different sensors and regions or with limited training data, and cannot meet the need for accurate, rapid landslide hazard mapping. Method: The geospatial foundation model Prithvi-EO-2.0, built on global pretraining, self-supervised learning, and adaptable fine-tuning, is studied within a three-axis analytical framework of sensor, label, and domain, and its performance is assessed through several sets of experiments. Result: Prithvi-EO-2.0 consistently outperforms task-specific models such as U-Net and Segformer as well as other GeoFMs, maintaining high accuracy and good generalization under spectral variation, label scarcity, and cross-domain settings, though computational cost is high and little AI-ready data is available. Conclusion: Geospatial foundation models offer a more robust and scalable approach to landslide risk reduction and environmental monitoring, with broad application prospects, but computational overhead must be reduced and high-quality annotated datasets built. Abstract: Landslides cause severe damage to lives, infrastructure, and the environment, making accurate and timely mapping essential for disaster preparedness and response. However, conventional deep learning models often struggle when applied across different sensors, regions, or under conditions of limited training data. To address these challenges, we present a three-axis analytical framework of sensor, label, and domain for adapting geospatial foundation models (GeoFMs), focusing on Prithvi-EO-2.0 for landslide mapping. Through a series of experiments, we show that it consistently outperforms task-specific CNNs (U-Net, U-Net++), vision transformers (Segformer, SwinV2-B), and other GeoFMs (TerraMind, SatMAE). The model, built on global pretraining, self-supervision, and adaptable fine-tuning, proved resilient to spectral variation, maintained accuracy under label scarcity, and generalized more reliably across diverse datasets and geographic settings. Alongside these strengths, we also highlight remaining challenges such as computational cost and the limited availability of reusable AI-ready training data for landslide research. Overall, our study positions GeoFMs as a step toward more robust and scalable approaches for landslide risk reduction and environmental monitoring.
[95] THEval. Evaluation Framework for Talking Head Video Generation
Nabyl Quignon,Baptiste Chopin,Yaohui Wang,Antitza Dantcheva
Main category: cs.CV
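TL;DR: This paper proposes a new evaluation framework comprising 8 metrics related to quality, naturalness, and synchronization, for a more comprehensive assessment of generated talking-head videos.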
Details
Motivation: Existing evaluation metrics are limited, concentrating on video quality, lip synchronization, and user studies, and do not fully reflect the true quality of generated videos. Method: An evaluation framework is designed along three dimensions (quality, naturalness, and synchronization), with 8 metrics selected for efficiency and alignment with human preferences, giving fine-grained analysis of head, mouth, and eyebrow motion and of face quality. Result: Experiments on 85,000 videos generated by 17 state-of-the-art models show that although many algorithms do well at lip synchronization, expressiveness and artifact-free detail remain challenging. Conclusion: The proposed framework measures progress in talking-head generation more comprehensively; code, dataset, and leaderboards will be released publicly and updated regularly. Abstract: Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation primarily relies on limited metrics, evaluating general video quality, lip synchronization, and on conducting user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on these considerations, we focus the analysis on the fine-grained dynamics of head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated based on a novel real dataset that we have curated in order to mitigate bias of training data. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset and leaderboards will be publicly released and regularly updated with new methods, in order to reflect progress in the field.
[96] Learning from Single Timestamps: Complexity Estimation in Laparoscopic Cholecystectomy
Dimitrios Anastasiou,Santiago Barbarisi,Lucy Culshaw,Jayna Patel,Evangelos B. Mazomenos,Imanol Luengo,Danail Stoyanov
Main category: cs.CV
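TL;DR: This study proposes STC-Net, a new framework for automatically assessing surgical complexity in laparoscopic cholecystectomy (LC) based on the Parkland Grading Scale (PGS). It processes full surgical videos directly under weak temporal supervision and significantly outperforms baseline methods.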
Details
Motivation: Accurate assessment of surgical complexity is essential in LC, where severe inflammation is associated with longer operative times and a higher risk of postoperative complications; however, automated PGS analysis of surgical videos in realistic settings remains largely unexplored, especially for full videos that have not been manually trimmed. Method: The STC-Net framework contains localization, window proposal, and grading modules that jointly perform temporal localization and inflammation grading; it uses a novel loss combining hard and soft localization objectives with background-aware grading supervision, and is trained end to end under weak temporal supervision on full LC videos from the 1,859-video dataset. Result: On a private dataset of 1,859 LC videos, STC-Net reaches an accuracy of 62.11% and an F1-score of 61.42%, exceeding non-localized baselines by more than 10% on both metrics and supporting the effectiveness of weak supervision for surgical complexity assessment. Conclusion: STC-Net demonstrates a scalable and effective way to estimate PGS-based surgical complexity automatically from full LC videos, with potential for post-operative analysis and surgical training. Abstract: Purpose: Accurate assessment of surgical complexity is essential in Laparoscopic Cholecystectomy (LC), where severe inflammation is associated with longer operative times and increased risk of postoperative complications. The Parkland Grading Scale (PGS) provides a clinically validated framework for stratifying inflammation severity; however, its automation in surgical videos remains largely unexplored, particularly in realistic scenarios where complete videos must be analyzed without prior manual curation. Methods: In this work, we introduce STC-Net, a novel framework for Single-Timestamp-based Complexity estimation in LC via the PGS, designed to operate under weak temporal supervision. Unlike prior methods limited to static images or manually trimmed clips, STC-Net operates directly on full videos. It jointly performs temporal localization and grading through a localization, window proposal, and grading module. We introduce a novel loss formulation combining hard and soft localization objectives and background-aware grading supervision. Results: Evaluated on a private dataset of 1,859 LC videos, STC-Net achieves an accuracy of 62.11% and an F1-score of 61.42%, outperforming non-localized baselines by over 10% in both metrics and highlighting the effectiveness of weak supervision for surgical complexity assessment.
Conclusion: STC-Net demonstrates a scalable and effective approach for automated PGS-based surgical complexity estimation from full LC videos, making it promising for post-operative analysis and surgical training.
[97] UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction
Chen Shi,Shaoshuai Shi,Xiaoyang Lyu,Chunyang Liu,Kehua Sheng,Bo Zhang,Li Jiang
Main category: cs.CV
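TL;DR: UniSplat is a general feed-forward framework for dynamic 3D scene reconstruction in autonomous driving; through a unified latent spatio-temporal fusion mechanism, it achieves robust reconstruction under sparse, non-overlapping views and complex scene dynamics.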
Details
Motivation: Existing methods perform poorly under the joint challenges of sparse, non-overlapping camera views and complex scene dynamics, and struggle to produce complete, consistent dynamic 3D reconstructions. Method: A 3D latent scaffold is constructed that uses pretrained foundation models to capture geometric and semantic context; an efficient in-scaffold fusion mechanism aligns information across space and time; a dual-branch decoder combines point-anchored refinement with voxel-based generation to produce dynamic-aware Gaussians, and a persistent memory of static Gaussians supports streaming scene completion. Result: Experiments on real-world datasets show that UniSplat achieves state-of-the-art performance in novel view synthesis and produces robust, high-quality renderings for viewpoints outside the original camera coverage. Conclusion: Through unified latent spatio-temporal fusion and a persistent representation, UniSplat addresses 3D reconstruction under sparse views and dynamic scenes in autonomous driving and shows good potential for practical use. Abstract: Feed-forward 3D reconstruction for autonomous driving has advanced rapidly, yet existing methods struggle with the joint challenges of sparse, non-overlapping camera views and complex scene dynamics. We present UniSplat, a general feed-forward framework that learns robust dynamic scene reconstruction through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent scaffold, a structured representation that captures geometric and semantic scene context by leveraging pretrained foundation models. To effectively integrate information across spatial views and temporal frames, we introduce an efficient fusion mechanism that operates directly within the 3D scaffold, enabling consistent spatio-temporal alignment. To ensure complete and detailed reconstructions, we design a dual-branch decoder that generates dynamic-aware Gaussians from the fused scaffold by combining point-anchored refinement with voxel-based generation, and maintain a persistent memory of static Gaussians to enable streaming scene completion beyond current camera coverage. Extensive experiments on real-world datasets demonstrate that UniSplat achieves state-of-the-art performance in novel view synthesis, while providing robust and high-quality renderings even for viewpoints outside the original camera coverage.
[98] PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning
Yicheng Xiao,Yu Chen,Haoxuan Ma,Jiale Hong,Caorui Li,Lingxiang Wu,Haiyun Guo,Jinqiao Wang
Main category: cs.CV
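TL;DR: This paper proposes PixCLIP, a novel framework that handles both visual prompts and long textual descriptions. By constructing the LongGRIT dataset of nearly 1.5 million samples and adopting a three-branch pixel-text alignment learning framework, it makes substantial progress in fine-grained image-text alignment.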
Details
Motivation: Although CLIP has achieved notable success on many vision-language understanding tasks, its fine-grained image-text alignment still needs improvement; in particular, the token length limit of its text encoder makes it hard to process detailed long-text information. Method: The PixCLIP framework first builds an automated annotation pipeline that generates pixel-level localized, long-form textual descriptions and uses it to construct the LongGRIT dataset; it then replaces CLIP's original text encoder with a large language model and designs a three-branch pixel-text alignment learning framework for fine-grained alignment between image regions and textual descriptions at arbitrary granularity. Result: Experiments show that PixCLIP performs well in pixel-level interaction and long-text handling, achieving state-of-the-art performance. Conclusion: By jointly increasing the granularity of visual and textual information processing, PixCLIP improves fine-grained image-text alignment and opens a new direction for follow-up research. Abstract: While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works adopt the strategy of explicitly increasing the granularity of visual information processing, e.g., incorporating visual prompts to guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve the model's fine-grained vision-language alignment. However, the inherent token length limitation of CLIP's text encoder fundamentally prevents CLIP from processing the more granular textual information embedded in long text sequences. To synergistically leverage the advantages of enhancing both visual and textual content processing granularity, we propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Utilizing this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples.
Secondly, we replace CLIP's original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework, facilitating fine-grained alignment between image regions and corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP showcases breakthroughs in pixel-level interaction and handling long-form texts, achieving state-of-the-art performance.
[99] Building Trust in Virtual Immunohistochemistry: Automated Assessment of Image Quality
Tushar Kataria,Shikha Dubey,Mary Bronner,Jolanta Jedrzkiewicz,Ben J. Brintz,Shireen Y. Elhabian,Beatrice S. Knudsen
Main category: cs.CV
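TL;DR: This paper proposes an automated, accuracy-based framework for assessing the quality of virtual immunohistochemistry (IHC) stain images using pixel-level stain accuracy metrics (Dice, IoU, Hausdorff distance). It finds that conventional image fidelity metrics correlate poorly with actual stain accuracy, that paired models perform better, and that whole-slide image evaluation is more sensitive than patch-level evaluation.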
Details
Motivation: Existing texture- and distribution-based image quality metrics do not accurately reflect the accuracy of virtual IHC staining, and an objective, reliable evaluation method that needs no manual annotation is lacking. Method: Color deconvolution is used to generate masks of brown-stained pixels for real and virtual IHC; the segmented masks are used to compute Dice, IoU, and Hausdorff distance as stain accuracy metrics; sixteen paired or unpaired image translation models are compared, and performance is verified on whole-slide images. Result: Conventional metrics (FID, PSNR, SSIM) correlate poorly with stain accuracy and pathologist assessment; paired models (such as PyramidPix2Pix and AdaptiveNCE) achieve the highest stain accuracy, while unpaired diffusion- and GAN-based models are less reliable; whole-slide evaluation reveals performance declines that patch-level evaluation misses. Conclusion: The framework provides a reproducible, accuracy-based method for assessing virtual IHC image quality, an important step toward clinical use in pathology. Abstract: Deep learning models can generate virtual immunohistochemistry (IHC) stains from hematoxylin and eosin (H&E) images, offering a scalable and low-cost alternative to laboratory IHC. However, reliable evaluation of image quality remains a challenge as current texture- and distribution-based metrics quantify image fidelity rather than the accuracy of IHC staining. Here, we introduce an automated and accuracy-grounded framework to determine image quality across sixteen paired or unpaired image translation models. Using color deconvolution, we generate masks of pixels stained brown (i.e., IHC-positive) as predicted by each virtual IHC model. We use the segmented masks of real and virtual IHC to compute stain accuracy metrics (Dice, IoU, Hausdorff distance) that directly quantify correct pixel-level labeling without needing expert manual annotations. Our results demonstrate that conventional image fidelity metrics, including Frechet Inception Distance (FID), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM), correlate poorly with stain accuracy and pathologist assessment. Paired models such as PyramidPix2Pix and AdaptiveNCE achieve the highest stain accuracy, whereas unpaired diffusion- and GAN-based models are less reliable in providing accurate IHC positive pixel labels. Moreover, whole-slide images (WSI) reveal performance declines that are invisible in patch-based evaluations, emphasizing the need for WSI-level benchmarks.
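The three stain accuracy metrics named in the abstract have standard definitions, and the Python sketch below computes them for a pair of tiny binary masks. It is an illustration under stated assumptions, not the authors' pipeline: the masks are hand-written 0/1 grids standing in for the brown-pixel masks that color deconvolution would produce, the Hausdorff distance is the plain symmetric variant measured in pixel units, and the function names are hypothetical.

```python
import math


def positive_pixels(mask):
    """Return the set of (row, col) coordinates where the binary mask is nonzero."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def dice_iou(real, virtual):
    """Dice coefficient and IoU between two binary masks of the same shape."""
    a, b = positive_pixels(real), positive_pixels(virtual)
    inter, union = len(a & b), len(a | b)
    if union == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    return 2 * inter / (len(a) + len(b)), inter / union


def hausdorff(real, virtual):
    """Symmetric Hausdorff distance (in pixels) between the positive pixel sets."""
    a, b = positive_pixels(real), positive_pixels(virtual)
    if not a or not b:
        raise ValueError("each mask needs at least one positive pixel")

    def directed(src, dst):
        # Largest distance from any point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))


# Toy masks: 1 marks an IHC-positive (brown) pixel.
real_mask = [[0, 1, 1],
             [0, 1, 1],
             [0, 0, 0]]
virtual_mask = [[0, 0, 1],
                [0, 1, 1],
                [0, 0, 1]]

print(dice_iou(real_mask, virtual_mask), hausdorff(real_mask, virtual_mask))
```

Dice and IoU reward overlapping positive area, while the Hausdorff distance penalizes the single worst-placed positive pixel, so the three can disagree on which model is best. The brute-force Hausdorff loop is quadratic in the number of positive pixels and is only practical for small patches.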
Together, this framework defines a reproducible approach for assessing the quality of virtual IHC models, a critical step to accelerate translation towards routine use by pathologists.
[100] NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment
Kylie Cancilla,Alexander Moore,Amar Saini,Carmen Carrano
Main category: cs.CV
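TL;DR: This paper proposes a temporal-aware no-reference video quality assessment model trained on synthetically degraded data that directly predicts full-reference metrics, without human labels or reference videos.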
Details
Motivation: Existing video quality assessment methods depend on reference videos or human ratings, and most no-reference methods ignore temporal information, making them hard to apply to video detection tasks in real-world settings. Method: Using synthetically degraded versions of the DAVIS dataset, a temporal-aware convolutional network is trained to predict full-reference metrics such as LPIPS, PSNR, and SSIM, with no reference video needed at inference. Result: The method outperforms an image-based baseline across multiple degradation types and correlates more strongly with full-reference metrics than BRISQUE, supporting the value of temporal modeling. Conclusion: By incorporating temporal information, the proposed streaming, no-reference, opinion-unaware VQA model improves scalability and performance in real-world applications. Abstract: Video quality assessment (VQA) is vital for computer vision tasks, but existing approaches face major limitations: full-reference (FR) metrics require clean reference videos, and most no-reference (NR) models depend on training on costly human opinion labels. Moreover, most opinion-unaware NR methods are image-based, ignoring temporal context critical for video object detection. In this work, we present a scalable, streaming-based VQA model that is both no-reference and opinion-unaware. Our model leverages synthetic degradations of the DAVIS dataset, training a temporal-aware convolutional architecture to predict FR metrics (LPIPS, PSNR, SSIM) directly from degraded video, without references at inference. We show that our streaming approach outperforms our own image-based baseline by generalizing across diverse degradations, underscoring the value of temporal modeling for scalable VQA in real-world vision systems. Additionally, we demonstrate that our model achieves higher correlation with full-reference metrics compared to BRISQUE, a widely-used opinion-aware image quality assessment baseline, validating the effectiveness of our temporal, opinion-unaware approach.
[101] Polarization-resolved imaging improves eye tracking
Mantas Žurauskas,Tom Bu,Sanaz Alali,Beyza Kalkanli,Derek Shi,Fernando Alamos,Gauresh Pandit,Christopher Mei,Ali Behrooz,Ramin Mirjalili,Dave Stronks,Alexander Fix,Dmitri Model
Main category: cs.CV
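TL;DR: Presents a polarization-enabled eye tracking (PET) system based on polarization-resolved near-infrared imaging; combining a polarization-filter-array camera with a linearly polarized near-infrared light source improves eye-tracking accuracy under challenging conditions.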
Details
Motivation: Conventional intensity images limit eye tracking because of illumination changes, eyelid occlusion, and similar factors, and they lack stable features; polarization information is therefore needed to enhance optical contrast and extract more features of ocular structures. Method: A PET system is built from a polarization-filter-array camera and a linearly polarized near-infrared illuminator to capture the polarization state of light reflected from the eye; convolutional neural network models are trained on data from 346 participants and evaluated for eye-tracking performance under several perturbing conditions. Result: Compared with intensity-only baselines, the PET system reduces the median 95th-percentile absolute gaze error by 10-16% under nominal conditions and in the presence of eyelid occlusion, eye-relief changes, and pupil-size variation. Conclusion: Polarization information effectively improves the robustness and accuracy of eye-tracking systems, and PET, as a simple and reliable sensing modality, has potential for use in wearable devices. Abstract: Polarization-resolved near-infrared imaging adds a useful optical contrast mechanism to eye tracking by measuring the polarization state of light reflected by ocular tissues in addition to its intensity. In this paper we demonstrate how this contrast can be used to enable eye tracking. Specifically, we demonstrate that a polarization-enabled eye tracking (PET) system composed of a polarization-filter-array camera paired with a linearly polarized near-infrared illuminator can reveal trackable features across the sclera and gaze-informative patterns on the cornea, largely absent in intensity-only images. Across a cohort of 346 participants, convolutional neural network based machine learning models trained on data from PET reduced the median 95th-percentile absolute gaze error by 10-16% relative to capacity-matched intensity baselines under nominal conditions and in the presence of eyelid occlusions, eye-relief changes, and pupil-size variation. These results link light-tissue polarization effects to practical gains in human-computer interaction and position PET as a simple, robust sensing modality for future wearable devices.
[102] Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
Ellis Brown,Jihan Yang,Shusheng Yang,Rob Fergus,Saining Xie
Main category: cs.CV
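TL;DR: Proposes a diagnostic and debiasing framework for detecting and mitigating non-visual biases in multimodal large language model benchmarks; using a test-set stress test and iterative bias pruning, it reveals severe non-visual shortcuts in several existing benchmarks and builds the more robust VSI-Bench-Debiased.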
Details
Motivation: Existing vision-centric multimodal benchmarks can be exploited through language priors and superficial patterns, letting models score highly without real visual understanding, so stricter diagnostic methods are needed to assess benchmark reliability. Method: Proposes the Test-set Stress-Test (TsT), which fine-tunes a large language model on text-only inputs via k-fold cross-validation to probe the test set for exploitable patterns, combined with Random Forest feature analysis for interpretable auditing; further proposes Iterative Bias Pruning (IBP) to remove high-bias samples. Result: Non-visual biases are found to be pervasive across four benchmarks: VSI-Bench, CV-Bench, MMMU, and VideoMME. The resulting VSI-Bench-Debiased significantly lowers the performance of language-only models and widens the performance gap between vision-blind and vision-enabled models. Conclusion: Benchmark designers should proactively "attack" their own test sets to identify biases; the proposed framework helps build more reliable multimodal benchmarks that truly depend on visual understanding. Abstract: Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set": probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via $k$-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score $s(x)$. We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks (VSI-Bench, CV-Bench, MMMU, and VideoMME), we uncover pervasive non-visual biases.
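The k-fold idea behind the bias score $s(x)$ can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the fine-tuned LLM is replaced by a trivial "most common answer per question text" predictor, $s(x)$ is taken to be 1 when the held-out answer is guessed from text alone and 0 otherwise, and the final filter is a single pass rather than the iterative IBP procedure the abstract describes.

```python
from collections import Counter, defaultdict


def blind_bias_scores(samples, k=3):
    """samples: list of (question_text, answer). Returns one score per sample."""
    scores = [0.0] * len(samples)
    for fold in range(k):
        held_out = [i for i in range(len(samples)) if i % k == fold]
        train = [i for i in range(len(samples)) if i % k != fold]
        # Stand-in "blind" model: most common answer seen for each question text.
        by_question = defaultdict(Counter)
        for i in train:
            question, answer = samples[i]
            by_question[question][answer] += 1
        for i in held_out:
            question, answer = samples[i]
            if by_question[question]:  # question text was seen in the training folds
                guess = by_question[question].most_common(1)[0][0]
                scores[i] = 1.0 if guess == answer else 0.0
    return scores


# Hypothetical text-only test items; no images are consulted anywhere.
samples = [
    ("How many chairs?", "2"),
    ("How many chairs?", "2"),
    ("How many chairs?", "2"),
    ("How many chairs?", "5"),
    ("Which is closer?", "left"),
    ("What color is the door?", "red"),
]
scores = blind_bias_scores(samples, k=3)
# Single filtering pass: keep only samples the blind model could not guess.
kept = [s for s, score in zip(samples, scores) if score < 0.5]
print(scores, len(kept))
```

Each sample is scored only by a model that never saw it, which is what makes "training on the test set" a fair probe of exploitable patterns rather than plain memorization.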
As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
[103] SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Ellis Brown,Arijit Ray,Ranjay Krishna,Ross Girshick,Rob Fergus,Saining Xie
Main category: cs.CV
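TL;DR: Proposes the SIMS-V framework, which uses 3D simulators to generate spatially rich video training data; with only 25K examples it surpasses larger models on spatial reasoning tasks.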
Details
Motivation: Existing video data lacks diverse and precise spatial annotations, which limits the ability of multimodal language models to reason spatially across time and space. Method: Builds the SIMS-V data-generation framework, which uses the privileged information of 3D simulators to generate spatially annotated video data, and identifies the three most effective question types through systematic ablation studies. Result: A 7B-parameter model fine-tuned on only 25K simulated examples outperforms a 72B baseline and reaches performance comparable to proprietary models on real-world spatial reasoning benchmarks. Conclusion: SIMS-V efficiently improves the spatial reasoning ability of multimodal language models, with strong generalization and potential for practical use. Abstract: Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V, a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
[104] Cambrian-S: Towards Spatial Supersensing in Video
Shusheng Yang,Jihan Yang,Pinzhi Huang,Ellis Brown,Zihao Yang,Yue Yu,Shengbang Tong,Zihan Zheng,Yifan Xu,Muhan Wang,Daohan Lu,Rob Fergus,Yann LeCun,Li Fei-Fei,Saining Xie
Main category: cs.CV
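TL;DR: Proposes a new "supersensing" paradigm, arguing that multimodal intelligence must move beyond passive task handling and brute-force long-context scaling and develop four stages of capability: semantic perception, event cognition, implicit 3D spatial understanding, and predictive world modeling. It releases the VSI-SUPER benchmark, demonstrates the limitations of current models, and proposes prediction-based sensing as a future direction.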
Details
Motivation: Existing vision-language models rely mainly on reactive tasks and long-context processing, and lack comprehensive evaluation and improvement of spatial cognition and real-world modeling, which makes true multimodal intelligence hard to achieve. Method: Proposes a four-stage framework for spatial supersensing, builds the VSI-SUPER benchmark comprising long-horizon visual spatial recall (VSR) and continual visual spatial counting (VSC), trains the Cambrian-S model, and explores a memory and event segmentation mechanism driven by a self-supervised next-latent-frame predictor. Result: Cambrian-S improves by 30% on VSI-Bench, but its performance on VSI-SUPER remains limited; adding the predictive sensing mechanism substantially outperforms leading proprietary baselines. Conclusion: Scaling data and models alone is not enough for spatial supersensing; models need the ability to anticipate, select, and organize experience, and predictive sensing is a key path toward true multimodal intelligence. Abstract: We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation.
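The notion of using surprise (prediction error) to segment events can be sketched in a few lines. This is a toy illustration under stated assumptions: the learned next-latent-frame predictor is replaced by a trivial "next latent equals current latent" guess, surprise is taken to be mean squared error, and the fixed `threshold` is a hypothetical parameter rather than anything specified in the abstract.

```python
def segment_by_surprise(latents, threshold):
    """Return frame indices at which a new event is assumed to start."""
    boundaries = [0]  # the first frame always opens an event
    for t in range(1, len(latents)):
        predicted = latents[t - 1]  # stand-in predictor: copy the previous latent
        # Surprise = mean squared error between prediction and observed latent.
        surprise = sum((p - x) ** 2 for p, x in zip(predicted, latents[t])) / len(predicted)
        if surprise > threshold:
            boundaries.append(t)
    return boundaries


# Toy 2-D latents: a slow drift, then an abrupt jump at frame 3.
latents = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [2.0, 2.0], [2.1, 2.0]]
boundaries = segment_by_surprise(latents, threshold=0.5)
print(boundaries)
```

Frames with low surprise are the ones a memory mechanism could compress or skip, while high-surprise frames mark the points worth storing.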
On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
[105] InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation
Jinlai Liu,Jian Han,Bin Yan,Hui Wu,Fengda Zhu,Xing Wang,Yi Jiang,Bingyue Peng,Zehuan Yuan
Main category: cs.CV
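TL;DR: InfinityStar is a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis; it generates 720p video efficiently, substantially outperforms existing autoregressive models on VBench, and even surpasses some diffusion models.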
Details
Motivation: Existing video generation models struggle to balance efficiency and quality, and most cannot support multiple generation tasks at once, so a unified, efficient framework for high-quality, high-resolution video generation is needed. Method: Using purely discrete autoregressive modeling, spatial and temporal dependencies are modeled jointly in a single architecture, and straightforward temporal autoregression supports tasks such as text-to-image, text-to-video, and image-to-video. Result: Scores 83.74 on VBench, well above other autoregressive models, and generates a 5-second 720p video about 10x faster than leading diffusion models without extra optimizations. Conclusion: InfinityStar is the first discrete autoregressive video generator able to produce industrial-level 720p video, combining efficiency with high quality and advancing efficient video generation. Abstract: We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long interactive video synthesis via straightforward temporal autoregression. Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins and even surpassing some diffusion competitors such as HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10x faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial-level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.
[106] Tracking and Understanding Object Transformations
Yihong Sun,Xinyu Yang,Jennifer J. Sun,Bharath Hariharan
Main category: cs.CV
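TL;DR: Introduces the Track Any State task, which aims to track objects as they change through state transformations, together with the VOST-TAS dataset and the TubeletGraph method, achieving state-of-the-art tracking performance under transformations.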
Details
Motivation: Existing methods tend to lose the target when an object's appearance changes significantly, and so cannot effectively track real-world objects undergoing state transformations. Method: Proposes TubeletGraph, which uses semantic and proximity priors to identify and integrate overlooked tracks and builds a graph describing state evolution, recovering and tracking objects after transformation. Result: Achieves state-of-the-art tracking performance on the VOST-TAS dataset while showing a deeper understanding of object state transformations, with good temporal grounding and semantic reasoning. Conclusion: TubeletGraph effectively handles the challenges posed by object state transformations, modeling and tracking complex transformations in a zero-shot setting and advancing the understanding of dynamic objects. Abstract: Real-world objects frequently undergo state transformations. From an apple being cut into pieces to a butterfly emerging from its cocoon, tracking through these changes is important for understanding real-world objects and dynamics. However, existing methods often lose track of the target object after transformation, due to significant changes in object appearance. To address this limitation, we introduce the task of Track Any State: tracking objects through transformations while detecting and describing state changes, accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states are evolving over time. TubeletGraph first identifies potentially overlooked tracks and determines whether they should be integrated based on semantic and proximity priors. Then, it reasons about the added tracks and generates a state graph describing each observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations, while demonstrating deeper understanding of object transformations and promising capabilities in temporal grounding and semantic reasoning for complex object transformations. Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io.
[107] Carousel: A High-Resolution Dataset for Multi-Target Automatic Image Cropping
Rafe Loya,Andrew Hamara,Benjamin Estell,Benjamin Kilpatrick,Andrew C. Freeman
Main category: cs.CV
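TL;DR: Examines the problem of automatically generating multiple distinct, aesthetically appealing crops of an image, presents a dataset of 277 images with human annotations, and evaluates single-crop models combined with an image-segmentation preprocessing step.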