cs.CL
[1] Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
Pranav Bhandari,Nicolas Fay,Sanjeevan Selvaganapathy,Amitava Datta,Usman Naseem,Mehwish Nasim
Main category: cs.CL
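TL;DR: Proposes a method that extracts hidden states from transformer models using the Big Five personality traits, applies low-rank subspace discovery to identify optimal layers across architectures, and enables precise control over the personality traits expressed in language model outputs.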
Details
Motivation: Effective mechanisms for steering the behaviour of large language models during generation are still lacking, and the relationship between personality traits and the model's internal representations is underexplored. How these representations can be used to guide model behaviour needs investigation.
Method: A new pipeline extracts hidden-state activations from transformer layers based on the Big Five personality traits, applies low-rank subspace discovery, identifies trait-specific optimal layers, and intervenes through a flexible steering framework with dynamic layer selection.
Result: Personality traits occupy a low-rank shared subspace. Careful perturbations turn this structure into an effective steering mechanism that controls personality expression precisely without harming fluency, variance, or general capabilities.
Conclusion: The method connects psychological theory with model alignment practice and offers a path toward controllable, personalised language models.
Abstract: Large Language Models exhibit implicit personalities in their generation, but reliably controlling or aligning these traits to meet specific needs remains an open challenge. The need for effective mechanisms for behavioural manipulation of the model during generation is a critical gap in the literature that needs to be fulfilled. Personality-aware LLMs hold a promising direction towards this objective. However, the relationship between these psychological constructs and their representations within LLMs remains underexplored and requires further investigation. Moreover, it is intriguing to understand and study the use of these representations to steer the models' behaviour. We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), which is a comprehensive and empirically validated framework to model human personality, applies low-rank subspace discovery methods, and identifies trait-specific optimal layers across different model architectures for robust injection. The resulting personality-aligned directions are then operationalised through a flexible steering framework with dynamic layer selection, enabling precise control of trait expression in LLM outputs. Our findings reveal that personality traits occupy a low-rank shared subspace, and that these latent structures can be transformed into actionable mechanisms for effective steering through careful perturbations without impacting the fluency, variance and general capabilities, helping to bridge the gap between psychological theory and practical model alignment.
[2] TextualVerifier: Verify TextGrad Step-by-Step
Eugenius Mario Situmorang,Adila Alfa Krisnadhi,Ari Wibisono
Main category: cs.CL
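TL;DR: Proposes TextualVerifier, a verification framework based on LLM chain-of-thought reasoning and majority voting, to address TextGrad's lack of a self-verification mechanism in text-based decision making.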
Details
Motivation: TextGrad enables text-based automatic differentiation for optimization, but it lacks a self-verification mechanism that ensures reasoning validity, which limits its reliability in complex AI systems.
Method: TextualVerifier uses a four-stage workflow (chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation) and integrates non-invasively with TextGrad at the loss function and optimization result verification stages.
Result: In standalone evaluation, the validity of reasoning steps improves by 29 percent. Integrated with TextGrad, performance improves by 2.2 to 10.71 percentage points across benchmarks, with the loss-function integration adding 5.9 LLM calls on average.
Conclusion: TextualVerifier is presented as the first self-verification framework for TextGrad. It improves the reliability of text-based optimization without numerical gradients and opens new directions for verification in text-driven AI systems.
Abstract: TextGrad is a novel approach to text-based automatic differentiation that enables composite AI systems to perform optimization without explicit numerical equations. However, it currently lacks self-verification mechanisms that ensure reasoning validity in text-based decision making. This research introduces TextualVerifier, a verification framework that leverages chain-of-thought reasoning and majority voting with large language models to address this verification gap. TextualVerifier implements a four-stage workflow: chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. It integrates non-invasively with TextGrad at both the loss function and optimization result verification stages. Experimental evaluation using the Gemini 1.5 Pro model is conducted in two phases: (1) standalone evaluation on PRM800K, and (2) integrated evaluation with TextGrad on GPQA-Diamond, MMLU-ML, and MMLU-CP benchmarks. Results show statistically significant improvements (p < 0.001). In phase one, TextualVerifier improves the validity of reasoning steps by 29 percent. In phase two, integration into TextGrad loss function yields a 2.2 percentage point gain from 68.2 to 70.4 percent with a moderate overhead of 5.9 LLM calls on average. Further evaluations of TextualVerifier versioning yield 8.08, 10.71, and 3.92 percentage point improvements on GPQA, MMLU-ML, and MMLU-CP respectively. TextualVerifier thus presents the first self-verification framework for TextGrad through LLM-based techniques without requiring numerical gradients, enabling more reliable reasoning and opening new directions for verification in text-based optimization.
[3] GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation
Stergios Chatzikyriakidis,Dimitris Papadakis,Sevasti-Ioanna Papaioannou,Erofili Psaltaki
Main category: cs.CL
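TL;DR: Introduces GRDD+, an extended Greek dialectal dataset that adds data for several Greek varieties, making it the largest and most varied dataset of its kind to date, and evaluates it through fine-tuning experiments on several large language models.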
Details
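Motivation: Existing Greek dialectal datasets are limited in coverage and size; a more comprehensive and varied dataset is needed to support research in this area.
Method: The original GRDD dataset is extended with more data from four varieties (Cretan, Cypriot, Pontic, and Northern Greek) and six new varieties are added, yielding GRDD+ with 10 varieties and about 6.37 million words. Several LLMs are then fine-tuned on it.
Result: GRDD+ is the largest and most varied Greek dialectal dataset to date. The fine-tuning experiments assess how good-quality dialectal data affects LLMs, comparing three fine-tuned 8B models against frontier models.
Conclusion: GRDD+ is an important resource for research on Greek dialects and illustrates the potential of rich dialectal data for improving language model performance.
Abstract: We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with total size 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to see the effect of good quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).
[4] PLLuM: A Family of Polish Large Language Models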
Jan Kocoń,Maciej Piasecki,Arkadiusz Janz,Teddy Ferdinan,Łukasz Radliński,Bartłomiej Koptyra,Marcin Oleksy,Stanisław Woźniak,Paweł Walkowiak,Konrad Wojtasik,Julia Moska,Tomasz Naskręt,Bartosz Walkowiak,Mateusz Gniewkowski,Kamil Szyc,Dawid Motyka,Dawid Banach,Jonatan Dalasiński,Ewa Rudnicka,Bartłomiej Alberski,Tomasz Walkowiak,Aleksander Szczęsny,Maciej Markiewicz,Tomasz Bernaś,Hubert Mazur,Kamil Żyta,Mateusz Tykierko,Grzegorz Chodak,Tomasz Kajdanowicz,Przemysław Kazienko,Agnieszka Karlińska,Karolina Seweryn,Anna Kołos,Maciej Chrabąszcz,Katarzyna Lorenc,Aleksandra Krasnodębska,Artur Wilczek,Katarzyna Dziewulska,Paula Betscher,Zofia Cieślińska,Katarzyna Kowol,Daria Mikoś,Maciej Trzciński,Dawid Krutul,Marek Kozłowski,Sławomir Dadas,Rafał Poświata,Michał Perełkiewicz,Małgorzata Grębowiec,Maciej Kazuła,Marcin Białas,Roman Roszko,Danuta Roszko,Jurgita Vaičenonienė,Andrius Utka,Paweł Levchuk,Paweł Kowalski,Irena Prawdzic-Jankowska,Maciej Ogrodniczuk,Monika Borys,Anna Bulińska,Wiktoria Gumienna,Witold Kieraś,Dorota Komosińska,Katarzyna Krasnowska-Kieraś,Łukasz Kobyliński,Martyna Lewandowska,Marek Łaziński,Mikołaj Łątkowski,Dawid Mastalerz,Beata Milewicz,Agnieszka Anna Mykowiecka,Angelika Peljak-Łapińska,Sandra Penno,Zuzanna Przybysz,Michał Rudolf,Piotr Rybak,Karolina Saputa,Aleksandra Tomaszewska,Aleksander Wawer,Marcin Woliński,Joanna Wołoszyn,Alina Wróblewska,Bartosz Żuk,Filip Żarnecki,Konrad Kaczyński,Anna Cichosz,Zuzanna Deckert,Monika Garnys,Izabela Grabarczyk,Wojciech Janowski,Sylwia Karasińska,Aleksandra Kujawiak,Piotr Misztela,Maria Szymańska,Karolina Walkusz,Igor Siek,Jakub Kwiatkowski,Piotr Pęzik
Main category: cs.CL
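TL;DR: PLLuM is the largest open-source family of foundation models tailored to the Polish language, developed by a consortium of major Polish research institutions. It includes a 140-billion-token Polish pre-training corpus, custom instruction and preference datasets, and a Responsible AI framework for safety and compliance, with the aim of fostering open research and sovereign AI technologies in Poland.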
Details
Motivation: LLM development has focused mainly on English, leaving limited support for other languages, including Polish. High-quality, transparent, and culturally relevant language models are needed beyond the English-centric commercial landscape.
Method: The authors build a 140-billion-token Polish text corpus for pre-training, a dataset of 77k custom instructions, and a 100k preference optimization dataset. They train base and instruction-tuned variants with alignment techniques, and add a Responsible AI framework that combines strict data governance with a hybrid module for output correction and safety filtering.
Result: The PLLuM model family was developed, and its utility is demonstrated in a downstream task within public administration.
Conclusion: By releasing the models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.
Abstract: Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.
[5] STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models
Mohammad Atif Quamar,Mohammad Areeb,Mikhail Kuznetsov,Muslum Ozgur Ozmen,Z. Berkay Celik
Main category: cs.CL
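TL;DR: Proposes STARS, a decoding-time alignment algorithm that samples, scores, and accepts or rejects token segments, improving computational efficiency and alignment quality relative to fine-tuning baselines.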
Details
Motivation: Existing alignment methods such as fine-tuning are computationally expensive, while inference-time methods such as Best-of-N sampling need practically infeasible amounts of computation to reach optimal alignment.
Method: STARS iteratively samples fixed-size token segments at decoding time, scores them, and accepts or rejects them based on reward, which allows early correction of the generation path.
Result: Across six LLMs, STARS outperforms Supervised Fine-Tuning by up to 14.9 percentage points and DPO by up to 4.3 percentage points on win-rates, while remaining competitive with strong Best-of-N baselines.
Conclusion: STARS offers a generalizable, robust, and efficient approach to LLM alignment and an alternative to traditional fine-tuning and full-sequence ranking methods.
Abstract: Aligning large language models with human values is crucial for their safe deployment; however, existing methods, such as fine-tuning, are computationally expensive and suboptimal. In contrast, inference-time approaches like Best-of-N sampling require practically infeasible computation to achieve optimal alignment. We propose STARS: Segment-level Token Alignment with Rejection Sampling, a decoding-time algorithm that steers model generation by iteratively sampling, scoring, and rejecting/accepting short, fixed-size token segments. This allows for early correction of the generation path, significantly improving computational efficiency and boosting alignment quality. Across a suite of six LLMs, we show that STARS outperforms Supervised Fine-Tuning (SFT) by up to 14.9 percentage points and Direct Preference Optimization (DPO) by up to 4.3 percentage points on win-rates, while remaining highly competitive with strong Best-of-N baselines. Our work establishes granular, reward-guided sampling as a generalizable, robust, and efficient alternative to traditional fine-tuning and full-sequence ranking methods for aligning LLMs.
[6] Divide, Cache, Conquer: Dichotomic Prompting for Efficient Multi-Label LLM-Based Classification
Mikołaj Langner,Jan Eliasz,Ewa Rudnicka,Jan Kocoń
Main category: cs.CL
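TL;DR: Proposes an efficient multi-label text classification method based on binary (yes/no) decisions and prefix caching, uses LLM-to-SLM distillation to improve small models, and demonstrates the approach on affective text analysis.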
Details
Motivation: Conventional multi-label classification with large language models is inefficient and scales poorly to large label spaces, so a more efficient inference method is needed.
Method: Multi-label classification is decomposed into independent yes/no questions, combined with a prefix caching mechanism for efficient inference. A large annotator model (DeepSeek-V3) produces multiple annotations per text, which are aggregated and used to fine-tune smaller models through distillation.
Result: Across 24 affective dimensions, the fine-tuned small models (HerBERT-Large, CLARIN-1B, PLLuM-8B, Gemma3-1B) improve significantly over zero-shot baselines, and inference is efficient without loss of accuracy.
Conclusion: Decomposing multi-label classification into binary queries, combined with distillation and cache-aware inference, gives a scalable and effective framework for LLM-based classification that is applicable across domains.
Abstract: We introduce a method for efficient multi-label text classification with large language models (LLMs), built on reformulating classification tasks as sequences of dichotomic (yes/no) decisions. Instead of generating all labels in a single structured response, each target dimension is queried independently, which, combined with a prefix caching mechanism, yields substantial efficiency gains for short-text inference without loss of accuracy. To demonstrate the approach, we focus on affective text analysis, covering 24 dimensions including emotions and sentiment. Using LLM-to-SLM distillation, a powerful annotator model (DeepSeek-V3) provides multiple annotations per text, which are aggregated to fine-tune smaller models (HerBERT-Large, CLARIN-1B, PLLuM-8B, Gemma3-1B). The fine-tuned models show significant improvements over zero-shot baselines, particularly on the dimensions seen during training. Our findings suggest that decomposing multi-label classification into dichotomic queries, combined with distillation and cache-aware inference, offers a scalable and effective framework for LLM-based classification. While we validate the method on affective states, the approach is general and applicable across domains.
[7] Evaluating Machine Translation Datasets for Low-Web Data Languages: A Gendered Lens
Hellina Hailu Nigatu,Bethelhem Yemane Mamo,Bontu Fufa Balcha,Debora Taye Tesfaye,Elbethel Daniel Zewdie,Ikram Behiru Nesiru,Jitu Ewnetu Hailu,Senait Mengesha Yayo
Main category: cs.CL
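TL;DR: Studies the quality of machine translation datasets for three low-resource languages (Afan Oromo, Amharic, and Tigrinya), with a focus on gender representation. The datasets skew male and contain harmful stereotypes about women, and larger data volume does not guarantee quality; the authors call for attention to quality when building datasets for low-resource languages.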
Details
Motivation: As low-resource languages are increasingly included in NLP research, large-scale data collection is emphasized, often favouring quantity over quality. This risks poorly performing technology and the spread of societal biases, so the quality of these datasets, and gender representation in particular, needs to be assessed.
Method: The authors analyse the domain distribution of machine translation training data and benchmark datasets for three low-resource languages (Afan Oromo, Amharic, Tigrinya), and systematically examine gender skew and stereotypes in names of persons, the grammatical gender of verbs, and textual depictions.
Result: Training data comes largely from political and religious domains, while benchmark datasets focus on news, health, and sports. The datasets skew strongly male and contain harmful and toxic depictions of women, which are more prominent for the language with the most data.
Conclusion: Data quantity does not guarantee data quality. Current datasets for low-resource languages contain notable gender bias and harmful content, which calls for quality review and early mitigation during data collection.
Abstract: As low-resourced languages are increasingly incorporated into NLP research, there is an emphasis on collecting large-scale datasets. But in prioritizing quantity over quality, we risk 1) building language technologies that perform poorly for these languages and 2) producing harmful content that perpetuates societal biases. In this paper, we investigate the quality of Machine Translation (MT) datasets for three low-resourced languages (Afan Oromo, Amharic, and Tigrinya), with a focus on the gender representation in the datasets. Our findings demonstrate that while training data has a large representation of political and religious domain text, benchmark datasets are focused on news, health, and sports. We also found a large skew towards the male gender in names of persons, the grammatical gender of verbs, and in stereotypical depictions in the datasets. Further, we found harmful and toxic depictions against women, which were more prominent for the language with the largest amount of data, underscoring that quantity does not guarantee quality. We hope that our work inspires further inquiry into the datasets collected for low-resourced languages and prompts early mitigation of harmful content. WARNING: This paper contains discussion of NSFW content that some may find disturbing.
[8] GRAD: Graph-Retrieved Adaptive Decoding for Hallucination Mitigation
Manh Nguyen,Sunil Gupta,Dai Do,Hung Le
Main category: cs.CL
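TL;DR: Proposes Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that uses corpus-derived evidence to reduce LLM hallucination without retraining. It builds a sparse token transition graph and adaptively fuses graph-retrieved logits with model logits, improving accuracy and factuality on several question-answering benchmarks.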
Details
Motivation: Existing hallucination mitigation methods rely on external knowledge sources and are either fragile or costly in retrieval, so a lightweight and general decoding-time intervention is needed.
Method: GRAD builds a sparse token transition graph by accumulating next-token logits across a small retrieved corpus. During decoding, the graph-retrieved logits are max-normalized and adaptively fused with the model's own logits to favour high-evidence continuations.
Result: Across three models and several question-answering benchmarks (intrinsic and extrinsic hallucination, and factuality), GRAD achieves up to 9.7% higher intrinsic accuracy, 8.6% lower hallucination rates, and 6.9% greater correctness than greedy decoding, and the highest truth-informativeness product score among all methods.
Conclusion: GRAD is a lightweight, plug-and-play alternative to contrastive decoding and knowledge graph augmentation, indicating that statistical evidence from corpus-level token transitions can steer generation toward more truthful and verifiable outputs.
Abstract: Hallucination mitigation remains a persistent challenge for large language models (LLMs), even as model scales grow. Existing approaches often rely on external knowledge sources, such as structured databases or knowledge graphs, accessed through prompting or retrieval. However, prompt-based grounding is fragile and domain-sensitive, while symbolic knowledge integration incurs heavy retrieval and formatting costs. Motivated by knowledge graphs, we introduce Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that grounds generation in corpus-derived evidence without retraining. GRAD constructs a sparse token transition graph by accumulating next-token logits across a small retrieved corpus in a single forward pass. During decoding, graph-retrieved logits are max-normalized and adaptively fused with model logits to favor high-evidence continuations while preserving fluency. Across three models and a range of question-answering benchmarks spanning intrinsic, extrinsic hallucination, and factuality tasks, GRAD consistently surpasses baselines, achieving up to 9.7% higher intrinsic accuracy, 8.6% lower hallucination rates, and 6.9% greater correctness compared to greedy decoding, while attaining the highest truth-informativeness product score among all methods. GRAD offers a lightweight, plug-and-play alternative to contrastive decoding and knowledge graph augmentation, demonstrating that statistical evidence from corpus-level token transitions can effectively steer generation toward more truthful and verifiable outputs.
[9] Context informs pragmatic interpretation in vision-language models
Alvin Wei Ming Tan,Ben Prystawski,Veronica Boyce,Michael C. Frank
Main category: cs.CL
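TL;DR: Studies humans and vision-language models on iterated reference games, finding that relevant context substantially improves model performance, while few-shot reference games with abstract referents remain challenging for models.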
Details
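Motivation: To examine agents' ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments, and in particular how humans and models differ on iterated reference games.
Method: Humans and vision-language models are tested on trials from iterated reference games while the amount, order, and relevance of the given context are varied.
Result: Without relevant context, models perform above chance but substantially worse than humans. With relevant context, model performance increases dramatically over trials.
Conclusion: Relevant context is critical for model performance, but current machine learning models remain limited on few-shot reference tasks with abstract referents.
Abstract: Iterated reference games, in which players repeatedly pick out novel referents using language, present a test case for agents' ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. We tested humans and vision-language models on trials from iterated reference games, varying the given context in terms of amount, order, and relevance. Without relevant context, models were above chance but substantially worse than humans. However, with relevant context, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain a difficult task for machine learning models.
[10] The Human Flourishing Geographic Index: A County-Level Dataset for the United States, 2013-2023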
Stefano M. Iacus,Devika Jain,Andrea Nasuto,Giuseppe Porro,Marcello Carammia,Andrea Vezzulli
Main category: cs.CL
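TL;DR: Introduces the Human Flourishing Geographic Index (HFGI), built from about 2.6 billion U.S. tweets classified with large language models across 48 indicators, providing monthly and yearly county- and state-level data on flourishing-related discourse.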
Details
Motivation: Existing measures of human flourishing lack fine spatial and temporal resolution, and economic indicators alone do not fully capture societal well-being. A high-resolution, multidimensional way to quantify flourishing is needed.
Method: About 2.6 billion geolocated U.S. tweets from 2013-2023 are classified with fine-tuned large language models to identify flourishing-related expressions across 48 indicators aligned with Harvard's Global Flourishing Study framework, producing county- and state-level time series.
Result: The resulting HFGI shows good construct validity and the expected correlations with established indicators, and provides a dataset of U.S. flourishing-related discourse over the past decade at high spatial and temporal resolution.
Conclusion: HFGI is a new resource for multidisciplinary research on well-being, inequality, and social change, and illustrates the potential of social media data for monitoring and understanding the dynamics of human flourishing.
Abstract: Quantifying human flourishing, a multidimensional construct including happiness, health, purpose, virtue, relationships, and financial stability, is critical for understanding societal well-being beyond economic indicators. Existing measures often lack fine spatial and temporal resolution. Here we introduce the Human Flourishing Geographic Index (HFGI), derived from analyzing approximately 2.6 billion geolocated U.S. tweets (2013-2023) using fine-tuned large language models to classify expressions across 48 indicators aligned with Harvard's Global Flourishing Study framework plus attitudes towards migration and perception of corruption. The dataset offers monthly and yearly county- and state-level indicators of flourishing-related discourse, validated to confirm that the measures accurately represent the underlying constructs and show expected correlations with established indicators. This resource enables multidisciplinary analyses of well-being, inequality, and social change at unprecedented resolution, offering insights into the dynamics of human flourishing as reflected in social media discourse across the United States over the past decade.
[11] Direct Semantic Communication Between Large Language Models via Vector Translation
Fu-Chun Yang,Jason Eshraghian
Main category: cs.CL
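TL;DR: Proposes forming a latent semantic bridge between models through vector translation, enabling direct semantic exchange across models and more efficient information transfer.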
Details
Motivation: In multi-agent interaction, existing approaches pass only tokens between models, discarding most latent semantics, which constrains information transfer and adds computational overhead.
Method: A dual-encoder translator learns a mapping between the representation spaces of Llama-2-7B and Mistral-7B-Instruct, and the translated vectors are injected into the target model at 30 percent blending strength.
Result: The translator reaches an average cosine alignment of 0.538. Bidirectional evaluation shows a 2.01:1 transfer asymmetry, indicating that general-purpose models yield more transferable representations than instruction-tuned variants.
Conclusion: Cross-model latent communication is feasible and can share semantics while preserving computational stability, which offers a new route for collaborative AI systems.
Abstract: In multi-agent settings, such as debate, reflection, or tool-calling, large language models (LLMs) pass messages as plain tokens, discarding most latent semantics. This constrains information transfer and adds unnecessary computational overhead. We form a latent bridge via vector translations, which use learned mappings that enable direct semantic exchange between representation spaces. A dual-encoder translator trained between Llama-2-7B and Mistral-7B-Instruct attains an average cosine alignment of 0.538. Injecting the translated vectors at 30 percent blending strength steers the target model's generation without destabilizing logits. Bidirectional evaluation shows a 2.01:1 transfer asymmetry, indicating that general-purpose models yield more transferable representations than instruction-tuned variants. This conservative injection preserves computational stability while demonstrating that cross-model latent communication is feasible, enabling collaborative AI systems that share meaning rather than tokens.
[12] Abductive Inference in Retrieval-Augmented Language Models: Generating and Validating Missing Premises
Shiyin Lin
Main category: cs.CL
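TL;DR: Proposes a framework that integrates abductive inference into retrieval-augmented large language models, generating and validating missing premises to compensate for insufficient evidence and improving answer accuracy and reasoning faithfulness.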
Details
Motivation: When retrieved evidence is incomplete, existing Retrieval-Augmented Generation (RAG) systems are prone to gaps in their reasoning, so a principled way to fill those gaps is needed.
Method: The method detects insufficient evidence, generates candidate missing premises, and validates them through consistency and plausibility checks.
Result: Experiments on abductive reasoning and multi-hop QA benchmarks show improved answer accuracy and reasoning faithfulness.
Conclusion: Abductive inference is a promising direction for making RAG systems more robust and explainable.
Abstract: Large Language Models (LLMs) enhanced with retrieval, commonly referred to as Retrieval-Augmented Generation (RAG), have demonstrated strong performance in knowledge-intensive tasks. However, RAG pipelines often fail when retrieved evidence is incomplete, leaving gaps in the reasoning process. In such cases, abductive inference (the process of generating plausible missing premises to explain observations) offers a principled approach to bridge these gaps. In this paper, we propose a framework that integrates abductive inference into retrieval-augmented LLMs. Our method detects insufficient evidence, generates candidate missing premises, and validates them through consistency and plausibility checks. Experimental results on abductive reasoning and multi-hop QA benchmarks show that our approach improves both answer accuracy and reasoning faithfulness. This work highlights abductive inference as a promising direction for enhancing the robustness and explainability of RAG systems.
[13] WST: Weakly Supervised Transducer for Automatic Speech Recognition
Dongji Gao,Chenda Liao,Changliang Liu,Matthew Wiesner,Leibny Paola Garcia,Daniel Povey,Sanjeev Khudanpur,Jian Wu
Main category: cs.CL
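TL;DR: Proposes a Weakly Supervised Transducer (WST) that maintains automatic speech recognition performance when transcripts contain many errors and outperforms existing CTC-based weakly supervised methods.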
Details
Motivation: RNN-T models for end-to-end speech recognition depend on large amounts of high-quality annotated data, which are costly and hard to obtain, so reliance on accurate annotation needs to be reduced.
Method: WST integrates a flexible training graph that robustly handles transcript errors without additional confidence estimation or auxiliary pre-trained models.
Result: On synthetic and industrial datasets, WST maintains performance with transcription error rates of up to 70% and outperforms existing weakly supervised methods such as BTC and OTC.
Conclusion: WST is practical and robust in realistic ASR settings and may reduce dependence on high-quality annotated data.
Abstract: The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.
[14] T-FIX: Text-Based Explanations with Features Interpretable to eXperts
Shreya Havaldar,Helen Jin,Chaehyeon Kim,Anton Xue,Weiqiu You,Marco Gatti,Bhuvnesh Jain,Helen Qu,Daniel A Hashimoto,Amin Madani,Rajat Deo,Sameed Ahmed M. Khatana,Gary E. Weissman,Lyle Ungar,Eric Wong
Main category: cs.CL
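TL;DR: Proposes T-FIX, a benchmark for evaluating how well explanations generated by large language models in knowledge-intensive domains align with expert judgment.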
Details
Motivation: Existing evaluation schemes focus on the plausibility or internal faithfulness of explanations and do not capture whether the content matches expert intuition, which matters most in specialised domains that require expert-level reasoning.
Method: In collaboration with domain experts, the authors build the T-FIX benchmark across seven knowledge-intensive domains and develop new metrics that measure the alignment of LLM explanations with expert judgment.
Result: A cross-domain evaluation benchmark, T-FIX, together with new metrics that quantify expert alignment.
Conclusion: Expert alignment is an important criterion for judging the quality of LLM explanations, and T-FIX gives explainability research an evaluation framework closer to practical needs.
Abstract: As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users expect not just answers, but also meaningful explanations for those answers. In these settings, users are often domain experts (e.g., doctors, astrophysicists, psychologists) who require explanations that reflect expert-level reasoning. However, current evaluation schemes primarily emphasize plausibility or internal faithfulness of the explanation, which fail to capture whether the content of the explanation truly aligns with expert intuition. We formalize expert alignment as a criterion for evaluating explanations with T-FIX, a benchmark spanning seven knowledge-intensive domains. In collaboration with domain experts, we develop novel metrics to measure the alignment of LLM explanations with expert judgment.
[15] Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering
Xinying Qian,Ying Zhang,Yu Zhao,Baohang Zhou,Xuhui Sui,Xiaojie Yuan
Main category: cs.CL
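TL;DR: Proposes PoK, a framework that combines knowledge planning with contrastive temporal retrieval to improve the reasoning accuracy and interpretability of large language models in temporal knowledge graph question answering.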
Details
Motivation: Existing methods for temporal knowledge graph question answering do not fully capture the complex semantics of time constraints, and large language models suffer from hallucination and missing knowledge, which limits their temporal reasoning.
Method: The Plan of Knowledge module decomposes a complex temporal question into a sequence of sub-objectives. In parallel, a Temporal Knowledge Store (TKS) with a contrastive retrieval framework retrieves semantically and temporally aligned facts. Reasoning combines the structured plan with the retrieved knowledge.
Result: Experiments on four benchmark datasets show that PoK significantly improves retrieval precision and reasoning accuracy, surpassing state-of-the-art methods by up to 56.0%.
Conclusion: By combining structured planning with contrastive retrieval, PoK improves the factual consistency and interpretability of LLMs in temporal knowledge graph question answering.
Abstract: Temporal Knowledge Graph Question Answering (TKGQA) aims to answer time-sensitive questions by leveraging factual information from Temporal Knowledge Graphs (TKGs). While previous studies have employed pre-trained TKG embeddings or graph neural networks to inject temporal knowledge, they fail to fully understand the complex semantic information of time constraints. Recently, Large Language Models (LLMs) have shown remarkable progress, benefiting from their strong semantic understanding and reasoning generalization capabilities. However, their temporal reasoning ability remains limited. LLMs frequently suffer from hallucination and a lack of knowledge. To address these limitations, we propose the Plan of Knowledge framework with a contrastive temporal retriever, which is named PoK. Specifically, the proposed Plan of Knowledge module decomposes a complex temporal question into a sequence of sub-objectives from the pre-defined tools, serving as intermediate guidance for reasoning exploration. In parallel, we construct a Temporal Knowledge Store (TKS) with a contrastive retrieval framework, enabling the model to selectively retrieve semantically and temporally aligned facts from TKGs. By combining structured planning with temporal knowledge retrieval, PoK effectively enhances the interpretability and factual consistency of temporal reasoning. Extensive experiments on four benchmark TKGQA datasets demonstrate that PoK significantly improves the retrieval precision and reasoning accuracy of LLMs, surpassing the performance of the state-of-the-art TKGQA methods by up to 56.0%.
[16] The truth is no diaper: Human and AI-generated associations to emotional words
Špela Vintar,Jan Jona Javoršek
Main category: cs.CL
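TL;DR: Compares humans and large language models on word associations to emotional words, finding moderate overlap; LLM associations are more predictable, less creative, and tend to amplify the emotional load.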
Details
Motivation: To examine whether large language models generate word associations in a way similar to humans, particularly for emotional words, as a window into their creativity and ability to simulate cognition.
Method: Associations produced by human participants and by large language models in response to emotional words are compared in terms of association patterns, emotional intensity, and creativity.
Result: The overlap between human and LLM associations is moderate. LLM associations are more predictable and less creative, and tend to amplify the emotional load of the stimulus.
Conclusion: LLMs show some human-like behaviour in word association but differ notably from humans in creativity and emotional processing, which points to their limitations.
Abstract: Human word associations are a well-known method of gaining insight into the internal mental lexicon, but the responses spontaneously offered by human participants to word cues are not always predictable as they may be influenced by personal experience, emotions or individual cognitive styles. The ability to form associative links between seemingly unrelated concepts can be the driving mechanism of creativity. We compare the associative behaviour of humans with that of large language models. More specifically, we explore associations to emotionally loaded words and try to determine whether large language models generate associations in a similar way to humans. We find that the overlap between humans and LLMs is moderate, but also that the associations of LLMs tend to amplify the underlying emotional load of the stimulus, and that they tend to be more predictable and less creative than human ones.
[17] Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods
Eva Prakash,Maayane Attias,Pierre Chambon,Justin Xu,Steven Truong,Jean-Benoit Delbrouck,Tessa Cook,Curtis Langlotz
Main category: cs.CL
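TL;DR: Scales up training data for a transformer-based radiology report de-identification model and shows on multiple datasets that it outperforms existing academic and commercial systems.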
Details
Motivation: To improve the accuracy and generalization of automated detection and de-identification of protected health information (PHI) in radiology reports, in support of safe sharing and processing of clinical text across institutions.
Method: A state-of-the-art transformer-based pipeline is fine-tuned on two large annotated radiology corpora from Stanford University, with a new AGE category added. Synthetic PHI is generated with a "hide-in-plain-sight" method for evaluation. The model is evaluated on datasets from Stanford and the University of Pennsylvania and compared with commercial cloud services.
Result: The model achieves F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset. On synthetic Penn reports it outperforms all vendor systems (F1: 0.960 vs. 0.632-0.754). Synthetic PHI detection is stable (F1: 0.959) and data utility is preserved.
Conclusion: A transformer model trained on diverse radiology data performs better at PHI detection than prior systems and sets a new benchmark for secure clinical text processing.
Abstract: Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a "hide-in-plain-sight" method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.
[18] A Characterization of List Language Identification in the Limit
Moses Charikar,Chirag Pabbaraju,Ambuj Tewari
Main category: cs.CL
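TL;DR: This paper studies language identification in the limit, proposes a new framework of $k$-list identification, gives an exact characterization of the language collections that can be identified, and establishes deep connections to classical results as well as convergence rates in the statistical setting.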
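Stated symbolically, the decomposition result reported for this paper reads as follows. The symbols $\mathcal{C}$ and $\mathcal{C}_i$ are notation introduced here for illustration and are not taken from the paper.

$$
\mathcal{C} \text{ is } k\text{-list identifiable in the limit}
\;\iff\;
\mathcal{C} = \mathcal{C}_1 \cup \dots \cup \mathcal{C}_k
\text{ with each } \mathcal{C}_i \text{ identifiable in the limit (list size } 1\text{)}.
$$

For $k = 1$ this reduces to ordinary identification in the limit, the case covered by Angluin's characterization.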
Details
Motivation: Inspired by recent positive results on language generation, the paper revisits whether languages can be identified in the limit once the learner is allowed multiple guesses.
Method: Building on Angluin's classical characterization, the paper introduces a recursive version of it to analyze $k$-list identification theoretically, and extends the analysis to identification rates in the statistical setting.
Result: Gives an exact characterization of $k$-list identifiable collections: a collection is $k$-list identifiable if and only if it can be decomposed into $k$ sub-collections, each identifiable on its own. In the statistical setting, a $k$-list identifiable collection can be identified at an exponential rate, whereas a collection that is not $k$-list identifiable cannot be identified at any rate that goes to zero.
Conclusion: $k$-list identifiability depends on whether the collection decomposes into $k$ classically identifiable sub-collections, and in the statistical setting the exponential rate is optimal.
Abstract: We study the problem of language identification in the limit, where given a sequence of examples from a target language, the goal of the learner is to output a sequence of guesses for the target language such that all the guesses beyond some finite time are correct. Classical results of Gold showed that language identification in the limit is impossible for essentially any interesting collection of languages. Later, Angluin gave a precise characterization of language collections for which this task is possible. Motivated by recent positive results for the related problem of language generation, we revisit the classic language identification problem in the setting where the learner is given the additional power of producing a list of $k$ guesses at each time step. The goal is to ensure that beyond some finite time, one of the guesses is correct at each time step. We give an exact characterization of collections of languages that can be $k$-list identified in the limit, based on a recursive version of Angluin's characterization (for language identification with a list of size $1$). This further leads to a conceptually appealing characterization: A language collection can be $k$-list identified in the limit if and only if the collection can be decomposed into $k$ collections of languages, each of which can be identified in the limit (with a list of size $1$). We also use our characterization to establish rates for list identification in the statistical setting where the input is drawn as an i.i.d. stream from a distribution supported on some language in the collection.
Our results show that if a collection is $k$-list identifiable in the limit, then the collection can be $k$-list identified at an exponential rate, and this is best possible. On the other hand, if a collection is not $k$-list identifiable in the limit, then it cannot be $k$-list identified at any rate that goes to zero.
[19] Batch Prompting Suppresses Overthinking Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models
Wenmo Qiu,Saurabh Srivastava
Main category: cs.CL
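TL;DR: Batching not only improves the inference efficiency of large reasoning models but also regularizes model behavior by suppressing overthinking and reducing hedging language, and it exhibits a collective generalization effect across examples in a batch.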
Details
Motivation: To explore the benefits of batching in LLM inference beyond throughput optimization, in particular its regularizing effect on multi-step reasoning.
Method: A comprehensive study across 13 diverse benchmarks, analyzing the effect of batching on accuracy, reasoning-token usage, and model behavior.
Result: Batching improves accuracy while cutting reasoning-token usage by 3x-5x, suppresses overthinking and hedging language, and gives rise to collective generalization across examples in the same batch.
Conclusion: Batching is not just an inference acceleration technique but an effective inference-time regularizer that improves the efficiency and reliability of large reasoning models.
Abstract: Recent work has explored batch prompting as a strategy to amortize inference cost in large language models (LLMs). In this paper, we show that batching offers an additional, underappreciated benefit: it regularizes model behavior during multi-step reasoning for Large Reasoning Models (LRMs). We conduct a comprehensive study across 13 diverse benchmarks and observe that batching improves accuracy while substantially reducing reasoning token usage, often by 3x-5x. Through detailed behavioral analysis, we find that batching suppresses overthinking, reduces hedging language (e.g., repetitive self-corrections), and encourages more decisive answers. Surprisingly, we also observe emergent collective effects in batched inference: models often generalize patterns from earlier examples to solve harder ones in the same batch. These findings position batching not just as a throughput optimization, but as a powerful inference-time regularizer for more efficient and reliable LLM reasoning.
[20] RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning
Xinyuan Li,Murong Xu,Wenbiao Tao,Hanlun Zhu,Yike Zhao,Jipeng Zhang,Yunshi Lan
Main category: cs.CL
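TL;DR: Proposes RIDE, a framework that uses Item Response Theory (IRT) and reinforcement learning to generate more challenging variants of mathematical problems, enabling a more rigorous evaluation of LLMs' mathematical reasoning.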
Details
Motivation: Existing rule-based adversarial perturbation methods often produce ill-posed questions, make it hard to assess question difficulty systematically, and may overestimate models' true mathematical reasoning ability; a more rigorous way to measure robustness is therefore needed.
Method: RIDE uses 35 LLMs to simulate student responses and builds an IRT-based difficulty ranker; reinforcement learning then guides a question-rewriting model to generate new, well-posed questions across difficulty levels.
Result: Applying RIDE to competition-level mathematical benchmarks lowers the performance of 26 advanced LLMs by 21.73% on average, indicating that current models' mathematical reasoning lacks robustness.
Conclusion: RIDE effectively generates high-quality, more challenging mathematical problems, providing a reliable and scalable method for evaluating LLMs' true reasoning ability.
Abstract: Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. To this end, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade advanced LLM performance, with experiments showing an average 21.73% drop across 26 models, thereby exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.
[21] CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese
Dazhong Chen,Yi-Cheng Lin,Yuchen Huang,Ziwei Gong,Di Jiang,Zeying Xie,Yi R. Fung
Main category: cs.CL
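TL;DR: Proposes CantoASR, a collaborative ASR-LALM error correction framework that combines forced alignment, a LoRA-finetuned Whisper, and an instruction-tuned Qwen-Audio, substantially improving low-resource Cantonese speech recognition.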
Details
Motivation: Low-resource Cantonese ASR is hampered by scarce annotated data, six lexical tones, tone sandhi, and accent variation, and existing models such as Whisper suffer from high word error rates.
Method: CantoASR uses forced alignment to extract acoustic features, fine-tunes Whisper with LoRA to improve tone discrimination, and applies an instruction-tuned Qwen-Audio for prosody-aware error correction.
Result: Experiments on spontaneous Cantonese data show that CantoASR substantially reduces character error rate (CER) relative to Whisper-Large-V3.
Conclusion: A collaborative framework that combines explicit acoustic cues with LALM reasoning offers a scalable and effective approach to ASR for low-resource tonal languages and dialects.
Abstract: Automatic speech recognition (ASR) is critical for language accessibility, yet low-resource Cantonese remains challenging due to limited annotated data, six lexical tones, tone sandhi, and accent variation. Existing ASR models, such as Whisper, often suffer from high word error rates. Large audio-language models (LALMs), in contrast, can leverage broader contextual reasoning but still require explicit tonal and prosodic acoustic cues. We introduce CantoASR, a collaborative ASR-LALM error correction framework that integrates forced alignment for acoustic feature extraction, a LoRA-finetuned Whisper for improved tone discrimination, and an instruction-tuned Qwen-Audio for prosody-aware correction. Evaluations on spontaneous Cantonese data show substantial CER gains over Whisper-Large-V3. These findings suggest that integrating acoustic cues with LALM reasoning provides a scalable strategy for low-resource tonal and dialectal ASR.
[22] BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation
Fahim Ahmed,Md Mubtasim Ahasan,Jahir Sadik Monon,Muntasir Wahed,M Ashraful Amin,A K M Mahbubur Rahman,Amin Ahsan Ali
Main category: cs.CL
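TL;DR: Explores three multi-agent LLM pipelines for improving small-model performance on Text-to-SQL; experiments show that multi-agent discussion and a reasoner-coder architecture substantially improve execution accuracy.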
Details
Motivation: Existing LLMs generate SQL poorly when faced with large schemas and complex reasoning, and most prior work focuses on complex, impractical pipelines while overlooking the potential of small, efficient models.
Method: Proposes and evaluates three multi-agent LLM pipelines (multi-agent discussion, Planner-Coder, and Coder-Aggregator), with systematic benchmarking across a range of open-source models.
Result: Multi-agent discussion raises the execution accuracy of small models (e.g., Qwen2.5-7b-Instruct) by up to 10.6%; the LLM Reasoner-Coder pipeline performs best, with DeepSeek-R1-32B and QwQ-32B as planners raising Gemma 3 27B IT accuracy from 52.4% to 56.4%.
Conclusion: Multi-agent collaboration effectively improves small-model performance on Text-to-SQL, with the Reasoner-Coder architecture showing a clear advantage, offering a practical and scalable route to efficient SQL generation.
Abstract: Text-to-SQL systems provide a natural language interface that can enable even laymen to access information stored in databases. However, existing Large Language Models (LLMs) struggle with SQL generation from natural instructions due to large schema sizes and complex reasoning. Prior work often focuses on complex, somewhat impractical pipelines using flagship models, while smaller, efficient models remain overlooked. In this work, we explore three multi-agent LLM pipelines, with systematic performance benchmarking across a range of small to large open-source models: (1) Multi-agent discussion pipeline, where agents iteratively critique and refine SQL queries, and a judge synthesizes the final answer; (2) Planner-Coder pipeline, where a thinking model planner generates stepwise SQL generation plans and a coder synthesizes queries; and (3) Coder-Aggregator pipeline, where multiple coders independently generate SQL queries, and a reasoning agent selects the best query. Experiments on the Bird-Bench Mini-Dev set reveal that Multi-Agent discussion can improve small model performance, with up to 10.6% increase in Execution Accuracy for Qwen2.5-7b-Instruct seen after three rounds of discussion. Among the pipelines, the LLM Reasoner-Coder pipeline yields the best results, with DeepSeek-R1-32B and QwQ-32B planners boosting Gemma 3 27B IT accuracy from 52.4% to the highest score of 56.4%.
Code is available at https://github.com/treeDweller98/bappa-sql.
[23] Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains
Mohammed Musthafa Rafi,Adarsh Krishnamurthy,Aditya Balu
Main category: cs.CL
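TL;DR: Proposes LAAC (LLM as a Communicator), a paradigm that captures the sender's intent through structured dialogue and facilitates genuine knowledge exchange, addressing the inflate-then-compress cycle caused by AI-generated content, and systematically evaluates its trustworthiness requirements in terms of information fidelity, reproducibility, and query response integrity.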
Details
Motivation: The flood of AI-generated content has created an "inflation-compression" cycle in communication, in which neither party engages with authentic content and knowledge is not effectively conveyed.
Method: Proposes the LAAC multi-agent architecture, which extracts sender intent through structured dialogue, and experimentally evaluates trustworthiness along three dimensions (information capture fidelity, reproducibility, and query response integrity) across multiple communication scenarios.
Result: Experiments reveal measurable trust gaps in current LAAC for high-stakes communication scenarios, particularly in intent-extraction accuracy, knowledge consistency, and avoidance of hallucination.
Conclusion: LLMs acting as communication intermediaries must meet strict trustworthiness standards; information fidelity and stability need to improve before reliable deployment is possible.
Abstract: The proliferation of AI-generated content has created an absurd communication theater where senders use LLMs to inflate simple ideas into verbose content, recipients use LLMs to compress them back into summaries, and as a consequence neither party engages with authentic content. LAAC (LLM as a Communicator) proposes a paradigm shift - positioning LLMs as intelligent communication intermediaries that capture the sender's intent through structured dialogue and facilitate genuine knowledge exchange with recipients. Rather than perpetuating cycles of AI-generated inflation and compression, LAAC enables authentic communication across diverse contexts including academic papers, proposals, professional emails, and cross-platform content generation. However, deploying LLMs as trusted communication intermediaries raises critical questions about information fidelity, consistency, and reliability. This position paper systematically evaluates the trustworthiness requirements for LAAC's deployment across multiple communication domains. We investigate three fundamental dimensions: (1) Information Capture Fidelity - accuracy of intent extraction during sender interviews across different communication types, (2) Reproducibility - consistency of structured knowledge across multiple interaction instances, and (3) Query Response Integrity - reliability of recipient-facing responses without hallucination, source conflation, or fabrication. Through controlled experiments spanning multiple LAAC use cases, we assess these trust dimensions using LAAC's multi-agent architecture.
Preliminary findings reveal measurable trust gaps that must be addressed before LAAC can be reliably deployed in high-stakes communication scenarios.
[24] Computational Turing Test Reveals Systematic Differences Between Human and AI Language
Nicolò Pagan,Petter Törnberg,Christopher A. Bail,Anikó Hannák,Christopher Barrie
Main category: cs.CL
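TL;DR: Proposes a computational Turing test framework for assessing how closely LLM-generated text resembles human language, and compares nine open-weight LLMs under different calibration strategies. Even after calibration, LLM outputs remain clearly distinguishable from human text, and there is a trade-off between human-likeness and semantic fidelity.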
Details
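Motivation: Studies that use LLMs to simulate human behavior mostly assume the models can generate realistic, human-like text, but this assumption lacks reliable validation. Traditional methods that rely on human judgment are imprecise, leaving the field without effective tools for assessing and calibrating the realism of LLM-generated content.
Method: Proposes a computational Turing test framework that combines aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns), and systematically compares nine open-weight LLMs across five calibration strategies (including fine-tuning, stylistic prompting, and context retrieval) on their ability to reproduce user interactions on X, Bluesky, and Reddit.
Result: Even after calibration, LLM outputs remain clearly distinguishable from human text, especially in affective tone and emotional expression; instruction-tuned models underperform their base counterparts, and larger model size does not improve human-likeness; optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa.
Conclusion: The study provides a scalable framework for validating and calibrating LLM simulations, exposes current limitations of LLMs in simulating human communication, and cautions researchers about using LLMs for behavioral simulation in social science research.
Abstract: Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations -- testing whether humans can distinguish AI from human output -- despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies -- including fine-tuning, stylistic prompting, and context retrieval -- benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness.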
Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations -- and offer a cautionary note about their current limitations in capturing human communication.
[25] LLM-as-a-Judge is Bad, Based on AI Attempting the Exam Qualifying for the Member of the Polish National Board of Appeal
Michał Karp,Anna Kubaszewska,Magdalena Król,Robert Król,Aleksander Smywiński-Pohl,Mateusz Szymański,Witold Wydmański
Main category: cs.CL
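TL;DR: This study evaluates whether current large language models (LLMs) can pass the official qualifying examination for Poland's National Appeal Chamber. Although the LLMs performed acceptably on the knowledge test, none met the passing threshold in the practical written part, and the "LLM-as-a-judge" evaluations diverged from the official grading, indicating that current LLMs cannot yet replace human judges or independent examiners.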
Details
Motivation: To explore the potential of LLMs in professional legal examinations and assess their feasibility and limitations in real legal settings.
Method: LLMs sit the exam as candidates, and the 'LLM-as-a-judge' approach is used to evaluate the generated answers automatically; a hybrid information retrieval and extraction pipeline is built, and several LLMs are tested in closed-book and various retrieval-augmented generation (RAG) settings.
Result: The LLMs scored satisfactorily on the multiple-choice knowledge test, but none reached the passing threshold on the practical written judgment, and the models' automatic evaluations often disagreed with the official examining committee.
Conclusion: Because they are prone to hallucination, cite legal provisions incorrectly, and argue weakly, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.
Abstract: This study provides an empirical assessment of whether current large language models (LLMs) can pass the official qualifying examination for membership in Poland's National Appeal Chamber (Krajowa Izba Odwoławcza). The authors examine two related ideas: using LLMs as actual exam candidates and applying the 'LLM-as-a-judge' approach, in which model-generated answers are automatically evaluated by other models. The paper describes the structure of the exam, which includes a multiple-choice knowledge test on public procurement law and a written judgment, and presents the hybrid information recovery and extraction pipeline built to support the models. Several LLMs (including GPT-4.1, Claude 4 Sonnet and Bielik-11B-v2.6) were tested in closed-book and various Retrieval-Augmented Generation settings. The results show that although the models achieved satisfactory scores in the knowledge test, none met the passing threshold in the practical written part, and the evaluations of the 'LLM-as-a-judge' often diverged from the judgments of the official examining committee. The authors highlight key limitations: susceptibility to hallucinations, incorrect citation of legal provisions, weaknesses in logical argumentation, and the need for close collaboration between legal experts and technical teams. The findings indicate that, despite rapid technological progress, current LLMs cannot yet replace human judges or independent examiners in Polish public procurement adjudication.
[26] REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs
Liran Cohen,Yaniv Nemcovesky,Avi Mendelson
Main category: cs.CL
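TL;DR: Proposes REMIND, a new method for evaluating machine unlearning that detects residual memorization of unlearned data by analyzing the dynamics of the model's loss under small input variations.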
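The idea of probing the loss around an input, rather than at a single point, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_loss`, the perturbation functions, and the summary statistics (mean, spread, range) are hypothetical stand-ins, not REMIND's actual perturbations, features, or classifier.

```python
from statistics import mean, pstdev


def neighbourhood_loss_profile(loss_fn, text, perturbations):
    """Summarise how a loss varies over `text` and its perturbed neighbours."""
    neighbours = [text] + [perturb(text) for perturb in perturbations]
    losses = [loss_fn(t) for t in neighbours]
    return {
        "mean": mean(losses),
        "spread": pstdev(losses),           # population std. dev. of the losses
        "range": max(losses) - min(losses), # crude flatness indicator
    }


def toy_loss(text):
    # Hypothetical stand-in for a query to a model's per-example loss.
    return len(text) / 10.0


# Hypothetical "small input variations"; real ones would be paraphrases
# or token-level edits.
perturbations = [str.lower, str.upper, lambda s: s + "!"]
profile = neighbourhood_loss_profile(toy_loss, "Hello", perturbations)
```

Under the finding summarized for this paper, a smaller spread or range would point toward a flatter neighbourhood, which the authors associate with unlearned data; how the signal is actually turned into a decision is not specified here.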
Details
Motivation: Existing unlearning evaluations typically operate at the level of individual inputs and may miss residual influence in semantically similar examples, leading to privacy leakage; a more sensitive and reliable evaluation is needed.
Method: Using only query access to the model, REMIND analyzes the loss landscape in the neighborhood of the target data and uses its smoothness to judge whether unlearning was effective: unlearned data correspond to flatter loss landscapes, while retained or unrelated data show sharper, more volatile patterns.
Result: REMIND outperforms existing methods and remains robust across models, datasets, and paraphrased inputs, identifying residual memorization that conventional methods struggle to detect.
Conclusion: REMIND provides a more sensitive, interpretable, and practical framework for evaluating unlearning in language models and offers a new perspective on verifying machine unlearning.
Abstract: Machine unlearning aims to remove the influence of specific training data from a model without requiring full retraining. This capability is crucial for ensuring privacy, safety, and regulatory compliance. Therefore, verifying whether a model has truly forgotten target data is essential for maintaining reliability and trustworthiness. However, existing evaluation methods often assess forgetting at the level of individual inputs. This approach may overlook residual influence present in semantically similar examples. Such influence can compromise privacy and lead to indirect information leakage. We propose REMIND (Residual Memorization In Neighborhood Dynamics), a novel evaluation method aiming to detect the subtle remaining influence of unlearned data and classify whether the data has been effectively forgotten. REMIND analyzes the model's loss over small input variations and reveals patterns unnoticed by single-point evaluations. We show that unlearned data yield flatter, less steep loss landscapes, while retained or unrelated data exhibit sharper, more volatile patterns. REMIND requires only query-based access, outperforms existing methods under similar constraints, and demonstrates robustness across different models, datasets, and paraphrased inputs, making it practical for real-world deployment. By providing a more sensitive and interpretable measure of unlearning effectiveness, REMIND provides a reliable framework to assess unlearning in language models. As a result, REMIND offers a novel perspective on memorization and unlearning.
[27] Reusing Pre-Training Data at Test Time is a Compute Multiplier
Alex Fang,Thomas Voice,Ruoming Pang,Ludwig Schmidt,Tom Gunter
Main category: cs.CL
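TL;DR: Current pre-training methods do not make full use of the information in their datasets; retrieval-augmented generation and test-time compute yield substantial performance gains, indicating considerable room for improvement in pre-training.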
Details
Motivation: To understand how efficiently current pre-training extracts knowledge from data and whether dataset value is being left unused.
Method: Uses retrieval-augmented generation (RAG) combined with test-time compute to quantify the dataset value left behind by pre-training at different scales, evaluated on MMLU, Math-500, and SimpleQA.
Result: Significant accuracy gains on several benchmarks; retrieval is worth roughly a 5x compute multiplier, and additional test-time compute yields a further 10 percentage point improvement on MMLU for the LLaMA 3.1 8B model.
Conclusion: Current pre-training methods do not fully exploit the information in existing datasets, leaving room for large gains through retrieval and test-time compute.
Abstract: Large language models learn from their vast pre-training corpora, gaining the ability to solve an ever-increasing variety of tasks; yet although researchers work to improve these datasets, there is little effort to understand how efficient the pre-training apparatus is at extracting ideas and knowledge from the data. In this work, we use retrieval augmented generation along with test-time compute as a way to quantify how much dataset value was left behind by the process of pre-training, and how this changes across scale. We demonstrate that pre-training then retrieving from standard and largely open-sourced datasets results in significant accuracy gains in MMLU, Math-500, and SimpleQA, which persist through decontamination. For MMLU we observe that retrieval acts as a ~5x compute multiplier versus pre-training alone. We show that these results can be further improved by leveraging additional compute at test time to parse the retrieved context, demonstrating a 10 percentage point improvement on MMLU for the public LLaMA 3.1 8B model. Overall, our results suggest that today's pre-training methods do not make full use of the information in existing pre-training datasets, leaving significant room for progress.
[28] Efficient Topic Extraction via Graph-Based Labeling: A Lightweight Alternative to Deep Models
Salma Mekaoui,Hiba Sofyan,Imane Amaaz,Imane Benchrif,Arsalane Zarghili,Ilham Chaker,Nikola S. Nikolov
Main category: cs.CL
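TL;DR: Proposes a graph-based method for topic labeling that assigns meaningful labels to topics in unlabeled text through semantic expansion and relationship analysis, achieving results comparable to ChatGPT-3.5 while remaining computationally efficient.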
Details
Motivation: Topic words produced by existing topic modeling methods lack interpretability, and most approaches are computationally expensive; a more efficient and interpretable topic labeling scheme is needed.
Method: Uses a graph structure to expand topic words semantically, adding related terms and analyzing the relationships among words, and derives semantically representative topic labels from the connections in the graph.
Result: Compared on two datasets against several benchmarks (including ChatGPT-3.5), the method outperforms traditional benchmarks on BERTScore and cosine similarity and is comparable to ChatGPT-3.5, with lower computational cost.
Conclusion: The proposed graph method is both efficient and effective for topic labeling, achieving good interpretability at low computational cost and pointing to a feasible direction for improving topic readability and automation.
Abstract: Extracting topics from text has become an essential task, especially with the rapid growth of unstructured textual data. Most existing works rely on computationally intensive methods to address this challenge. In this paper, we argue that probabilistic and statistical approaches, such as topic modeling (TM), can offer effective alternatives that require fewer computational resources. TM is a statistical method that automatically discovers topics in large collections of unlabeled text; however, it produces topics as distributions of representative words, which often lack clear interpretability. Our objective is to perform topic labeling by assigning meaningful labels to these sets of words. To achieve this without relying on computationally expensive models, we propose a graph-based approach that not only enriches topic words with semantically related terms but also explores the relationships among them. By analyzing these connections within the graph, we derive suitable labels that accurately capture each topic's meaning. We present a comparative study between our proposed method and several benchmarks, including ChatGPT-3.5, across two different datasets. Our method achieved consistently better results than traditional benchmarks in terms of BERTScore and cosine similarity and produced results comparable to ChatGPT-3.5, while remaining computationally efficient. Finally, we discuss future directions for topic labeling and highlight potential research avenues for enhancing interpretability and automation.
[29] SSPO: Subsentence-level Policy Optimization
Kun Yang,Zikang Chen,Yanmeng Wang,Zhigen Li
Main category: cs.CL
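TL;DR: Proposes SSPO, which introduces a sentence-level importance ratio and a dynamic clipping mechanism based on sentence entropy to balance the strengths of GRPO and GSPO, improving training stability and data utilization for large language models on reasoning tasks and outperforming existing methods on several datasets.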
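A minimal sketch of what a sentence-level importance ratio could look like, sitting between a per-token ratio (GRPO) and a single per-response ratio (GSPO). The length-normalised (geometric-mean) form, the function names, and the fixed clipping bounds are assumptions for illustration; the paper's exact formula and its entropy-based adjustment of the clipping bounds are not reproduced here.

```python
import math
from collections import defaultdict


def sentence_level_ratios(logp_new, logp_old, sentence_ids):
    """Return one importance ratio per token, shared within each sentence.

    logp_new / logp_old: per-token log-probabilities under the current and
    old policies; sentence_ids: the sentence index of each token.
    """
    if not (len(logp_new) == len(logp_old) == len(sentence_ids)):
        raise ValueError("inputs must have the same length")
    log_ratio_sum = defaultdict(float)
    token_count = defaultdict(int)
    for new, old, sid in zip(logp_new, logp_old, sentence_ids):
        log_ratio_sum[sid] += new - old
        token_count[sid] += 1
    # Assumed form: geometric mean of the token ratios inside each sentence.
    ratio = {sid: math.exp(log_ratio_sum[sid] / token_count[sid])
             for sid in log_ratio_sum}
    return [ratio[sid] for sid in sentence_ids]


def clip_ratio(ratio, eps_low=0.2, eps_high=0.2):
    """PPO-style clipping with fixed (not entropy-adjusted) bounds."""
    return max(1.0 - eps_low, min(1.0 + eps_high, ratio))


ratios = sentence_level_ratios(
    logp_new=[-1.0, -1.0, -2.0],
    logp_old=[-1.0, -1.0, -1.0],
    sentence_ids=[0, 0, 1],
)
```

With this grouping, an outlier token only shifts the ratio of its own sentence, so clipping can discard that sentence without discarding the rest of the response.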
Details
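Motivation: Existing RLVR algorithms such as GRPO and GSPO fall short in training stability or in the utilization of sampled data: GRPO's token-level importance ratio is easily affected by outliers and can cause training collapse, while GSPO reduces variance but shares a single ratio across the whole response, so extreme values distort the overall judgment and data are wasted. A method that offers both stability and data utilization is needed.
Method: Proposes SSPO, which uses a sentence-level importance ratio to strike a balance between the token level and the response level, avoiding training collapse and high variance; it also uses sentence entropy to adjust the clipping bounds of PPO-CLIP dynamically, encouraging high-entropy tokens to explore and narrowing the clipping range of low-entropy tokens, which improves training stability and data utilization.
Result: SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and reaches state-of-the-art performance on three datasets. Experiments show that it improves the utilization of generated data and resolves GSPO's data-discarding problem while keeping training stable.
Conclusion: Through its sentence-level importance ratio and dynamic clipping mechanism, SSPO balances the strengths of GRPO and GSPO, overcomes training instability and low data utilization, and improves LLM performance on reasoning tasks, making it a more efficient and stable post-training strategy within the RLVR framework.
Abstract: As a significant part of post-training of the Large Language Models (LLMs), Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved LLMs' reasoning skills. However, some RLVR algorithms, such as GRPO (Group Relative Policy Optimization) and GSPO (Group Sequence Policy Optimization), are observed to suffer from unstable policy updates and low usage of sampling data, respectively. The importance ratio of GRPO is calculated at the token level, which focuses more on optimizing a single token. This will be easily affected by outliers, leading to model training collapse. GSPO proposed the calculation of the response level importance ratio, which solves the problem of high variance and training noise accumulation in the calculation of the GRPO importance ratio. However, since all the response tokens share a common importance ratio, extreme values can easily raise or lower the overall mean, leading to the entire response being mistakenly discarded, resulting in a decrease in the utilization of sampled data. This paper introduces SSPO, which applies a sentence-level importance ratio, striking a balance between GRPO and GSPO. SSPO not only avoids training collapse and high variance, but also prevents the whole response tokens from being abandoned by the clipping mechanism.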
Furthermore, we apply sentence entropy to PPO-CLIP to steadily adjust the clipping bounds, encouraging high-entropy tokens to explore and narrowing the clipping range of low-entropy tokens. In particular, SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and achieves state-of-the-art performance on three datasets. These results highlight SSPO's effectiveness in leveraging generated data by taking the essence of GSPO but rejecting its shortcomings.
[30] Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning
Mohammad Amin Ghanizadeh,Mohammad Javad Dousti
Main category: cs.CL
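TL;DR: Proposes a data selection method based on a learnability score and a batch selection strategy for fine-tuning machine translation models, substantially improving data efficiency and computational efficiency.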
Details
Motivation: Improving machine translation models requires selecting high-quality training data effectively, which conventional random selection does inefficiently.
Method: Combines a learner model with a pre-trained reference model to define a learnability score that measures the training value of each data point, and selects data with a batch selection strategy that accounts for interdependencies among data points.
Result: Experiments on several language pairs (such as English to Persian) show up to a fivefold improvement in data efficiency over a random baseline, a 24-fold improvement in computational efficiency when using cached embeddings, and better translation performance.
Conclusion: The proposed data selection method substantially improves data utilization, computational efficiency, and generalization when fine-tuning machine translation models.
Abstract: Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to enhance overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy which considers interdependencies among data points, optimizing the efficiency of the training process while maintaining a focus on data relevance. Experiments on English to Persian and several other language pairs using an mBART model fine-tuned on the CCMatrix dataset demonstrate that our method can achieve up to a fivefold improvement in data efficiency compared to an iid baseline. Experimental results indicate that our approach improves computational efficiency by a factor of 24 when utilizing cached embeddings, as it requires fewer training data points. Additionally, it enhances generalization, resulting in superior translation performance compared to the random selection method.
[31] If I Could Turn Back Time: Temporal Reframing as a Historical Reasoning Task for LLMs
Lars Bungum,Charles Yijia Huang,Abeer Kashar
Main category: cs.CL
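TL;DR: This study examines the ability of large language models (LLMs) to perform temporal reasoning set in 1940, using questions from a Norwegian trivia book posed in both English and Norwegian, with answers graded by an LLM judge and spot-checked by a native speaker. English prompts outperformed Norwegian ones, and larger models performed better.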
Details
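Motivation: To investigate whether LLMs can reason accurately from the standpoint of a specific historical moment (such as 1940), and in particular whether they reflect the knowledge of that time correctly in a multilingual setting.
Method: Questions from a 1940 Norwegian trivia book are posed in English and Norwegian to several mainstream LLMs (DeepSeek-R1, Gemma3, Qwen3, Llama3.1, and the largest LLM built specifically for Norwegian), which are asked to answer with the knowledge of 1940; answers are scored automatically via LLM-as-judge and spot-checked by a native Norwegian speaker.
Result: Models prompted in English consistently outperform those prompted in Norwegian; larger models give better results; the large model optimized for Norwegian does not surpass general-purpose large models on the Norwegian task.
Conclusion: Prompt language and parameter count strongly affect a model's reasoning in a historical setting; English prompts may do better because of richer training data, suggesting that temporally consistent reasoning in non-English contexts remains limited in current LLMs.
Abstract: In this study, we experiment with the ability of LLMs to do temporal reasoning. Using a Norwegian book from 1940 containing trivia questions, we prompt the LLMs to answer the questions as if it were 1940. We also pose the questions in both English and Norwegian. Correct answers are often presented as sentences, and grading is done by means of LLM-as-judge, with sampled checks by a native speaker. Prompting in English consistently gave better results than in Norwegian, an unexpected result. In contrast, using larger LLMs improved results. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families, and also the largest available LLM especially crafted for Norwegian.
[32] Probabilistic Textual Time Series Depression Detection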
Fabian Schmidt,Seyedehmoniba Ravan,Vladimir Vlassov
Main category: cs.CL
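TL;DR: Proposes PTTSD, a framework that combines bidirectional LSTMs, self-attention, and residual connections to predict depression severity with high accuracy and well-calibrated uncertainty estimates.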
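For readers unfamiliar with how a probabilistic output head yields uncertainty estimates, the sketch below shows the standard Gaussian negative log-likelihood for a predicted mean and standard deviation, plus the matching normal-theory interval. The function names and example numbers are illustrative assumptions; this is not the authors' implementation, and the Student-t variant is omitted.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of target y under N(mu, sigma^2)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)


def prediction_interval(mu, sigma, z=1.96):
    """Approximate 95% interval for a Gaussian prediction (z = 1.96)."""
    return (mu - z * sigma, mu + z * sigma)


# Hypothetical example: true score 10, predicted mean 8.
honest = gaussian_nll(10.0, 8.0, 2.0)          # admits uncertainty
overconfident = gaussian_nll(10.0, 8.0, 0.5)   # too narrow, penalised more
low, high = prediction_interval(8.0, 2.0)
```

Minimising this loss rewards a model for reporting a larger sigma when its mean is likely to be off, which is what makes the resulting intervals candidates for calibration checks.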
Details
Motivation: Existing models for predicting depression severity usually lack uncertainty estimates and temporal modeling, which limits their interpretability and reliability in clinical decision-making.
Method: Proposes PTTSD, with sequence-to-sequence and sequence-to-one variants that combine bidirectional LSTMs, self-attention, and residual connections with Gaussian or Student-t output heads, trained via negative log-likelihood to model uncertainty.
Result: Achieves state-of-the-art performance among text-only systems on E-DAIC and DAIC-WOZ (e.g., MAE = 3.85 on E-DAIC) and produces well-calibrated prediction intervals; ablations confirm the contribution of attention and probabilistic modeling.
Conclusion: PTTSD delivers accurate and interpretable predictions of depression severity, and its uncertainty-aware modeling has potential for clinical use.
Abstract: Accurate and interpretable predictions of depression severity are essential for clinical decision support, yet existing models often lack uncertainty estimates and temporal modeling. We propose PTTSD, a Probabilistic Textual Time Series Depression Detection framework that predicts PHQ-8 scores from utterance-level clinical interviews while modeling uncertainty over time. PTTSD includes sequence-to-sequence and sequence-to-one variants, both combining bidirectional LSTMs, self-attention, and residual connections with Gaussian or Student-t output heads trained via negative log-likelihood. Evaluated on E-DAIC and DAIC-WOZ, PTTSD achieves state-of-the-art performance among text-only systems (e.g., MAE = 3.85 on E-DAIC, 3.55 on DAIC) and produces well-calibrated prediction intervals. Ablations confirm the value of attention and probabilistic modeling, while comparisons with MentalBERT establish generality. A three-part calibration analysis and qualitative case studies further highlight the interpretability and clinical relevance of uncertainty-aware forecasting.
[33] ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
Surapon Nonesung,Teetouch Jaknamon,Sirinya Chaiophat,Natapong Nitarach,Chanakan Wittayasakpan,Warit Sirichotedumrong,Adisai Na-Thalang,Kunat Pipatanakul
Main category: cs.CL
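TL;DR: ThaiOCRBench is the first comprehensive benchmark for Thai text-rich visual understanding, comprising 2,808 human-annotated samples across 13 task categories, for evaluating multimodal models in low-resource, script-complex settings.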
Details
Motivation: Existing vision-language benchmarks focus mainly on high-resource languages, leaving Thai underrepresented in tasks such as document structure understanding; a dedicated benchmark is needed.
Method: Builds ThaiOCRBench, a high-quality human-annotated dataset of 2,808 samples across 13 task categories, and systematically evaluates a range of state-of-the-art vision-language models (both proprietary and open-source) in a zero-shot setting.
Result: Proprietary models (e.g., Gemini 2.5 Pro) significantly outperform open-source models; open-source models perform worst on fine-grained text recognition and handwritten content extraction; error analysis reveals key challenges including language bias, structural mismatch, and hallucinated content.
Conclusion: ThaiOCRBench provides a standardized framework for evaluating vision-language models on Thai and other low-resource languages, along with actionable insights for improving Thai document understanding.
Abstract: We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.
[34] RUST-BENCH: Benchmarking LLM Reasoning on Unstructured Text within Structured Tables
Nikhil Abhyankar,Purvi Chaurasia,Sanchit Kabra,Ananya Srivastava,Vivek Gupta,Chandan K. Reddy
Main category: cs.CL
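TL;DR: RUST-BENCH is a new benchmark for evaluating LLM reasoning over real, complex tabular data, covering scale, heterogeneity, domain specificity, and reasoning complexity.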
Details
Motivation: Existing tabular reasoning benchmarks mostly test small, homogeneous tables, which neither reflect the complexity of real-world data nor give a full picture of LLM reasoning ability. Method: Builds RUST-BENCH, a benchmark of 7966 questions over 2031 real-world tables from two domains, science (NSF grant records) and sports (NBA statistics), and evaluates models on large-scale, heterogeneous, domain-specific, multi-hop reasoning tasks. Result: Current open-source and proprietary LLMs struggle with heterogeneous schemas and complex multi-hop reasoning, exposing persistent weaknesses in architectures and prompting strategies. Conclusion: RUST-BENCH offers a challenging new testbed for advancing tabular reasoning research. Abstract: Existing tabular reasoning benchmarks mostly test models on small, uniform tables, underrepresenting the complexity of real-world data and giving an incomplete view of Large Language Models' (LLMs) reasoning abilities. Real tables are long, heterogeneous, and domain-specific, mixing structured fields with free text and requiring multi-hop reasoning across thousands of tokens. To address this gap, we introduce RUST-BENCH, a benchmark of 7966 questions from 2031 real-world tables spanning two domains: i) RB-Science (NSF grant records) and ii) RB-Sports (NBA statistics). Unlike prior work, RUST-BENCH evaluates LLMs jointly across scale, heterogeneity, domain specificity, and reasoning complexity. Experiments with open-source and proprietary models show that LLMs struggle with heterogeneous schemas and complex multi-hop inference, revealing persistent weaknesses in current architectures and prompting strategies. RUST-BENCH establishes a challenging new testbed for advancing tabular reasoning research.
[35] OUNLP at TSAR 2025 Shared Task: Multi-Round Text Simplifier via Code Generation
Cuong Huynh,Jie Cao
Main category: cs.CL
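TL;DR: This paper proposes multi-round text simplification methods, with outputs generated via GPT-4o, and finds that the gap between the source and target CEFR levels strongly affects simplification performance.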
Details
Motivation: Inspired by prompt-based text simplification, the work explores how multi-round simplification can improve readability-controlled text simplification. Method: Proposes two multi-round simplification methods, rule-based simplification (MRS-Rule) and joint rule-based LLM simplification (MRS-Joint), with outputs generated via GPT-4o. Result: In the TSAR-2025 shared task the system ranked 7th of 20 teams; later improvements show that starting from LLM-simplified candidates further boosts performance. Conclusion: The gap between source and target CEFR levels strongly affects simplification performance, and multi-round simplification, MRS-Joint in particular, has room for further optimization. Abstract: This paper describes the OUNLP system submitted to the TSAR-2025 Shared Task (Alva-Manchego et al., 2025), designed for readability-controlled text simplification using LLM-prompting-based generation. Based on the analysis of prompt-based text simplification methods, we found that text simplification performance is highly related to the gap between the source CEFR (Arase et al., 2022) level and the target CEFR level. Inspired by this finding, we propose two multi-round simplification methods and generate them via GPT-4o: rule-based simplification (MRS-Rule) and jointly rule-based LLM simplification (MRS-Joint). Our submitted systems ranked 7 out of 20 teams. Later improvements with MRS-Joint show that taking the LLM simplified candidates as the starting point could further boost the multi-round simplification performance.
[36] Decoding Emergent Big Five Traits in Large Language Models: Temperature-Dependent Expression and Architectural Clustering
Christos-Nikolaos Zacharopoulos,Revekka Kyriakoglou
Main category: cs.CL
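TL;DR: This study uses the BFI-2 framework to systematically evaluate Big Five trait expression in six LLMs under varying sampling temperatures. Four of the five traits differ significantly, and Neuroticism and Extraversion are the most sensitive to temperature. Cluster analysis suggests that model architecture may shape trait profiles, offering a new perspective on personality-like behavior in AI, model tuning, and ethical governance.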
Details
Motivation: As LLMs spread through human-centered applications, understanding their personality-like behavior is essential for responsible development and deployment. Method: Applies the Big Five Inventory-2 (BFI-2) framework to systematically assess trait expression in six LLMs under varying sampling temperatures, and uses hierarchical clustering to analyze trait patterns across models. Result: Four of the five personality dimensions show significant differences; Neuroticism and Extraversion are clearly affected by sampling temperature; hierarchical clustering reveals model groupings that follow architectural features, suggesting that architecture may predispose models toward stable trait profiles. Conclusion: LLMs exhibit measurable personality-like traits that are modulated by temperature and model architecture, so these behavior patterns should be considered in model selection, tuning, and AI governance. Abstract: As Large Language Models (LLMs) become integral to human-centered applications, understanding their personality-like behaviors is increasingly important for responsible development and deployment. This paper systematically evaluates six LLMs, applying the Big Five Inventory-2 (BFI-2) framework, to assess trait expressions under varying sampling temperatures. We find significant differences across four of the five personality dimensions, with Neuroticism and Extraversion susceptible to temperature adjustments. Further, hierarchical clustering reveals distinct model clusters, suggesting that architectural features may predispose certain models toward stable trait profiles. Taken together, these results offer new insights into the emergence of personality-like patterns in LLMs and provide a new perspective on model tuning, selection, and the ethical governance of AI systems. We share the data and code for this analysis here: https://osf.io/bsvzc/?view_only=6672219bede24b4e875097426dc3fac1
[37] RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG
Joshua Gao,Quoc Huy Pham,Subin Varghese,Silwal Saurav,Vedhus Hoskere
Main category: cs.CL
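TL;DR: This paper presents RAGalyst, an automated, human-aligned agentic framework for evaluating domain-specific retrieval-augmented generation (RAG) systems. By generating high-quality synthetic QA datasets and optimizing LLM-as-a-Judge metrics, it enables reliable evaluation in domains such as military operations, cybersecurity, and bridge engineering.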
Details
Motivation: Existing RAG evaluation methods struggle to capture domain nuances in specialized, safety-critical fields and lack agreement with human judgment, so a more rigorous and trustworthy evaluation framework is needed. Method: Proposes the RAGalyst framework, which includes an agentic pipeline that generates synthetic QA datasets from source documents and adds a filtering step to ensure data fidelity; it also optimizes the prompts for two LLM-as-a-Judge metrics, Answer Correctness and Answerability, to improve correlation with human annotations. Result: Applying the framework across three different domains shows that RAG performance is highly context-dependent and that no embedding model, LLM, or hyperparameter configuration is universally optimal; the main causes of low answer correctness are also analyzed. Conclusion: RAGalyst offers a systematic evaluation method that helps practitioners uncover domain-specific trade-offs and make better-informed design decisions, leading to more reliable and effective RAG systems. Abstract: Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances, while other works use LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics (Answer Correctness and Answerability) using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent. No single embedding model, LLM, or hyperparameter configuration proves universally optimal. Additionally, we provide an analysis of the most common reasons for low Answer Correctness in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems.
RAGalyst is available on our GitHub.
[38] Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways
Paloma Rabaey,Jong Hak Moon,Jung-Oh Lee,Min Gwan Kim,Hangyul Yoon,Thomas Demeester,Edward Choi
Main category: cs.CL
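TL;DR: This paper proposes a two-part framework for handling explicit and implicit uncertainty in radiology reports. Explicit uncertainty is quantified with an expert-validated, LLM-based method, implicit uncertainty is modeled by expansion along diagnostic pathways, and the result is released as Lunguage++, a fine-grained, structured, uncertainty-aware dataset.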
Details
Motivation: Radiology reports contain both explicit and implicit uncertainty, which degrades the accuracy of automated analysis, and traditional rule-based systems cannot handle it effectively. Method: (1) Builds an expert-validated, LLM-based reference ranking of common hedging phrases and maps them to probability values to quantify explicit uncertainty; (2) models implicit uncertainty with an expansion framework that adds missing sub-findings based on expert-defined pathways for 14 common diagnoses. Result: Releases the Lunguage++ dataset, which represents uncertainty more fully and supports uncertainty-aware image classification and more faithful diagnostic reasoning. Conclusion: The framework models both types of uncertainty in radiology reports and improves the quality and clinical usefulness of structured reports. Abstract: Radiology reports are invaluable for clinical decision-making and hold great potential for automated analysis when structured into machine-readable formats. These reports often contain uncertainty, which we categorize into two distinct types: (i) Explicit uncertainty reflects doubt about the presence or absence of findings, conveyed through hedging phrases. These vary in meaning depending on the context, making rule-based systems insufficient to quantify the level of uncertainty for specific findings; (ii) Implicit uncertainty arises when radiologists omit parts of their reasoning, recording only key findings or diagnoses. Here, it is often unclear whether omitted findings are truly absent or simply unmentioned for brevity. We address these challenges with a two-part framework. We quantify explicit uncertainty by creating an expert-validated, LLM-based reference ranking of common hedging phrases, and mapping each finding to a probability value based on this reference. In addition, we model implicit uncertainty through an expansion framework that systematically adds characteristic sub-findings derived from expert-defined diagnostic pathways for 14 common diagnoses. Using these methods, we release Lunguage++, an expanded, uncertainty-aware version of the Lunguage benchmark of fine-grained structured radiology reports. This enriched resource enables uncertainty-aware image classification, faithful diagnostic reasoning, and new investigations into the clinical impact of diagnostic uncertainty.
[39] Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics
Amir Zur,Atticus Geiger,Ekdeep Singh Lubana,Eric Bigelow
Main category: cs.CL
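TL;DR: The study finds that a language model's hidden activations reflect its uncertainty during reasoning and can predict the generation paths it may take.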
Details
Motivation: To explore whether language models represent alternate reasoning paths while generating text, and how their uncertainty can be quantified. Method: Uses hidden activations to control and predict a language model's uncertainty during chain-of-thought reasoning. Result: A model's uncertainty at different tokens correlates clearly with how easily its activations can be steered, and hidden activations can predict the model's future outcome distribution. Conclusion: Hidden activations implicitly represent the space of possible reasoning paths, and interventions are most effective before the model has committed to a final answer. Abstract: When a language model generates text, the selection of individual tokens might lead it down very different reasoning paths, making uncertainty difficult to quantify. In this work, we consider whether reasoning language models represent the alternate paths that they could take during generation. To test this hypothesis, we use hidden activations to control and predict a language model's uncertainty during chain-of-thought reasoning. In our experiments, we find a clear correlation between how uncertain a model is at different tokens, and how easily the model can be steered by controlling its activations. This suggests that activation interventions are most effective when there are alternate paths available to the model -- in other words, when it has not yet committed to a particular final answer. We also find that hidden activations can predict a model's future outcome distribution, demonstrating that models implicitly represent the space of possible paths.
[40] IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection
Kaveh Eskandari Miandoab,Katharine Kowalyshyn,Kabir Pamnani,Anesu Gavhera,Vasanth Sarathy,Matthias Scheutz
Main category: cs.CL
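TL;DR: IntelliProof is an interactive system that uses large language models (LLMs) to analyze argumentative essays. It structures an essay as an argumentation graph, emphasizes the user experience, and provides visualization, justifications for classifications, and quantitative coherence measures.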
Details
Motivation: Existing automated essay scoring systems lack in-depth analysis of argument structure and are not user-friendly, which makes it hard to assess the logical coherence and argumentative quality of essays. Method: Models an argumentative essay as an argumentation graph in which nodes are claims and edges are supporting or attacking relations, uses an LLM to classify and score relations, and adds visualization and natural-language tools to aid understanding. Result: The system produces visualizations of the argumentation graph, gives justifications for classifications and coherence scores, and supports rapid assessment of argumentative quality while retaining human oversight. Conclusion: IntelliProof combines LLMs with interaction design to make essay structure easier to understand and analyze, bridging the gap between the structural semantics of a text and the user's understanding of it. Abstract: We present IntelliProof, an interactive system for analyzing argumentative essays through LLMs. IntelliProof structures an essay as an argumentation graph, where claims are represented as nodes, supporting evidence is attached as node properties, and edges encode supporting or attacking relations. Unlike existing automated essay scoring systems, IntelliProof emphasizes the user experience: each relation is initially classified and scored by an LLM, then visualized for enhanced understanding. The system provides justifications for classifications and produces quantitative measures for essay coherence. It enables rapid exploration of argumentative quality while retaining human oversight. In addition, IntelliProof provides a set of tools for a better understanding of an argumentative essay and its corresponding graph in natural language, bridging the gap between the structural semantics of argumentative essays and the user's understanding of a given text. A live demo and the system are available here to try: https://intelliproof.vercel.app
[41] From Model to Breach: Towards Actionable LLM-Generated Vulnerabilities Reporting
Cyril Vallez,Alexander Sternfeld,Andrei Kucharavy,Ljiljana Dolamic
Main category: cs.CL
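TL;DR: This paper studies the security vulnerabilities that LLM-based coding assistants introduce into software development. It shows that current mainstream open-weight models remain susceptible to early, well-known vulnerabilities, and proposes new risk metrics, the Prompt Exposure (PE) and Model Exposure (ME) scores, for assessing and mitigating the security risk of LLM-generated code.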
Details
Motivation: As LLMs are widely adopted in coding assistants, the vulnerabilities in the code they generate pose a growing cybersecurity threat. Yet the effect of existing security benchmarks and improvement methods on models in actual use is unclear, so effective assessment and mitigation mechanisms are needed. Method: Proposes the Prompt Exposure (PE) metric, which combines vulnerability severity, generation chance, and the formulation of the prompt that induces the vulnerability; then defines the Model Exposure (ME) score, which measures the overall risk of a model generating vulnerabilities. Result: The latest open-weight LLMs still produce early, well-known vulnerability types in realistic use, suggesting that the safety-functionality trade-off has blocked effective patching; PE and ME quantify a model's security exposure. Conclusion: Current LLM coding assistants carry significant security risk, and new metrics such as PE and ME are needed to prioritize the most severe and prevalent vulnerabilities and improve the security of generated code. Abstract: As the role of Large Language Models (LLM)-based coding assistants in software development becomes more critical, so does the role of the bugs they generate in the overall cybersecurity landscape. While a number of LLM code security benchmarks have been proposed alongside approaches to improve the security of generated code, it remains unclear to what extent they have impacted widely used coding LLMs. Here, we show that even the latest open-weight models are vulnerable in the earliest reported vulnerability scenarios in a realistic use setting, suggesting that the safety-functionality trade-off has until now prevented effective patching of vulnerabilities. To help address this issue, we introduce a new severity metric, Prompt Exposure (PE), that reflects the risk posed by an LLM-generated vulnerability, accounting for vulnerability severity, generation chance, and the formulation of the prompt that induces vulnerable code generation. To encourage the mitigation of the most serious and prevalent vulnerabilities, we use PE to define the Model Exposure (ME) score, which indicates the severity and prevalence of vulnerabilities a model generates.
[42] BanglaMedQA and BanglaMMedBench: Evaluating Retrieval-Augmented Generation Strategies for Bangla Biomedical Question Answering
Sadia Sultana,Saiyma Sittul Muna,Mosammat Zannatul Samarukh,Ajwad Abrar,Tareque Mohmud Chowdhury
Main category: cs.CL
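TL;DR: This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical multiple-choice question datasets, and evaluates several retrieval-augmented generation (RAG) strategies. An Agentic RAG approach that combines textbook and web retrieval reaches 89.54% accuracy with openai/gpt-oss-120b, a substantial improvement for Bangla medical question answering.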
Details
Motivation: Biomedical question answering in low-resource languages is underdeveloped, which limits equitable access to medical knowledge. The work addresses the lack of high-quality Bangla medical QA datasets and effective AI evaluation frameworks, and aims to advance multilingual medical AI. Method: Builds two new Bangla medical MCQ datasets and applies several RAG strategies (Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG), combining OCR-extracted Bangla medical textbooks with web retrieval, and proposes an Agentic RAG framework that dynamically selects between retrieval and reasoning strategies. Result: Agentic RAG achieves the highest accuracy, 89.54%, with the openai/gpt-oss-120b model, outperforming other configurations and producing higher-quality rationales. Conclusion: RAG-based methods improve the accuracy and reliability of Bangla medical QA, and the proposed Agentic RAG framework and new datasets lay a foundation for multilingual medical AI research. Abstract: Developing accurate biomedical Question Answering (QA) systems in low-resource languages remains a major challenge, limiting equitable access to reliable medical knowledge. This paper introduces BanglaMedQA and BanglaMMedBench, the first large-scale Bangla biomedical Multiple Choice Question (MCQ) datasets designed to evaluate reasoning and retrieval in medical artificial intelligence (AI). The study applies and benchmarks several Retrieval-Augmented Generation (RAG) strategies, including Traditional, Zero-Shot Fallback, Agentic, Iterative Feedback, and Aggregate RAG, combining textbook-based and web retrieval with generative reasoning to improve factual accuracy. A key novelty lies in integrating a Bangla medical textbook corpus through Optical Character Recognition (OCR) and implementing an Agentic RAG pipeline that dynamically selects between retrieval and reasoning strategies. Experimental results show that the Agentic RAG achieved the highest accuracy (89.54%) with openai/gpt-oss-120b, outperforming other configurations and demonstrating superior rationale quality. These findings highlight the potential of RAG-based methods to enhance the reliability and accessibility of Bangla medical QA, establishing a foundation for future research in multilingual medical artificial intelligence.
[43] When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection
Alamgir Munir Qazi,John P. McCrae,Jamal Abdul Nasir
Main category: cs.CL
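TL;DR: This paper proposes DeReC, a lightweight fact verification framework that combines dense retrieval with classification over general-purpose text embeddings, and outperforms LLM-based explanation-generating methods in both accuracy and efficiency.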
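The dense retrieval step named in this entry can be illustrated with a minimal sketch. It assumes embeddings are already available as plain lists of floats and ranks evidence by cosine similarity. The names `cosine` and `retrieve_top_k` and the toy vectors are hypothetical and are not taken from DeReC; the classifier that would consume the retrieved evidence is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_top_k(claim_vec, evidence_vecs, k=2):
    """Return indices of the k evidence vectors most similar to the claim."""
    ranked = sorted(range(len(evidence_vecs)),
                    key=lambda i: cosine(claim_vec, evidence_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dimensional "embeddings" (hypothetical data); a real system would
# obtain these from a text-embedding model.
claim = [1.0, 0.0, 0.0]
evidence = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.0, 0.7]]
top = retrieve_top_k(claim, evidence, k=2)
```

In a pipeline of this shape, the retrieved evidence would then be passed to a separate classifier instead of an autoregressive LLM, which is where the efficiency argument in the entry comes from.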
Details
Motivation: Current LLM-based fact verification methods have high computational cost and hallucination risk, which limits their deployment in practice. Method: Proposes the DeReC framework, which uses dense retrieval to gather relevant evidence and a specialized classifier to judge veracity, avoiding autoregressive LLM generation of explanations. Result: On RAWFC the F1 score is 65.58%, exceeding the previous best method L-Defense (61.20%); runtime is 95% lower than LLM-based methods (from about 454 minutes to about 23 minutes), and 92% lower on LIAR-RAW. Conclusion: Carefully engineered retrieval-based systems can match or exceed LLM performance on specialized tasks while being more practical and efficient to deploy. Abstract: The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
[44] Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning
Mohammad Atif Quamar,Mohammad Areeb
Main category: cs.CL
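TL;DR: LEASH is a training-free decoding algorithm that adaptively stops rationale generation by monitoring the slope of token-level entropy and the improvement in the top-logit margin. It cuts token usage by 30-35% and latency by 27%, at the cost of a 10 percentage point drop in accuracy.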
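The stopping rule summarized in this entry (halt once the entropy slope and the top-logit margin improvement both plateau) might look roughly like the sketch below. It is not the authors' implementation: the window length, the tolerances, and the slope estimate are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_margin(logits):
    """Gap between the largest and second-largest logit."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second

def should_stop(entropies, margins, window=4, slope_tol=0.01, margin_tol=0.01):
    """Return True once both signals have plateaued over the last `window` steps.

    window, slope_tol and margin_tol are hypothetical knobs, not paper values.
    """
    if len(entropies) < window or len(margins) < window:
        return False
    recent_e = entropies[-window:]
    recent_m = margins[-window:]
    # Average per-step change serves as a crude slope estimate.
    entropy_slope = (recent_e[-1] - recent_e[0]) / (window - 1)
    margin_gain = (recent_m[-1] - recent_m[0]) / (window - 1)
    return abs(entropy_slope) < slope_tol and margin_gain < margin_tol
```

A decoding loop would append `token_entropy(...)` and `top_margin(...)` for each generated token and end the rationale as soon as `should_stop` returns True.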
Details
Motivation: Conventional Chain-of-Thought prompting generates fixed-length rationales, which wastes computation and inflates token usage and latency. Method: Proposes the LEASH algorithm, which uses trends in logit entropy and top-logit margin to decide whether the model has reached a stable reasoning state, and terminates generation early at that point. Result: On the GSM8K and AQuA-RAT benchmarks, across four instruction-tuned models, it reduces average token generation by 30-35% and latency by 27%, with an accuracy drop of about 10 p.p. Conclusion: LEASH is a model-agnostic, training-free alternative to CoT decoding that improves reasoning efficiency substantially in exchange for a stated loss in accuracy. Abstract: Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates the generation once both signals plateau, indicating the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30-35% and latency by 27%, while incurring a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
cs.CV [Back]
[45] LoRA-Edge: Tensor-Train-Assisted LoRA for Practical CNN Fine-Tuning on Edge Devices
Hyunseok Kwak,Kyeongwon Lee,Jae-Jin Lee,Woojoo Lee
Main category: cs.CV
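TL;DR: LoRA-Edge is a parameter-efficient fine-tuning method for edge devices that builds on Low-Rank Adaptation (LoRA) and adds tensor-train decomposition (TT-SVD). It sharply reduces the number of trainable parameters while leaving inference cost unchanged, making it suitable for on-device fine-tuning of CNNs under tight resource budgets.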
Details
Motivation: Full fine-tuning of CNNs on edge devices is infeasible under memory, compute, and energy limits, yet coping with domain shift requires fine-tuning, so an efficient, structure-aligned parameter fine-tuning method is needed. Method: Proposes LoRA-Edge, which (1) applies TT-SVD to pre-trained convolutional layers; (2) selectively updates only the output-side core, using zero-initialization so the auxiliary path is inactive at the start; and (3) fuses the update back into the original dense kernels, leaving inference cost unchanged. Result: Trainable parameters drop by up to two orders of magnitude compared with full fine-tuning; across several HAR datasets and CNN backbones, updating at most 1.49% of parameters keeps accuracy within 4.7% of full fine-tuning, convergence is 1.4-3.8x faster, and the method outperforms existing PEFT baselines. Conclusion: LoRA-Edge makes structure-aligned, parameter-efficient on-device fine-tuning of CNNs practical, offering a workable solution for model adaptation on edge platforms. Abstract: On-device fine-tuning of CNNs is essential to withstand domain shift in edge applications such as Human Activity Recognition (HAR), yet full fine-tuning is infeasible under strict memory, compute, and energy budgets. We present LoRA-Edge, a parameter-efficient fine-tuning (PEFT) method that builds on Low-Rank Adaptation (LoRA) with tensor-train assistance. LoRA-Edge (i) applies Tensor-Train Singular Value Decomposition (TT-SVD) to pre-trained convolutional layers, (ii) selectively updates only the output-side core with zero-initialization to keep the auxiliary path inactive at the start, and (iii) fuses the update back into dense kernels, leaving inference cost unchanged. This design preserves convolutional structure and reduces the number of trainable parameters by up to two orders of magnitude compared to full fine-tuning. Across diverse HAR datasets and CNN backbones, LoRA-Edge achieves accuracy within 4.7% of full fine-tuning while updating at most 1.49% of parameters, consistently outperforming prior parameter-efficient baselines under similar budgets. On a Jetson Orin Nano, TT-SVD initialization and selective-core training yield 1.4-3.8x faster convergence to target F1. LoRA-Edge thus makes structure-aligned, parameter-efficient on-device CNN adaptation practical for edge platforms.
[46] SILVI: Simple Interface for Labeling Video Interactions
Ozan Kanbertay,Richard Vogg,Elif Karakoc,Peter M. Kappeler,Claudia Fichtel,Alexander S. Ecker
Main category: cs.CV
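TL;DR: SILVI is an open-source video annotation tool that labels both animal behaviors and interactions between individuals, filling a gap in existing tools at the intersection of behavioral ecology and computer vision.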
Details
Motivation: Existing open-source annotation tools cannot support both localization of individuals and labeling of interactions, which limits fine-grained analysis of social and individualized animal behavior. Method: Develops SILVI, an open-source annotation software that integrates behavioral labeling with localization of individuals, allowing behaviors and interactions to be annotated directly in video and producing structured outputs suitable for training and validating computer vision models. Result: SILVI supports joint annotation of animal behaviors and interactions, exports structured data for training computer vision models, and could be extended to annotating human interactions. Conclusion: SILVI bridges the tooling gap between behavioral ecology and computer vision and facilitates the development of automated, fine-grained behavioral analysis. Abstract: Computer vision methods are increasingly used for the automated analysis of large volumes of video data collected through camera traps, drones, or direct observations of animals in the wild. While recent advances have focused primarily on detecting individual actions, much less work has addressed the detection and annotation of interactions -- a crucial aspect for understanding social and individualized animal behavior. Existing open-source annotation tools support either behavioral labeling without localization of individuals, or localization without the capacity to capture interactions. To bridge this gap, we present SILVI, an open-source labeling software that integrates both functionalities. SILVI enables researchers to annotate behaviors and interactions directly within video data, generating structured outputs suitable for training and validating computer vision models. By linking behavioral ecology with computer vision, SILVI facilitates the development of automated approaches for fine-grained behavioral analyses. Although developed primarily in the context of animal behavior, SILVI could be useful more broadly to annotate human interactions in other videos that require extracting dynamic scene graphs. The software, along with documentation and download instructions, is available at: https://gitlab.gwdg.de/kanbertay/interaction-labelling-app.
[47] Noise Injection: Improving Out-of-Distribution Generalization for Limited Size Datasets
Duong Mai,Lawrence Hall
Main category: cs.CV
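TL;DR: This study examines injecting basic noise (Gaussian, speckle, Poisson, and salt-and-pepper) during training to improve how well deep learning models generalize across data distributions, with a focus on detecting COVID-19 in chest X-rays.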
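The four noise types named in this entry can be sketched in a few lines. This is a standard-library illustration under stated assumptions, not the authors' training code: images are flattened lists of intensities in [0, 1], and the parameters `sigma`, `scale`, and `amount` are hypothetical defaults.

```python
import math
import random

def clip01(x):
    """Clamp a pixel value to the [0, 1] range."""
    return min(1.0, max(0.0, x))

def gaussian_noise(img, rng, sigma=0.05):
    """Additive noise: x + N(0, sigma^2)."""
    return [clip01(x + rng.gauss(0.0, sigma)) for x in img]

def speckle_noise(img, rng, sigma=0.05):
    """Multiplicative noise: x + x * N(0, sigma^2)."""
    return [clip01(x + x * rng.gauss(0.0, sigma)) for x in img]

def _poisson(lam, rng):
    # Knuth's method; adequate for the small rates used here.
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def poisson_noise(img, rng, scale=30.0):
    """Shot noise: resample each pixel as a Poisson count at `scale` photons."""
    return [clip01(_poisson(x * scale, rng) / scale) for x in img]

def salt_and_pepper(img, rng, amount=0.05):
    """Set roughly a fraction `amount` of pixels to 0 or 1 at random."""
    out = []
    for x in img:
        r = rng.random()
        out.append(0.0 if r < amount / 2 else 1.0 if r < amount else x)
    return out

rng = random.Random(0)
image = [i / 15 for i in range(16)]  # toy flattened 4x4 image
noisy = {f.__name__: f(image, rng) for f in
         (gaussian_noise, speckle_noise, poisson_noise, salt_and_pepper)}
```

In training, one of these functions would typically be applied to each input batch as an augmentation step; how the paper schedules or mixes the noise types is not stated in this entry.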
Details
Motivation: Deep learning models for image recognition generalize poorly to data from different devices or populations. In COVID-19 detection from CXRs in particular, models tend to rely on source-specific artifacts in the training data instead of genuine biomarkers, so performance drops on OOD data from new clinical sources. Method: Injects several basic noise types (Gaussian, speckle, Poisson, salt-and-pepper) during training to weaken the model's reliance on source-specific shortcuts and make it more robust to distribution shift. Result: The method substantially narrows the performance gap between ID and OOD data, reducing it from 0.10-0.20 to 0.01-0.06 on key metrics such as AUC, F1, accuracy, recall, and specificity (averaged over ten random seeds). Conclusion: Noise injection is a simple and effective way to improve cross-distribution generalization of deep learning models in medical imaging, supporting more reliable deployment in real clinical settings. Abstract: Deep learned (DL) models for image recognition have been shown to fail to generalize to data from different devices, populations, etc. COVID-19 detection from Chest X-rays (CXRs), in particular, has been shown to fail to generalize to out-of-distribution (OOD) data from new clinical sources not covered in the training set. This occurs because models learn to exploit shortcuts - source-specific artifacts that do not translate to new distributions - rather than reasonable biomarkers to maximize performance on in-distribution (ID) data. To make the models more robust to distribution shifts, our study investigates the use of fundamental noise injection techniques (Gaussian, Speckle, Poisson, and Salt and Pepper) during training. Our empirical results demonstrate that this technique can significantly reduce the performance gap between ID and OOD evaluation from 0.10-0.20 to 0.01-0.06, based on results averaged over ten random seeds across key metrics such as AUC, F1, accuracy, recall and specificity. Our source code is publicly available at https://github.com/Duongmai127/Noisy-ood
[48] Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures
Florence Klitzner,Blanca Inigo,Benjamin D. Killeen,Lalithkumar Seenivasan,Michelle Song,Axel Krieger,Mathias Unberath
Main category: cs.CV
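TL;DR: This paper examines imitation-learning robot control policies for cannula insertion guided by bi-planar X-ray. It presents a highly realistic simulation sandbox and an accompanying dataset, and trains imitation learning policies that rely on visual information alone. Preliminary results show promise across vertebral levels and complex anatomy, though limitations remain, notably in entry point precision.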
Details
Motivation: Because interpreting multi-view X-rays is complex, it is unclear whether video-based imitation learning robot control policies apply to X-ray-guided spine procedures; the paper explores the opportunities and challenges of the approach in this setting. Method: Develops a highly realistic in silico simulation sandbox, curates a dataset of correct trajectories with corresponding bi-planar X-ray sequences, and trains imitation learning policies for planning and open-loop control that iteratively align a cannula from visual information alone. Result: The policy succeeds on the first attempt in 68.5% of cases, maintains safe intra-pedicular trajectories across vertebral levels, generalizes to complex anatomy including fractures, and is robust to varied initializations; tests on real X-ray images show that the model produces plausible trajectories despite being trained only in simulation. Conclusion: The preliminary results are encouraging, but the method has limitations such as entry point precision, and full closed-loop control will require more frequent feedback; with stronger priors and domain knowledge, such models may provide a foundation for lightweight, CT-free robotic intra-operative spinal navigation. Abstract: Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation. This is because interpretation of multi-view X-rays is complex. We examine opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy generalized to complex anatomy, including fractures, and remained robust to varied initializations. Rollouts on real bi-planar X-rays further suggest that the model can produce plausible trajectories, despite training exclusively in simulation. While these preliminary results are promising, we also identify limitations, especially in entry point precision. Full closed-loop control will require additional considerations around how to provide sufficiently frequent feedback.
With more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.
[49] Desert Waste Detection and Classification Using Data-Based and Model-Based Enhanced YOLOv12 DL Model
Abdulmumin Sa'ad,Sulaimon Oyeniyi Adebayo,Abdul Jabbar Siddiqui
Main category: cs.CV
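TL;DR: Proposes a real-time waste detection framework that combines a lightweight YOLOv12 with self-adversarial training and data augmentation, substantially improving the accuracy and efficiency of detecting organic and hazardous waste in desert environments.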
Details
Motivation: Traditional waste collection is inefficient and hazardous in harsh environments such as deserts, and existing research concentrates on urban settings and recyclables, neglecting organic and hazardous waste and the needs of remote areas. Method: Uses a pruned, lightweight YOLOv12 model combined with Self-Adversarial Training (SAT) and specialized data augmentation strategies, trained and validated on the DroneTrashNet dataset. Result: Precision, recall, and mAP improve significantly, with low latency and a compact model size suited to resource-constrained drones, outperforming current lightweight YOLO variants. Conclusion: Combining data-centric and model-centric enhancements improves real-time waste detection in difficult environments and has practical potential. Abstract: The global waste crisis is escalating, with solid waste generation expected to increase by 70% by 2050. Traditional waste collection methods, particularly in remote or harsh environments like deserts, are labor-intensive, inefficient, and often hazardous. Recent advances in computer vision and deep learning have opened the door to automated waste detection systems, yet most research focuses on urban environments and recyclable materials, overlooking organic and hazardous waste and underexplored terrains such as deserts. In this work, we propose an enhanced real-time object detection framework based on a pruned, lightweight version of YOLOv12 integrated with Self-Adversarial Training (SAT) and specialized data augmentation strategies. Using the DroneTrashNet dataset, we demonstrate significant improvements in precision, recall, and mean average precision (mAP), while achieving low latency and compact model size suitable for deployment on resource-constrained aerial drones. Benchmarking our model against state-of-the-art lightweight YOLO variants further highlights its optimal balance of accuracy and efficiency. Our results validate the effectiveness of combining data-centric and model-centric enhancements for robust, real-time waste detection in desert environments.
[50] Improving Diagnostic Performance on Small and Imbalanced Datasets Using Class-Based Input Image Composition
Hlali Azzeddine,Majid Ben Yakhlef,Soulaiman El Hazzat
Main category: cs.CV
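TL;DR: This paper proposes Class-Based Image Composition, which fuses images of the same class into composite input images (CoImg) to improve the diagnostic performance of deep learning models on small, class-imbalanced datasets.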
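The composition idea in this entry (several same-class images joined into one training input) can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: images are lists of pixel rows of equal size, the abstract's 3x1 layout is read here as three images side by side (the orientation is an assumption, and a flag covers the stacked alternative), and incomplete groups are simply dropped. All function names are hypothetical.

```python
from collections import defaultdict

def compose(images, side_by_side=True):
    """Join equally sized images (lists of pixel rows) into one composite."""
    if side_by_side:
        # Concatenate the matching rows of every image.
        return [sum(rows, []) for rows in zip(*images)]
    # Otherwise stack the images on top of each other.
    return [row for img in images for row in img]

def build_composites(samples, group_size=3):
    """Group (image, label) pairs by label and compose each full group."""
    by_label = defaultdict(list)
    for image, label in samples:
        by_label[label].append(image)
    composites = []
    for label, imgs in by_label.items():
        for i in range(0, len(imgs) - group_size + 1, group_size):
            composites.append((compose(imgs[i:i + group_size]), label))
    return composites
```

How the paper balances classes when forming composites is not described in this entry, so that step is left out of the sketch.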
Details
Motivation: Small samples, class imbalance, and poor image quality lead to high false prediction rates in deep learning models, so the information density and intra-class diversity of training samples need to be increased. Method: Fuses multiple images of the same class into composite images (CoImg), builds a balanced dataset named Co-OCTDL, and runs a fair comparison using a VGG16 model. Result: On the OCTDL dataset the new method reaches 99.6% accuracy, an F1-score of 0.995, and an AUC of 0.9996, clearly better than the baseline trained on the original dataset, with a much lower false prediction rate. Conclusion: The method improves the model's ability to recognize subtle disease patterns and suits medical image analysis tasks with few samples or imbalanced classes. Abstract: Small, imbalanced datasets and poor input image quality can lead to high false prediction rates with deep learning models. This paper introduces Class-Based Image Composition, an approach that allows us to reformulate training inputs through a fusion of multiple images of the same class into combined visual composites, named Composite Input Images (CoImg). This enhances the intra-class variance, improves the valuable information density per training sample, and increases the ability of the model to distinguish between subtle disease patterns. Our method was evaluated on the Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods (OCTDL) (Kulyabin et al., 2024), which contains 2,064 high-resolution optical coherence tomography (OCT) scans of the human retina, representing seven distinct diseases with a significant class imbalance. We constructed a perfectly class-balanced version of this dataset, named Co-OCTDL, where each scan is represented as a 3x1 layout composite image. To assess the effectiveness of this new representation, we conducted a comparative analysis between the original dataset and its variant using a VGG16 model. A fair comparison was ensured by utilizing the identical model architecture and hyperparameters for all experiments. The proposed approach markedly improved diagnostic results. The enhanced dataset achieved near-perfect accuracy (99.6%) with F1-score (0.995) and AUC (0.9996), compared to a baseline model trained on the raw dataset.
The false prediction rate was also significantly lower, which demonstrates that the method can produce high-quality predictions even for weak datasets affected by class imbalance or small sample size.
[51] I Detect What I Don't Know: Incremental Anomaly Learning with Stochastic Weight Averaging-Gaussian for Oracle-Free Medical Imaging
Nand Kumar Yadav,Rodrigue Rizk,William CW Chen,KC Santosh
Main category: cs.CV
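TL;DR: Proposes a label-free, oracle-free framework that incrementally expands a set of normal samples for unknown anomaly detection in medical images, using lightweight adapter updates and uncertainty gating for efficient and safe domain adaptation.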
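The uncertainty-gated admission in this entry (described in its abstract as dual gates on coreset distance and SWAG-based uncertainty) might be sketched as below. This is a toy illustration, not the authors' code: the uncertainty value is assumed to be supplied by a separate SWAG procedure that is not shown, the z-score gate is taken to be one-sided, and the thresholds `z_max` and `u_max` as well as all names and data are hypothetical.

```python
import math
import statistics

def knn_distance(x, coreset, k=1):
    """Mean Euclidean distance from x to its k nearest coreset embeddings."""
    dists = sorted(math.dist(x, c) for c in coreset)
    return sum(dists[:k]) / k

def admit(x, uncertainty, coreset, seed_dists, z_max=2.0, u_max=0.1, k=1):
    """Admit x into the normal memory only if both gates pass."""
    mu = statistics.mean(seed_dists)
    sigma = statistics.stdev(seed_dists)
    # Gate 1: k-NN distance, expressed as a z-score against seed statistics.
    z = (knn_distance(x, coreset, k) - mu) / sigma
    # Gate 2: epistemic uncertainty must stay below the calibrated bound.
    return z <= z_max and uncertainty <= u_max

# Toy 2-D embeddings and seed-calibration distances (hypothetical data).
coreset = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
seed_dists = [0.9, 1.0, 1.1]
```

With these toy values, `admit([0.5, 0.5], 0.05, coreset, seed_dists)` passes both gates, while a distant point or a high-uncertainty sample is rejected.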
Details
Motivation: Labeled anomalies in medical images are scarce and expert supervision is expensive, so existing methods struggle to detect unknown anomalies; an unsupervised solution that needs no anomaly labels is required. Method: Starts from a frozen pretrained vision backbone and adds small convolutional adapters for rapid domain adaptation; stores features in a compact coreset and scores anomalies with k-nearest neighbors; controls incremental expansion of the normal memory with dual probabilistic gates based on z-score distance and SWAG uncertainty. Result: Anomaly detection improves markedly on several medical imaging datasets: ROC-AUC rises from 0.9489 to 0.9982 on COVID-CXR and from 0.6834 to 0.8968 on Pneumonia CXR; on Brain MRI ND-5, ROC-AUC rises from 0.6041 to 0.7269 and PR-AUC from 0.7539 to 0.8211. Conclusion: The framework is efficient and effective in label-scarce medical settings, and keeps refining its model of normality without generative models or replay buffers while avoiding drift and false inclusion of anomalies. Abstract: Unknown anomaly detection in medical imaging remains a fundamental challenge due to the scarcity of labeled anomalies and the high cost of expert supervision. We introduce an unsupervised, oracle-free framework that incrementally expands a trusted set of normal samples without any anomaly labels. Starting from a small, verified seed of normal images, our method alternates between lightweight adapter updates and uncertainty-gated sample admission. A frozen pretrained vision backbone is augmented with tiny convolutional adapters, ensuring rapid domain adaptation with negligible computational overhead. Extracted embeddings are stored in a compact coreset enabling efficient k-nearest neighbor (k-NN) anomaly scoring. Safety during incremental expansion is enforced by dual probabilistic gates: a sample is admitted into the normal memory only if its distance to the existing coreset lies within a calibrated z-score threshold, and its SWAG-based epistemic uncertainty remains below a seed-calibrated bound. This mechanism prevents drift and false inclusions without relying on generative reconstruction or replay buffers. Empirically, our system steadily refines the notion of normality as unlabeled data arrive, producing substantial gains over baselines. On COVID-CXR, ROC-AUC improves from 0.9489 to 0.9982 (F1: 0.8048 to 0.9746); on Pneumonia CXR, ROC-AUC rises from 0.6834 to 0.8968; and on Brain MRI ND-5, ROC-AUC increases from 0.6041 to 0.7269 and PR-AUC from 0.7539 to 0.8211.
These results highlight the effectiveness and efficiency of the proposed framework for real-world, label-scarce medical imaging applications.[52] Adaptive Temporal Refinement: Continuous Depth Allocation and Distance Regression for Efficient Action Localization
Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
Main category: cs.CV
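TL;DR: Proposes two complementary methods, Boundary Distance Regression (BDR) and Adaptive Temporal Refinement (ATR), to improve the accuracy and computational efficiency of temporal action localization. BDR achieves better boundary detection through signed-distance regression, and ATR allocates computation adaptively through continuous depth selection.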
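To make the signed-distance idea in the TL;DR above concrete, the sketch below builds per-frame regression targets for one annotated action. It is an illustration under stated assumptions rather than the paper's formulation: the helper name `signed_boundary_distance`, the frame-index units, and the sign convention (negative inside an action, positive outside) are all hypothetical.

```python
def signed_boundary_distance(num_frames, segments):
    """Per-frame distance to the nearest action boundary.

    segments: list of (start_frame, end_frame) pairs, inclusive.
    Sign convention (an assumption): negative inside an action, positive outside.
    """
    targets = []
    for t in range(num_frames):
        inside = any(s <= t <= e for s, e in segments)
        dist = min(min(abs(t - s), abs(t - e)) for s, e in segments)
        targets.append(-dist if inside else dist)
    return targets

# One action spanning frames 3..6 in a 10-frame clip.
targets = signed_boundary_distance(10, [(3, 6)])

# Boundaries are recovered as the frames where the target is zero.
boundaries = [t for t, d in enumerate(targets) if d == 0]
```

A regression head trained on such targets localizes boundaries at the zero crossings of its output, instead of thresholding a per-frame boundary probability as a classifier would.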
Details
Motivation: Existing methods apply uniform computation to boundaries of varying difficulty, ignoring differences in boundary complexity, which limits both efficiency and accuracy.
Method: Proposes Boundary Distance Regression (BDR) for information-theoretically optimal localization and designs Adaptive Temporal Refinement (ATR) for end-to-end differentiable allocation of computation; knowledge distillation is used to reduce training cost.
Result: BDR sharpens boundary peaks by 43% and yields 1.8 to 3.1% mAP@0.7 gains across multiple architectures; on THUMOS14, ATR achieves a 2.9% improvement with 18% less compute, rising to 4.2% on short actions; lightweight student models retain 99% of performance.
Conclusion: BDR and ATR effectively improve boundary accuracy and computational efficiency in temporal action localization, apply broadly across architectures, and are validated on multiple benchmarks with rigorous statistical testing.
Abstract: Temporal action localization requires precise boundary detection; however, current methods apply uniform computation despite significant variations in difficulty across boundaries. We present two complementary contributions. First, Boundary Distance Regression (BDR) provides information-theoretically optimal localization through signed-distance regression rather than classification, achieving 43\% sharper boundary peaks. BDR retrofits to existing methods with approximately 50 lines of code, yielding consistent 1.8 to 3.1\% mAP@0.7 improvements across diverse architectures. Second, Adaptive Temporal Refinement (ATR) allocates computation via continuous depth selection $\tau \in [0,1]$, enabling end-to-end differentiable optimization without reinforcement learning. On THUMOS14, ATR achieves 56.5\% mAP@0.7 at 162G FLOPs, compared to 53.6\% at 198G for uniform processing, providing a 2.9\% improvement with 18\% less compute. Gains scale with boundary heterogeneity, showing 4.2\% improvement on short actions. Training cost is mitigated via knowledge distillation, with lightweight students retaining 99\% performance at baseline cost. Results are validated across four benchmarks with rigorous statistical testing.
[53] Improving Multi-View Reconstruction via Texture-Guided Gaussian-Mesh Joint Optimization
Zhejia Cai,Puhua Jiang,Shiwei Mao,Hongkun Cao,Ruqi Huang
Main category: cs.CV
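TL;DR: Proposes a Gaussian-mesh joint optimization framework that unifies geometry and appearance optimization, achieving high-quality 3D reconstruction through Gaussian-guided differentiable rendering and supporting downstream tasks such as relighting and shape deformation.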
Details
Motivation: Existing methods usually separate geometry and appearance optimization, which leads to poor performance on editing tasks; this paper aims to unify the two to improve reconstruction quality and editability.
Method: Proposes a new framework that uses photometric consistency from the input images and geometric regularization from normal and depth maps to simultaneously optimize mesh vertex positions, faces, and vertex colors through Gaussian-guided differentiable mesh rendering.
Result: Achieves high-quality 3D reconstruction from multi-view images that balances geometric accuracy with rendering realism and supports editing tasks such as relighting and shape deformation.
Conclusion: The method effectively combines geometry and appearance optimization, improving the editability and practicality of 3D reconstructions for areas such as AR/VR and digital content creation.
Abstract: Reconstructing real-world objects from multi-view images is essential for applications in 3D editing, AR/VR, and digital content creation. Existing methods typically prioritize either geometric accuracy (Multi-View Stereo) or photorealistic rendering (Novel View Synthesis), often decoupling geometry and appearance optimization, which hinders downstream editing tasks. This paper advocates a unified treatment of geometry and appearance optimization for seamless Gaussian-mesh joint optimization. More specifically, we propose a novel framework that simultaneously optimizes mesh geometry (vertex positions and faces) and vertex colors via Gaussian-guided mesh differentiable rendering, leveraging photometric consistency from input images and geometric regularization from normal and depth maps. The resulting high-quality 3D reconstruction can be further exploited in downstream editing tasks, such as relighting and shape deformation. The code will be publicly available upon acceptance.
[54] A Linear Fractional Transformation Model and Calibration Method for Light Field Camera
Zhong Chen,Changfeng Chen
Main category: cs.CV
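TL;DR: Proposes a method for calibrating the intrinsic parameters of a light field camera based on a linear fractional transformation parameter α; by decoupling the main lens from the micro lens array and combining an analytical solution with nonlinear optimization, it improves calibration accuracy for 3D reconstruction and speeds up simulation.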
Details
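Motivation: Accurate calibration of a light field camera's intrinsic parameters is essential for 3D reconstruction, but existing methods struggle to decouple the main lens from the micro lens array.
Method: Introduces a linear fractional transformation parameter α to decouple the main lens and the micro lens array, computes an analytical solution based on least squares, and refines it with nonlinear optimization; a method for detecting features in raw images is also proposed.
Result: Experiments on real and simulated data verify the method's effectiveness, and light field image simulation based on the new model is markedly faster.
Conclusion: The method improves the accuracy and efficiency of intrinsic calibration for light field cameras and speeds up light field image simulation, which benefits data-driven deep learning applications.
Abstract: Accurate calibration of internal parameters is a crucial yet challenging prerequisite for 3D reconstruction using light field cameras. In this paper, we propose a linear fractional transformation (LFT) parameter $\alpha$ to decouple the main lens and the micro lens array (MLA). The proposed method includes an analytical solution based on least squares, followed by nonlinear refinement. A method for detecting features in the raw images is also introduced. Experimental results on both physical and simulated data verify the performance of the proposed method. Based on the proposed model, the simulation of raw light field images becomes faster, which is crucial for data-driven deep learning methods. The corresponding code can be obtained from the author's website.
[55] Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images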
Sam Bahrami,Dylan Campbell
Main category: cs.CV
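TL;DR: Presents Room Envelopes, a synthetic dataset that helps monocular geometry estimators predict both the visible surfaces and the structural layout of a scene, enabling complete 3D reconstruction that includes occluded surfaces.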
Details
Motivation: Existing scene reconstruction methods recover only visible 3D surfaces and cannot handle occluded parts, yet the structural elements of a scene (such as walls, floors, and ceilings) are usually simple and repetitive, so lower-cost prediction methods are worth studying.
Method: Builds a synthetic dataset called Room Envelopes containing RGB images and two corresponding pointmaps per image, one for the visible surface and one for the structural layout surface once fittings and fixtures are removed, providing direct supervision for feed-forward monocular geometry estimators.
Result: The dataset supports direct supervised learning, letting a model predict both the visible surface and the structural layout surface and thereby understand the scene's extent and the shape and location of its objects.
Conclusion: Room Envelopes opens new possibilities for monocular 3D reconstruction of complete scene structure and advances research on generating and predicting structural scene elements.
Abstract: Modern scene reconstruction methods are able to accurately recover 3D surfaces that are visible in one or more images. However, this leads to incomplete reconstructions, missing all occluded surfaces. While much progress has been made on reconstructing entire objects given partial observations using generative models, the structural elements of a scene, like the walls, floors and ceilings, have received less attention. We argue that these scene elements should be relatively easy to predict, since they are typically planar, repetitive and simple, and so less costly approaches may be suitable. In this work, we present a synthetic dataset -- Room Envelopes -- that facilitates progress on this task by providing a set of RGB images and two associated pointmaps for each image: one capturing the visible surface and one capturing the first surface once fittings and fixtures are removed, that is, the structural layout. As we show, this enables direct supervision for feed-forward monocular geometry estimators that predict both the first visible surface and the first layout surface. This confers an understanding of the scene's extent, as well as the shape and location of its objects.
[56] Simple 3D Pose Features Support Human and Machine Social Scene Understanding
Wenshuo Qin,Leyla Isik
Main category: cs.CV
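TL;DR: The study finds that humans rely on explicit 3D pose information, especially the position and direction of faces, when judging social interactions, and that this information is missing from most current AI vision models; extracting 3D joint positions and building simplified 3D social pose features significantly improves how well AI models predict human social judgments.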
Details
Motivation: To understand how humans quickly recognize social interactions from visual input and to investigate why existing AI vision systems perform poorly on this task, in particular whether they fail to make effective use of 3D pose information.
Method: Combines state-of-the-art pose and depth estimation algorithms to extract the 3D joint positions of people in video clips, builds simplified 3D social pose features covering face position and direction, and compares them against the performance of existing AI vision models.
Result: 3D joint positions outperform most existing AI models; simplified features using only 3D face position and direction achieve the same predictive performance and significantly improve AI model performance; the degree to which 3D social pose features are represented in an AI model correlates positively with its ability to match human judgments.
Conclusion: Human understanding of social scenes relies on explicit 3D pose representations and can be supported by simple, structured visual primitives; incorporating this information into AI models helps narrow the gap with human perception.
Abstract: Humans can quickly and effortlessly extract a variety of information about others' social interactions from visual input, ranging from visuospatial cues like whether two people are facing each other to higher-level information. Yet, the computations supporting these abilities remain poorly understood, and social interaction recognition continues to challenge even the most advanced AI vision systems. Here, we hypothesized that humans rely on 3D visuospatial pose information to make social interaction judgments, which is absent in most AI vision models. To test this, we combined state-of-the-art pose and depth estimation algorithms to extract 3D joint positions of people in short video clips depicting everyday human actions and compared their ability to predict human social interaction judgments with current AI vision models. Strikingly, 3D joint positions outperformed most current AI vision models, revealing that key social information is available in explicit body position but not in the learned features of most vision models, including even the layer-wise embeddings of the pose models used to extract joint positions. To uncover the critical pose features humans use to make social judgments, we derived a compact set of 3D social pose features describing only the 3D position and direction of faces in the videos. We found that these minimal descriptors matched the predictive strength of the full set of 3D joints and significantly improved the performance of off-the-shelf AI vision models when combined with their embeddings. Moreover, the degree to which 3D social pose features were represented in each off-the-shelf AI vision model predicted the model's ability to match human social judgments. Together, our findings provide strong evidence that human social scene understanding relies on explicit representations of 3D pose and can be supported by simple, structured visuospatial primitives.
[57] CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation
Yuwen Tao,Kanglei Zhou,Xin Tan,Yuan Xie
Main category: cs.CV
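TL;DR: Presents CaRF, a framework for language-driven 3D region segmentation in 3D Gaussian space that improves multi-view consistency through camera-aware encoding and paired-view supervision.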
Details
Motivation: Existing methods rely on 2D rendered pseudo-supervision and view-specific feature learning, which causes inconsistency across views.
Method: Proposes Gaussian Field Camera Encoding (GFCE), which incorporates camera geometry into Gaussian-text interactions, and In-Training Paired View Supervision (ITPVS), which aligns Gaussian logits across views to strengthen consistency.
Result: Average mIoU improves by 16.8%, 4.3%, and 2.0% on the Ref LERF, LERF OVS, and 3D OVS datasets, respectively.
Conclusion: CaRF delivers more reliable, multi-view-consistent 3D scene understanding, with potential applications in embodied AI, AR/VR, and autonomous perception.
Abstract: Referring 3D Gaussian Splatting Segmentation (R3DGS) aims to interpret free-form language expressions and localize the corresponding 3D regions in Gaussian fields. While recent advances have introduced cross-modal alignment between language and 3D geometry, existing pipelines still struggle with cross-view consistency due to their reliance on 2D rendered pseudo-supervision and view-specific feature learning. In this work, we present Camera Aware Referring Field (CaRF), a fully differentiable framework that operates directly in the 3D Gaussian space and achieves multi-view consistency. Specifically, CaRF introduces Gaussian Field Camera Encoding (GFCE), which incorporates camera geometry into Gaussian-text interactions to explicitly model view-dependent variations and enhance geometric reasoning. Building on this, In-Training Paired View Supervision (ITPVS) is proposed to align per-Gaussian logits across calibrated views during training, effectively mitigating single-view overfitting and exposing inter-view discrepancies for optimization. Extensive experiments on three representative benchmarks demonstrate that CaRF achieves average improvements of 16.8%, 4.3%, and 2.0% in mIoU over state-of-the-art methods on the Ref LERF, LERF OVS, and 3D OVS datasets, respectively. Moreover, this work promotes more reliable and view-consistent 3D scene understanding, with potential benefits for embodied AI, AR/VR interaction, and autonomous perception.
[58] PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection
Peiyao Wang,Weining Wang,Qi Li
Main category: cs.CV
TL;DR: Proposes PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation.
Details
Motivation: Existing text-to-video generation models have improved in perceptual quality, but the generated content often violates physical plausibility, which limits their use in embodied AI, robotics, simulation, and similar fields.
Method: Proposes PhysicsRM, a dual-dimensional reward model that quantifies intra-object stability and inter-object interactions, and builds on it PhyDPO, a direct preference optimization pipeline that uses contrastive feedback and physics-aware reweighting. The approach is model-agnostic and scalable and can be integrated into a range of video diffusion and Transformer-based models.
Result: Experiments on multiple benchmarks show that PhysCorr significantly improves the physical realism of generated videos while preserving visual quality and semantic alignment.
Conclusion: PhysCorr is a key step toward physically plausible and trustworthy video generation.
Abstract: Recent advances in text-to-video generation have achieved impressive perceptual quality, yet generated content often violates fundamental principles of physical plausibility, manifesting as implausible object dynamics, incoherent interactions, and unrealistic motion patterns. Such failures hinder the deployment of video generation models in embodied AI, robotics, and simulation-intensive domains. To bridge this gap, we propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. On this foundation, we develop PhyDPO, a novel direct preference optimization pipeline that leverages contrastive feedback and physics-aware reweighting to guide generation toward physically coherent outputs. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones. Extensive experiments across multiple benchmarks demonstrate that PhysCorr achieves significant improvements in physical realism while preserving visual fidelity and semantic alignment. This work takes a critical step toward physically grounded and trustworthy video generation.
[59] GNN-MoE: Context-Aware Patch Routing using GNNs for Parameter-Efficient Domain Generalization
Mahmoud Soliman,Omar Abdelaziz,Ahmed Radwan,Anand,Mohamed Shehata
Main category: cs.CV
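TL;DR: Proposes GNN-MoE, which combines graph neural network routing with a Mixture-of-Experts framework and uses Kronecker adapters for efficient parameter fine-tuning, achieving strong, parameter-efficient performance on domain generalization tasks.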
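The parameter savings behind the Kronecker adapters mentioned in the TL;DR above follow from simple arithmetic, sketched here with NumPy. The layer width (768) and the factor shapes (24×24 and 32×32) are illustrative assumptions, not values taken from the paper, and `kron_apply` is a hypothetical helper name.

```python
import numpy as np

d_out, d_in = 768, 768               # assumed width of one ViT linear layer
rng = np.random.default_rng(0)
A = rng.normal(size=(24, 24))        # small factor (assumed shape)
B = rng.normal(size=(32, 32))        # small factor (assumed shape), 24 * 32 = 768

delta_W = np.kron(A, B)              # weight update of shape (768, 768)

full_params = d_out * d_in           # parameters in a dense update
adapter_params = A.size + B.size     # parameters actually trained

def kron_apply(A, B, x):
    """Compute np.kron(A, B) @ x without materializing the large matrix."""
    X = x.reshape(A.shape[1], B.shape[1])
    return (A @ X @ B.T).reshape(-1)

x = rng.normal(size=d_in)
y_dense = delta_W @ x
y_fast = kron_apply(A, B, x)
```

Under these assumed shapes the adapter trains 1,600 values in place of 589,824, and the update can be applied through two small matrix products.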
Details
Motivation: To address the high cost of standard fine-tuning of pretrained Vision Transformers for domain generalization and the harm it does to generalization, improving parameter efficiency and model robustness.
Method: Designs a graph neural network (GNN) router (e.g., GCN, GAT, SAGE) that operates on inter-patch graphs and dynamically assigns image patches to specialized experts, using inter-patch relationships for context-aware routing, with Kronecker adapters providing lightweight parameter adaptation.
Result: Achieves state-of-the-art or competitive performance on multiple domain generalization benchmarks while remaining highly parameter-efficient, validating graph-based contextual routing.
Conclusion: By introducing graph-structure-aware dynamic routing, GNN-MoE significantly improves ViT generalization and efficiency on unseen domains and offers a new approach to lightweight domain generalization.
Abstract: Domain generalization (DG) seeks robust Vision Transformer (ViT) performance on unseen domains. Efficiently adapting pretrained ViTs for DG is challenging; standard fine-tuning is costly and can impair generalization. We propose GNN-MoE, enhancing Parameter-Efficient Fine-Tuning (PEFT) for DG with a Mixture-of-Experts (MoE) framework using efficient Kronecker adapters. Instead of token-based routing, a novel Graph Neural Network (GNN) router (GCN, GAT, SAGE) operates on inter-patch graphs to dynamically assign patches to specialized experts. This context-aware GNN routing leverages inter-patch relationships for better adaptation to domain shifts. GNN-MoE achieves state-of-the-art or competitive DG benchmark performance with high parameter efficiency, highlighting the utility of graph-based contextual routing for robust, lightweight DG.
[60] MedDChest: A Content-Aware Multimodal Foundational Vision Model for Thoracic Imaging
Mahmoud Soliman,Islam Osman,Mohamed S. Shehata,Rasika Rajapakshe
Main category: cs.CV
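TL;DR: MedDChest is a new Vision Transformer model designed for thoracic imaging; trained from scratch on a large-scale medical image dataset and combined with a content-aware data augmentation method (Guided Random Resized Crops), it significantly outperforms ImageNet-pretrained models.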
Details
Motivation: Existing vision models are limited in medical imaging mainly because of the domain gap between their backbones, which are pre-trained on natural images, and medical images.
Method: Proposes MedDChest, a ViT model dedicated to thoracic imaging, pre-trained from scratch on more than 1.2 million multimodal medical images (such as X-ray and CT) from 10 public sources, and introduces Guided Random Resized Crops, a content-aware data augmentation strategy that preferentially samples anatomically relevant regions.
Result: After fine-tuning on multiple downstream diagnostic tasks, MedDChest significantly outperforms existing ImageNet-pretrained models, validating large-scale in-domain pre-training combined with domain-specific data augmentation.
Conclusion: Through in-domain pre-training and targeted data augmentation, MedDChest provides a more powerful and robust feature extractor for thoracic image analysis and a better starting point for thoracic diagnostic tasks; the model weights will be released to support follow-up research.
Abstract: The performance of vision models in medical imaging is often hindered by the prevailing paradigm of fine-tuning backbones pre-trained on out-of-domain natural images. To address this fundamental domain gap, we propose MedDChest, a new foundational Vision Transformer (ViT) model optimized specifically for thoracic imaging. We pre-trained MedDChest from scratch on a massive, curated, multimodal dataset of over 1.2 million images, encompassing different modalities including Chest X-ray and Computed Tomography (CT) compiled from 10 public sources. A core technical contribution of our work is Guided Random Resized Crops, a novel content-aware data augmentation strategy that biases sampling towards anatomically relevant regions, overcoming the inefficiency of standard cropping techniques on medical scans. We validate our model's effectiveness by fine-tuning it on a diverse set of downstream diagnostic tasks. Comprehensive experiments empirically demonstrate that MedDChest significantly outperforms strong, publicly available ImageNet-pretrained models. By establishing the superiority of large-scale, in-domain pre-training combined with domain-specific data augmentation, MedDChest provides a powerful and robust feature extractor that serves as a significantly better starting point for a wide array of thoracic diagnostic tasks. The model weights will be made publicly available to foster future research and applications.
[61] Near-Lossless 3D Voxel Representation Free from Iso-surface
Yihao Luo,Xianglong He,Chuanyu Pan,Yiwen Chen,Jiaqi Wu,Yangguang Li,Wanli Ouyang,Yuanming Hu,Guang Yang,ChoonHwai Yap
Main category: cs.CV
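TL;DR: Proposes Faithful Contouring, a sparse voxelized representation that supports high resolutions without water-tightening or iso-surface extraction and achieves near-lossless geometric fidelity.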
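Geometric fidelity of a representation like the one in the TL;DR above is commonly measured with the Chamfer Distance between point sets sampled from the original and reconstructed meshes. The sketch below uses one common definition (symmetric mean of squared nearest-neighbor distances); definitions vary across papers, and which variant this paper uses is not stated here, so treat the exact formula as an assumption.

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer Distance between point sets P (n, 3) and Q (m, 3).

    Uses squared Euclidean distances and means in both directions;
    other papers use unsquared distances or sums instead.
    """
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1)  # (n, m) pairwise squared distances
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

P = np.array([[0.0, 0.0, 0.0]])
Q = np.array([[1.0, 0.0, 0.0]])

same = chamfer_distance(P, P)      # identical sets
shifted = chamfer_distance(P, Q)   # one point moved by 1 unit
```

Identical point sets score zero, and the score grows with the squared displacement, so a lower value means the reconstruction stays closer to the source surface.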
Details
Motivation: Existing iso-surface-based voxelized representations of 3D meshes rely on water-tightening or rendering optimization, which degrades geometric fidelity and makes it hard to represent complex geometry and topology accurately.
Method: Proposes Faithful Contouring, a sparse voxelized representation that requires neither converting meshes to field functions nor extracting an iso-surface during remeshing, and designs a dual-mode autoencoder for scalable, detail-preserving shape reconstruction.
Result: Reaches distance errors at the $10^{-5}$ level for direct representation; for mesh reconstruction, Chamfer Distance drops by 93% and F-score improves by 35%; sharp features and internal structures are preserved, and high resolutions (2048+) are supported.
Conclusion: Faithful Contouring outperforms existing methods in accuracy and efficiency and is a high-fidelity, flexible, and scalable mesh representation for 3D learning tasks.
Abstract: Accurate and efficient voxelized representations of 3D meshes are the foundation of 3D reconstruction and generation. However, existing iso-surface-based representations rely heavily on water-tightening or rendering optimization, which inevitably compromises geometric fidelity. We propose Faithful Contouring, a sparse voxelized representation that supports 2048+ resolutions for arbitrary meshes, requiring neither converting meshes to field functions nor extracting the iso-surface during remeshing. It achieves near-lossless fidelity by preserving sharpness and internal structures, even for challenging cases with complex geometry and topology. The proposed method also shows flexibility for texturing, manipulation, and editing. Beyond representation, we design a dual-mode autoencoder for Faithful Contouring, enabling scalable and detail-preserving shape reconstruction. Extensive experiments show that Faithful Contouring surpasses existing methods in accuracy and efficiency for both representation and reconstruction. For direct representation, it achieves distance errors at the $10^{-5}$ level; for mesh reconstruction, it yields a 93\% reduction in Chamfer Distance and a 35\% improvement in F-score over strong baselines, confirming superior fidelity as a representation for 3D learning tasks.
[62] A Hybrid Deep Learning Model for Robust Biometric Authentication from Low-Frame-Rate PPG Signals
Arfina Rahman,Mahesh Banavar
Main category: cs.CV
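TL;DR: Proposes a lightweight biometric authentication framework based on PPG signals extracted from low-frame-rate fingertip videos; combining the continuous wavelet transform with the hybrid deep learning model CVT-ConvMixer-LSTM, it reaches 98% authentication accuracy on the CFIHSR dataset.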
Details
Motivation: PPG signals are non-invasive, offer inherent liveness detection, and suit low-cost wearable devices, but in practice they are affected by motion artifacts, illumination changes, and physiological differences between individuals, so signal robustness needs to be improved.
Method: Applies a standard preprocessing pipeline (baseline drift removal, PCA denoising, bandpass filtering, Fourier resampling, and normalization), converts 1D PPG signals into 2D time-frequency scalograms with the continuous wavelet transform, and designs a hybrid deep learning model combining CVT, ConvMixer, and LSTM to extract spatio-temporal features.
Result: Reaches 98% authentication accuracy on the CFIHSR dataset of 46 subjects, validating the model's robustness to noise and inter-subject variability.
Conclusion: The method is efficient, scalable, and capable of liveness detection, making it suitable for mobile and embedded biometric security systems.
Abstract: Photoplethysmography (PPG) signals, which measure changes in blood volume in the skin using light, have recently gained attention in biometric authentication because of their non-invasive acquisition, inherent liveness detection, and suitability for low-cost wearable devices. However, PPG signal quality is challenged by motion artifacts, illumination changes, and inter-subject physiological variability, making robust feature extraction and classification crucial. This study proposes a lightweight and cost-effective biometric authentication framework based on PPG signals extracted from low-frame-rate fingertip videos. The CFIHSR dataset, comprising PPG recordings from 46 subjects at a sampling rate of 14 Hz, is employed for evaluation. The raw PPG signals undergo a standard preprocessing pipeline involving baseline drift removal, motion artifact suppression using Principal Component Analysis (PCA), bandpass filtering, Fourier-based resampling, and amplitude normalization. To generate robust representations, each one-dimensional PPG segment is converted into a two-dimensional time-frequency scalogram via the Continuous Wavelet Transform (CWT), effectively capturing transient cardiovascular dynamics. We developed a hybrid deep learning model, termed CVT-ConvMixer-LSTM, by combining spatial features from the Convolutional Vision Transformer (CVT) and ConvMixer branches with temporal features from a Long Short-Term Memory network (LSTM). The experimental results on 46 subjects demonstrate an authentication accuracy of 98%, validating the robustness of the model to noise and variability between subjects. Due to its efficiency, scalability, and inherent liveness detection capability, the proposed system is well-suited for real-world mobile and embedded biometric security applications.
[63] Unveiling Deep Semantic Uncertainty Perception for Language-Anchored Multi-modal Vision-Brain Alignment
Zehui Feng,Chenqi Zhang,Mingru Wang,Minuo Wei,Shiwei Cheng,Cuntai Guan,Ting Han
Main category: cs.CV
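TL;DR: Proposes Bratrix, the first end-to-end framework for language-anchored vision-brain alignment; by decoupling visual stimuli into hierarchical semantic components and introducing an uncertainty perception module, it significantly improves retrieval, reconstruction, and captioning on EEG, MEG, and fMRI tasks.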
Details
Motivation: Individual differences and entangled visual features make it challenging to uncover visual semantics from neural signals such as EEG, MEG, and fMRI; existing methods align neural activity only with visual embeddings and lack a semantic dimension, which limits interpretability and robustness.
Method: Proposes the Bratrix framework, which decouples visual stimuli into visual and linguistic semantic components and maps them into a shared latent space to form aligned visual-language and brain-language embeddings; an uncertainty perception module applies weighted alignment, language-anchored semantic matrices strengthen cross-modal correlations, and training follows a two-stage strategy of single-modality pretraining followed by multimodal fine-tuning.
Result: Experiments on EEG, MEG, and fMRI benchmarks show that Bratrix outperforms existing methods on retrieval, reconstruction, and captioning, with a 14.3% gain on the 200-way EEG retrieval task.
Conclusion: Through language-anchored multimodal alignment and uncertainty modeling, Bratrix improves the accuracy and robustness of decoding visual semantics from neural signals and advances the interpretability of brain-signal understanding.
Abstract: Unveiling visual semantics from neural signals such as EEG, MEG, and fMRI remains a fundamental challenge due to subject variability and the entangled nature of visual features. Existing approaches primarily align neural activity directly with visual embeddings, but visual-only representations often fail to capture latent semantic dimensions, limiting interpretability and deep robustness. To address these limitations, we propose Bratrix, the first end-to-end framework to achieve multimodal Language-Anchored Vision-Brain alignment. Bratrix decouples visual stimuli into hierarchical visual and linguistic semantic components, and projects both visual and brain representations into a shared latent space, enabling the formation of aligned visual-language and brain-language embeddings. To emulate human-like perceptual reliability and handle noisy neural signals, Bratrix incorporates a novel uncertainty perception module that applies uncertainty-aware weighting during alignment. By leveraging learnable language-anchored semantic matrices to enhance cross-modal correlations and employing a two-stage training strategy of single-modality pretraining followed by multimodal fine-tuning, Bratrix-M improves alignment precision. Extensive experiments on EEG, MEG, and fMRI benchmarks demonstrate that Bratrix improves retrieval, reconstruction, and captioning performance compared to state-of-the-art methods, including a 14.3% gain on the 200-way EEG retrieval task. Code and model are available.
[64] Adversarial and Score-Based CT Denoising: CycleGAN vs Noise2Score
Abu Hanif Muhammad Syarubany
Main category: cs.CV
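TL;DR: Studies two training-efficient approaches to CT image denoising in the unpaired and self-supervised settings: a CycleGAN-based residual translator and a Noise2Score (N2S) score-matching denoiser. Experiments show that CycleGAN gives the best image quality, while Noise2Score still denoises well when clean paired data are unavailable.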
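For readers comparing the PSNR figures reported for this entry, the short sketch below shows how PSNR relates to mean squared error and what the reported gain from 34.66 dB to 38.913 dB implies. The helper name `psnr` and the unit data range are assumptions for illustration; the paper's exact evaluation code is not shown here.

```python
import math

def psnr(mse, data_range=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(data_range ** 2 / mse)

# Reported values: noisy input 34.66 dB, CycleGAN output 38.913 dB.
gain_db = 38.913 - 34.66

# A gain of g dB corresponds to the MSE shrinking by a factor of 10^(g / 10).
mse_ratio = 10 ** (gain_db / 10)
```

The gain of about 4.25 dB corresponds to the mean squared error falling by a factor of roughly 2.66, which is a more intuitive way to read the improvement.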
Details
Motivation: To explore efficient unsupervised and self-supervised denoising methods for cases where paired clean-noisy CT images are unavailable, improving clinical practicality and data efficiency.
Method: Compares a CycleGAN-based residual translator and a Noise2Score score-matching model under a unified evaluation protocol, selects the best configuration (U-Net structure, loss weights, and so on) through a parameter sweep, and then trains it to convergence.
Result: CycleGAN improves the input from 34.66 dB / 0.9234 SSIM to 38.913 dB / 0.971 SSIM and scores 1.9343 on the unseen Kaggle set; Noise2Score has slightly lower PSNR/SSIM but achieves large gains on very noisy inputs, showing strong robustness.
Conclusion: CycleGAN provides the best image quality and suits high-quality reconstruction needs; Noise2Score, which requires no paired data, is competitive and a reliable alternative in practice.
Abstract: We study CT image denoising in the unpaired and self-supervised regimes by evaluating two strong, training-data-efficient paradigms: a CycleGAN-based residual translator and a Noise2Score (N2S) score-matching denoiser. Under a common evaluation protocol, a configuration sweep identifies a simple standard U-Net backbone within CycleGAN (lambda_cycle = 30, lambda_iden = 2, ngf = ndf = 64) as the most reliable setting; we then train it to convergence with a longer schedule. The selected CycleGAN improves the noisy input from 34.66 dB / 0.9234 SSIM to 38.913 dB / 0.971 SSIM and attains an estimated score of 1.9441 and an unseen-set (Kaggle leaderboard) score of 1.9343. Noise2Score, while slightly behind in absolute PSNR / SSIM, achieves large gains over very noisy inputs, highlighting its utility when clean pairs are unavailable. Overall, CycleGAN offers the strongest final image quality, whereas Noise2Score provides a robust pair-free alternative with competitive performance. Source code is available at https://github.com/hanifsyarubany/CT-Scan-Image-Denoising-using-CycleGAN-and-Noise2Score.
[65] When Swin Transformer Meets KANs: An Improved Transformer Architecture for Medical Image Segmentation
Nishchal Sapkota,Haoyan Shi,Yejia Zhang,Xianshi Ma,Bofang Zheng,Danny Z. Chen
Main category: cs.CV
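TL;DR: Proposes UKAST, a U-Net-like architecture that integrates rational-function-based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders, improving medical image segmentation performance, particularly in data-scarce settings.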
Details
Motivation: Medical image segmentation faces complex anatomical structures and limited annotated data; conventional CNNs struggle to model long-range dependencies, while Transformers are data-hungry and computationally expensive.
Method: Designs the UKAST architecture, which uses rational base functions and Group Rational KANs (GR-KANs) in place of conventional spline-based KANs and combines them with the global modeling ability of the Swin Transformer, reducing FLOPs with only a small increase in parameter count.
Result: Achieves state-of-the-art results on four 2D and 3D medical image segmentation benchmarks, outperforming CNN and Transformer baselines, with especially strong accuracy when data are scarce.
Conclusion: KAN-enhanced Transformers can improve both data efficiency and model performance in medical image segmentation and have broad application potential.
Abstract: Medical image segmentation is critical for accurate diagnostics and treatment planning, but remains challenging due to complex anatomical structures and limited annotated training data. CNN-based segmentation methods excel at local feature extraction, but struggle with modeling long-range dependencies. Transformers, on the other hand, capture global context more effectively, but are inherently data-hungry and computationally expensive. In this work, we introduce UKAST, a U-Net-like architecture that integrates rational-function-based Kolmogorov-Arnold Networks (KANs) into Swin Transformer encoders. By leveraging rational base functions and Group Rational KANs (GR-KANs) from the Kolmogorov-Arnold Transformer (KAT), our architecture addresses the inefficiencies of vanilla spline-based KANs, yielding a more expressive and data-efficient framework with reduced FLOPs and only a very small increase in parameter count compared to SwinUNETR. UKAST achieves state-of-the-art performance on four diverse 2D and 3D medical image segmentation benchmarks, consistently surpassing both CNN- and Transformer-based baselines. Notably, it attains superior accuracy in data-scarce settings, alleviating the data-hungry limitations of standard Vision Transformers. These results show the potential of KAN-enhanced Transformers to advance data-efficient medical image segmentation. Code is available at: https://github.com/nsapkota417/UKAST
[66] SpatialLock: Precise Spatial Control in Text-to-Image Synthesis
Biao Liu,Yuanzhi Liang
Main category: cs.CV
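TL;DR: Proposes SpatialLock, a new framework that combines perception signals and grounding information to precisely control object positions in text-to-image generation; its two components, Position-Engaged Injection (PoI) and Position-Guided Learning (PoG), achieve IoU scores above 0.9 on multiple datasets, reaching state-of-the-art object positioning performance.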
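The IoU metric quoted in the TL;DR above measures how well a generated object's box overlaps the requested box. A minimal sketch for axis-aligned boxes follows; the `(x1, y1, x2, y2)` box format and the function name `box_iou` are assumptions, and the paper's evaluation pipeline (for example, how generated objects are detected) is not reproduced.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

target = (0, 0, 10, 10)
perfect = box_iou(target, (0, 0, 10, 10))     # exact match
close = box_iou(target, (0, 0, 10, 9.5))      # slightly short box
half_off = box_iou(target, (5, 0, 15, 10))    # shifted by half a width
```

An IoU above 0.9 is a strict bar: in this example a box shifted by half its width scores only one third, while a box that is 5% too short scores 0.95.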
Details
Motivation: Existing text-to-image generation methods make insufficient use of positional information, so they understand the spatial layout of objects poorly and struggle to control object positions precisely.
Method: Proposes the SpatialLock framework with two core components: 1) Position-Engaged Injection (PoI), which integrates spatial information directly through an attention layer to strengthen the model's learning of grounding information; 2) Position-Guided Learning (PoG), which uses perception-based supervision to further refine object localization. Together they improve the spatial accuracy and visual quality of objects in generated images.
Result: Experiments show that SpatialLock achieves IoU scores above 0.9 on multiple datasets, significantly outperforming existing methods and improving both object positioning accuracy and image generation quality.
Conclusion: By jointly using perception signals and grounding information, SpatialLock significantly improves control over object spatial layout in text-to-image generation and offers a new solution for applications that need precise object positioning.
Abstract: Text-to-Image (T2I) synthesis has made significant advancements in recent years, driving applications such as generating datasets automatically. However, precise control over object localization in generated images remains a challenge. Existing methods fail to fully utilize positional information, leading to an inadequate understanding of object spatial layouts. To address this issue, we propose SpatialLock, a novel framework that leverages perception signals and grounding information to jointly control the generation of spatial locations. SpatialLock incorporates two components: Position-Engaged Injection (PoI) and Position-Guided Learning (PoG). PoI directly integrates spatial information through an attention layer, encouraging the model to learn the grounding information effectively. PoG employs perception-based supervision to further refine object localization. Together, these components enable the model to generate objects with precise spatial arrangements and improve the visual quality of the generated images. Experiments show that SpatialLock sets a new state-of-the-art for precise object positioning, achieving IoU scores above 0.9 across multiple datasets.
[67] Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration
Yunghee Lee,Byeonghyun Pak,Junwha Hong,Hoseong Kim
Main category: cs.CV
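TL;DR: Proposes Tortoise and Hare Guidance (THG), a training-free method for accelerating diffusion sampling that reduces computation through a multirate ODE system, markedly improving efficiency while maintaining generation quality.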
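The multirate idea in the TL;DR above can be illustrated with a toy sampler that evaluates the conditional noise estimate at every step but refreshes the guidance difference only on a coarse grid. This is a deliberately simplified sketch, not the THG solver: the Euler update, the reuse-the-last-difference rule, the parameter `coarse_every`, and the toy noise functions are all assumptions made for illustration.

```python
import numpy as np

def sample(x, eps_cond, eps_uncond, timesteps, w=5.0, coarse_every=1):
    """Toy CFG sampler; returns the final state and the number of function evaluations."""
    nfe = 0
    delta = None
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        e_c = eps_cond(x, t)                      # fine grid: evaluated every step
        nfe += 1
        if i % coarse_every == 0:                 # coarse grid: refresh guidance term
            delta = e_c - eps_uncond(x, t)
            nfe += 1
        eps = e_c + (w - 1.0) * delta             # classifier-free guidance combination
        x = x + (t_next - t) * eps                # plain Euler step
    return x, nfe

# Toy noise predictors whose difference is constant, so reusing it is exact here.
eps_cond = lambda x, t: -x + 1.0
eps_uncond = lambda x, t: -x

timesteps = np.linspace(1.0, 0.0, 21)             # 20 integration steps
x0 = np.ones(4)
x_full, nfe_full = sample(x0, eps_cond, eps_uncond, timesteps, coarse_every=1)
x_thg, nfe_thg = sample(x0, eps_cond, eps_uncond, timesteps, coarse_every=4)
```

In this toy setting the coarse schedule needs 25 evaluations instead of 40 and reproduces the same result because the guidance difference never changes; with a real model the difference drifts between refreshes, which is why the abstract pairs the coarse grid with an error-bound analysis and a guidance-scale scheduler.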
Details
Motivation: Existing diffusion models require many function evaluations during sampling, which makes inference slow, and conventional solvers do not exploit the differing error sensitivity of the separate branches to save computation.
Method: Reformulates the classifier-free guidance (CFG) ODE as a multirate system and finds that the additional guidance term is more robust to approximation; building on this, THG updates the noise estimate on a fine-grained grid (the tortoise equation) and the additional guidance on a coarse grid (the hare equation), and introduces an error-aware timestep sampler and a guidance-scale scheduler.
Result: Reduces the number of function evaluations (NFE) by up to 30% with virtually no loss in generation quality (ΔImageReward ≤ 0.032) and outperforms state-of-the-art training-free CFG accelerators under the same computation budget.
Conclusion: The multirate formulation helps improve the sampling efficiency of diffusion models, and THG offers an effective route to high-quality real-time image generation without retraining.
Abstract: In this paper, we propose Tortoise and Hare Guidance (THG), a training-free strategy that accelerates diffusion sampling while maintaining high-fidelity generation. We demonstrate that the noise estimate and the additional guidance term exhibit markedly different sensitivity to numerical error by reformulating the classifier-free guidance (CFG) ODE as a multirate system of ODEs. Our error-bound analysis shows that the additional guidance branch is more robust to approximation, revealing substantial redundancy that conventional solvers fail to exploit. Building on this insight, THG significantly reduces the computation of the additional guidance: the noise estimate is integrated with the tortoise equation on the original, fine-grained timestep grid, while the additional guidance is integrated with the hare equation only on a coarse grid. We also introduce (i) an error-bound-aware timestep sampler that adaptively selects step sizes and (ii) a guidance-scale scheduler that stabilizes large extrapolation spans. THG reduces the number of function evaluations (NFE) by up to 30% with virtually no loss in generation fidelity ($\Delta$ImageReward $\leq$ 0.032) and outperforms state-of-the-art CFG-based training-free accelerators under identical computation budgets. Our findings highlight the potential of multirate formulations for diffusion solvers, paving the way for real-time high-quality image synthesis without any model retraining. The source code is available at https://github.com/yhlee-add/THG.
[68] Text to Sketch Generation with Multi-Styles
Tengjie Li,Shikui Tu,Lei Xu
Main category: cs.CV
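TL;DR: Proposes a training-free diffusion-model framework that achieves precise sketch style control through text prompts and reference style sketches, effectively reducing content leakage and supporting controllable multi-style generation.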
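The multi-style control in the TL;DR above builds on AdaIN, which re-normalizes content features to carry a style's per-channel statistics. The sketch below shows standard AdaIN plus a simple weighted blend over several references. The blend in `multi_style_adain` is only one plausible reading of a "joint" module and is an assumption; the paper's actual joint AdaIN design is not specified in this digest.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Standard AdaIN on (C, H, W) features: give content the style's channel mean and std."""
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mu) / c_std + s_mu

def multi_style_adain(content, styles, weights):
    """Hypothetical multi-style variant: weighted sum of per-style AdaIN outputs."""
    return sum(w * adain(content, s) for w, s in zip(weights, styles))

rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(4, 16, 16))
style_a = rng.normal(3.0, 2.0, size=(4, 16, 16))
style_b = rng.normal(-1.0, 0.5, size=(4, 16, 16))

out = adain(content, style_a)
blend = multi_style_adain(content, [style_a, style_b], [0.5, 0.5])
```

After `adain`, each output channel has the mean and (up to `eps`) the standard deviation of the style features, while the spatial pattern still comes from the content; the blend's channel means land halfway between the two styles.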
Details
Motivation: Existing sketch generation methods lack fine-grained control over style and struggle to transfer diverse styles while keeping content independent.
Method: Builds on diffusion models, introduces reference features as auxiliary information, applies linear smoothing and a style-content guidance mechanism, and supports multi-style fusion through a joint AdaIN module.
Result: Experiments show that the method outperforms existing approaches in style alignment accuracy, generation quality, and cases with low structural similarity.
Conclusion: The framework delivers high-quality, highly flexible style-controlled sketch generation without training and has good application potential.
Abstract: Recent advances in vision-language models have facilitated progress in sketch generation. However, existing specialized methods primarily focus on generic synthesis and lack mechanisms for precise control over sketch styles. In this work, we propose a training-free framework based on diffusion models that enables explicit style guidance via textual prompts and referenced style sketches. Unlike previous style transfer methods that overwrite key and value matrices in self-attention, we incorporate the reference features as auxiliary information with linear smoothing and leverage a style-content guidance mechanism. This design effectively reduces content leakage from reference sketches and enhances synthesis quality, especially in cases with low structural similarity between reference and target sketches. Furthermore, we extend our framework to support controllable multi-style generation by integrating features from multiple reference sketches, coordinated via a joint AdaIN module. Extensive experiments demonstrate that our approach achieves high-quality sketch generation with accurate style alignment and improved flexibility in style control. The official implementation of M3S is available at https://github.com/CMACH508/M3S.
[69] Automated Tennis Player and Ball Tracking with Court Keypoints Detection (Hawk Eye System)
Venkata Manikanta Desu,Syed Fawaz Ali
Main category: cs.CV
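TL;DR: Proposes a deep-learning pipeline for automated tennis match analysis that detects and tracks players, the ball, and court keypoints in real time and produces detailed performance analytics.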
Details
Motivation: To provide an automated, accurate, real-time method for analysing tennis matches that helps coaches, broadcasters, and players better understand match dynamics. Method: Combines YOLOv8 for player detection, a custom-trained YOLOv5 for ball tracking, and a ResNet50-based model for court keypoint detection into a complete analysis pipeline. Result: The system is robust across court conditions and match scenarios and outputs annotated video and detailed performance metrics such as player movement patterns, ball speed, shot accuracy, and reaction time. Conclusion: The framework offers an efficient, scalable automated analysis solution for tennis with practical value. Abstract: This study presents a complete pipeline for automated tennis match analysis. Our framework integrates multiple deep learning models to detect and track players and the tennis ball in real time, while also identifying court keypoints for spatial reference. Using YOLOv8 for player detection, a custom-trained YOLOv5 model for ball tracking, and a ResNet50-based architecture for court keypoint detection, our system provides detailed analytics including player movement patterns, ball speed, shot accuracy, and player reaction times. The experimental results demonstrate robust performance in varying court conditions and match scenarios. The model outputs an annotated video along with detailed performance metrics, enabling coaches, broadcasters, and players to gain actionable insights into the dynamics of the game.
[70] DMSORT: An efficient parallel maritime multi-object tracking architecture for unmanned vessel platforms
Shengyu Tang,Zeyuan Lu,Jiazhi Dong,Changdong Yu,Xiaoyu Wang,Yaohui Lyu,Weihao Xia
Main category: cs.CV
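TL;DR: Proposes DMSORT, an efficient dual-branch maritime multi-object tracking method that combines object detection, re-identification, and camera motion estimation to significantly improve tracking in complex sea conditions.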
Details
Motivation: Complex maritime environments cause camera motion and visual degradation, which challenge existing maritime multi-object tracking methods. Method: A dual-branch parallel tracking framework: one branch uses a Reversible Columnar Detection Network (RCDN) and a lightweight Transformer-based appearance extractor (Li-TAE) for detection and re-identification; the other estimates platform motion through a projective transformation and compensates for it within the Kalman filter to stabilise trajectories; a clustering-optimized feature fusion module then fuses motion and appearance information. Result: Achieves state-of-the-art performance on the Singapore Maritime Dataset and the fastest runtime among existing ReID-based methods, with high identity consistency and robustness to jitter and occlusion. Conclusion: DMSORT effectively addresses tracking degradation caused by camera motion at sea, balances accuracy, speed, and robustness, and suits practical maritime surveillance and navigation safety applications. Abstract: Accurate perception of the marine environment through robust multi-object tracking (MOT) is essential for ensuring safe vessel navigation and effective maritime surveillance. However, the complicated maritime environment often causes camera motion and subsequent visual degradation, posing significant challenges to MOT. To address this challenge, we propose an efficient Dual-branch Maritime SORT (DMSORT) method for maritime MOT. The core of the framework is a parallel tracker with affine compensation, which incorporates an object detection and re-identification (ReID) branch, along with a dedicated branch for dynamic camera motion estimation. Specifically, a Reversible Columnar Detection Network (RCDN) is integrated into the detection module to leverage multi-level visual features for robust object detection. Furthermore, a lightweight Transformer-based appearance extractor (Li-TAE) is designed to capture global contextual information and generate robust appearance features. Another branch decouples platform-induced and target-intrinsic motion by constructing a projective transformation, applying platform-motion compensation within the Kalman filter, and thereby stabilizing true object trajectories. Finally, a clustering-optimized feature fusion module effectively combines motion and appearance cues to ensure identity consistency under noise, occlusion, and drift. Extensive evaluations on the Singapore Maritime Dataset demonstrate that DMSORT achieves state-of-the-art performance.
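As a rough illustration of the platform-motion compensation idea in this abstract, the sketch below warps a predicted bounding box with a 3x3 projective transformation before it would be matched against new detections. The function names `warp_point` and `compensate_box`, the box format, and the example matrix are assumptions made for illustration; the abstract does not say how DMSORT estimates the transformation or how it is applied inside the Kalman filter.

```python
def warp_point(H, x, y):
    """Apply a 3x3 projective transformation H (row-major nested lists) to (x, y)."""
    xw = H[0][0] * x + H[0][1] * y + H[0][2]
    yw = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xw / w, yw / w


def compensate_box(H, box):
    """Warp the four corners of (x1, y1, x2, y2) and return the enclosing axis-aligned box."""
    x1, y1, x2, y2 = box
    corners = [warp_point(H, x, y) for x, y in ((x1, y1), (x2, y1), (x2, y2), (x1, y2))]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical example: platform motion shifted the scene by (+5, -3) pixels between frames.
H_shift = [[1.0, 0.0, 5.0], [0.0, 1.0, -3.0], [0.0, 0.0, 1.0]]
predicted_box = (10.0, 20.0, 30.0, 40.0)
print(compensate_box(H_shift, predicted_box))
```

Warping all four corners rather than two keeps the result valid when the transformation includes rotation or perspective, at the cost of a slightly looser box.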
Notably, DMSORT attains the fastest runtime among existing ReID-based MOT frameworks while maintaining high identity consistency and robustness to jitter and occlusion. Code is available at: https://github.com/BiscuitsLzy/DMSORT-An-efficient-parallel-maritime-multi-object-tracking-architecture-.
[71] Learning from Online Videos at Inference Time for Computer-Use Agents
Yujian Liu,Ze Wang,Hao Chen,Ximeng Sun,Xiaodong Yu,Jialian Wu,Jiang Liu,Emad Barsoum,Zicheng Liu,Shiyu Chang
Main category: cs.CV
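TL;DR: Proposes a framework that lets computer-use agents learn from online videos at inference time by retrieving and filtering tutorial videos, converting them into structured demonstration trajectories, and dynamically selecting the most relevant local guidance to improve task execution.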
Details
Motivation: Computer-use agents still lag behind humans on tasks that require domain-specific procedural knowledge, whereas humans can learn quickly by watching video tutorials; it is therefore worth studying how agents can make effective use of online videos at inference time. Method: A framework comprising video retrieval and filtering, conversion of videos into structured demonstration trajectories, and dynamic selection of trajectories as in-context guidance; a vision-language model (VLM) infers UI actions, segments videos into short action sequences, and assigns each subsequence a textual objective, and a two-stage selection mechanism picks the best trajectory at each step. Result: On two widely used benchmarks the framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts, confirming the importance of trajectory segmentation and selection, action filtering, and visual information. Conclusion: Online videos can be systematically distilled into actionable guidance that significantly improves computer-use agents at inference time, offering a new way for agents to acquire procedural knowledge. Abstract: Computer-use agents can operate computers and automate laborious tasks, but despite recent rapid progress, they still lag behind human users, especially when tasks require domain-specific procedural knowledge about particular applications, platforms, and multi-step workflows. Humans can bridge this gap by watching video tutorials: we search, skim, and selectively imitate short segments that match our current subgoal. In this paper, we study how to enable computer-use agents to learn from online videos at inference time effectively. We propose a framework that retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance during execution. Particularly, using a VLM, we infer UI actions, segment videos into short subsequences of actions, and assign each subsequence a textual objective. At inference time, a two-stage selection mechanism dynamically chooses a single trajectory to add in context at each step, focusing the agent on the most helpful local guidance for its next decision. Experiments on two widely used benchmarks show that our framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts. Analyses highlight the importance of trajectory segmentation and selection, action filtering, and visual information, suggesting that abundant online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time.
Our code is available at https://github.com/UCSB-NLP-Chang/video_demo.
[72] Seeing Straight: Document Orientation Detection for Efficient OCR
Suranjan Goswami,Abhinav Ravi,Raja Kolla,Ali Faraz,Shaharukh Khan,Akash,Chandra Khatri,Shubham Agarwal
Main category: cs.CV
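TL;DR: Introduces OCR-Rotation-Bench (ORB), a benchmark for evaluating OCR robustness to image rotation, and builds a fast, robust, and lightweight rotation classification pipeline on the Phi-3.5-Vision model that reaches 96% and 92% accuracy on ORB-En and ORB-Indic, respectively, and significantly improves OCR performance.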
Details
Motivation: In real-world settings, document images are often rotated because of user errors such as an incorrect camera orientation during capture, which hurts downstream tasks such as OCR, so effective rotation correction is needed. Method: Introduces the OCR-Rotation-Bench (ORB) benchmark, with multilingual data covering English and 11 Indic languages; builds a lightweight rotation classifier with dynamic cropping on the Phi-3.5-Vision vision encoder, fine-tuned standalone for four-class rotation recognition. Result: Reaches 96% and 92% rotation classification accuracy on ORB-En and ORB-Indic, respectively; in a simulated real-world setting it improves closed-source OCR models by up to 14% and open-weights models by up to 4x. Conclusion: Accurate rotation correction is critical for OCR; the benchmark and lightweight model address rotation recognition for multilingual documents and significantly strengthen OCR robustness. Abstract: Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in real-world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR), where misalignment commonly arises due to user errors, particularly incorrect base orientations of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 Indic mid to low-resource languages. We also present a fast, robust and lightweight rotation classification pipeline built on the vision encoder of the Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for the 4-class rotation task in a standalone fashion. Our method achieves 96% and 92% accuracy in identifying rotations on ORB-En and ORB-Indic, respectively. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance: closed-source (up to 14%) and open-weights models (up to 4x) in the simulated real-world setting.
[73] Systematic Evaluation of Preprocessing Techniques for Accurate Image Registration in Digital Pathology
Fatemehzahra Darzi,Rodrigo Escobar Diaz Guerrero,Thomas Bocklitz
Main category: cs.CV
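TL;DR: Studies how different color transformation techniques affect registration between H&E-stained images and non-linear multimodal images, and finds that CycleGAN color transformation significantly reduces registration error in both scenarios, indicating that color transformation during preprocessing can improve cross-modal registration accuracy in digital pathology.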
Details
Motivation: To improve registration accuracy between images of different modalities (such as different staining methods) in digital pathology, in support of applications such as biomarker analysis and tissue reconstruction. Method: Using 20 tissue sample pairs, compares the color transformation methods CycleGAN, Macenko, Reinhard, and Vahadane, combined with preprocessing such as image inversion and contrast adjustment; applies VALIS for rigid and non-rigid registration; evaluation uses MMrTRE and AMrTRE, both derived from the relative Target Registration Error (rTRE), plus a point-based evaluation on 10 manually selected key points. Result: CycleGAN color transformation achieves the lowest registration error in both scenarios (original and inverted multimodal images), ahead of the other methods; all color transformation methods improve registration over unprocessed images. Conclusion: Applying color transformation (especially CycleGAN) before registration effectively improves alignment of pathology images from different modalities and helps make multimodal image fusion and analysis in digital pathology more reliable. Abstract: Image registration refers to the process of spatially aligning two or more images by mapping them into a common coordinate system, so that corresponding anatomical or tissue structures are matched across images. In digital pathology, registration enables direct comparison and integration of information from different stains or imaging modalities, supporting applications such as biomarker analysis and tissue reconstruction. Accurate registration of images from different modalities is an essential step in digital pathology. In this study, we investigated how various color transformation techniques affect image registration between hematoxylin and eosin (H&E) stained images and non-linear multimodal images. We used a dataset of 20 tissue sample pairs, with each pair undergoing several preprocessing steps, including different color transformations (CycleGAN, Macenko, Reinhard, Vahadane), inversion, contrast adjustment, intensity normalization, and denoising. All images were registered using the VALIS registration method, which first applies rigid registration and then performs non-rigid registration in two steps on both low and high-resolution images. Registration performance was evaluated using the relative Target Registration Error (rTRE). We reported the median of median rTRE values (MMrTRE) and the average of median rTRE values (AMrTRE) for each method. In addition, we performed a custom point-based evaluation using ten manually selected key points. Registration was done separately for two scenarios, using either the original or inverted multimodal images.
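To make the two summary statistics concrete, the following minimal sketch computes rTRE per landmark and then MMrTRE and AMrTRE over a set of image pairs. It assumes the common convention of normalizing the landmark error by the image diagonal, which the abstract does not spell out; the function names and toy numbers are illustrative only and are not taken from the paper.

```python
import math
from statistics import mean, median


def rtre(warped_pts, target_pts, image_shape):
    """Relative TRE per landmark: Euclidean error divided by the image diagonal (assumed convention)."""
    height, width = image_shape
    diagonal = math.hypot(height, width)
    return [math.dist(p, q) / diagonal for p, q in zip(warped_pts, target_pts)]


def summarize(per_pair_rtre):
    """Take the median rTRE of each image pair, then the median (MMrTRE) and mean (AMrTRE) across pairs."""
    medians = [median(errors) for errors in per_pair_rtre]
    return {"MMrTRE": median(medians), "AMrTRE": mean(medians)}


# Toy example with three image pairs (made-up rTRE values).
scores = summarize([[0.1, 0.3], [0.2], [0.9]])
print(scores)
```

Because AMrTRE averages the per-pair medians, a single badly registered pair raises it noticeably, while MMrTRE stays close to the typical pair; reporting both shows whether errors are uniform or driven by outliers.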
In both scenarios, CycleGAN color transformation achieved the lowest registration errors, while the other methods showed higher errors. These findings show that applying color transformation before registration improves alignment between images from different modalities and supports more reliable analysis in digital pathology.
[74] Covariance Descriptors Meet General Vision Encoders: Riemannian Deep Learning for Medical Image Classification
Josef Mayr,Anna Reithmeir,Maxime Di Folco,Julia A. Schnabel
Main category: cs.CV
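TL;DR: Examines covariance descriptors built from features of pre-trained vision encoders (such as DINOv2 and MedSAM) for medical image classification, evaluated with SPDNet, a network designed for symmetric positive definite matrices. Covariance descriptors based on deep features outperform handcrafted descriptors, and DINOv2 combined with SPDNet achieves state-of-the-art performance.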
Details
Motivation: Covariance descriptors perform well in general computer vision but remain underexplored in medical imaging. The paper assesses their potential for conventional and learning-based medical image classification, particularly in combination with SPDNet, a network for symmetric positive definite matrices. Method: Features are extracted from pre-trained general vision encoders (GVEs, namely DINOv2 and MedSAM) to build covariance descriptors, which are compared with handcrafted descriptors; SPDNet and other methods are evaluated on 11 binary and multi-class datasets from the MedMNIST benchmark. Result: Covariance descriptors based on GVE features consistently outperform handcrafted features; DINOv2 combined with SPDNet performs best and surpasses existing methods. Conclusion: Combining powerful pre-trained vision encoders with covariance descriptors can significantly improve medical image classification and shows the promise of this direction. Abstract: Covariance descriptors capture second-order statistics of image features. They have shown strong performance in general computer vision tasks, but remain underexplored in medical imaging. We investigate their effectiveness for both conventional and learning-based medical image classification, with a particular focus on SPDNet, a classification network specifically designed for symmetric positive definite (SPD) matrices. We propose constructing covariance descriptors from features extracted by pre-trained general vision encoders (GVEs) and comparing them with handcrafted descriptors. Two GVEs - DINOv2 and MedSAM - are evaluated across eleven binary and multi-class datasets from the MedMNIST benchmark. Our results show that covariance descriptors derived from GVE features consistently outperform those derived from handcrafted features. Moreover, SPDNet yields superior performance to state-of-the-art methods when combined with DINOv2 features. Our findings highlight the potential of combining covariance descriptors with powerful pretrained vision encoders for medical image analysis.
[75] AStF: Motion Style Transfer via Adaptive Statistics Fusor
Hanmo Chen,Chenghao Xu,Jiexi Yan,Cheng Deng
Main category: cs.CV
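TL;DR: Proposes the Adaptive Statistics Fusor (AStF), which adds skewness and kurtosis to improve motion style transfer and captures dynamic patterns and spatiotemporal coherence better than traditional methods that use only mean and variance.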
Details
Motivation: Because images and motion data differ fundamentally, mean and variance alone cannot fully capture the complex dynamic patterns and spatiotemporal coherence of motion data, so higher-order statistics are needed to improve motion style transfer. Method: Proposes the Adaptive Statistics Fusor (AStF), comprising a Style Disentanglement Module (SDM) and High-Order Multi-Statistics Attention (HOS-Attn), trained together with a Motion Consistency Regularization (MCR) discriminator. Result: Experiments show that AStF outperforms state-of-the-art methods in motion style transfer and models the spatiotemporal statistical patterns of dynamic styles more comprehensively. Conclusion: By introducing higher-order statistics such as skewness and kurtosis, AStF improves the realism and quality of motion style transfer and offers a new direction for motion style modelling. Abstract: Human motion style transfer allows characters to appear less rigid and more realistic in a specific style. Traditional arbitrary image style transfer typically processes mean and variance, which has proved effective. Meanwhile, similar methods have been adapted for motion style transfer. However, due to the fundamental differences between images and motion, relying on mean and variance is insufficient to fully capture the complex dynamic patterns and spatiotemporal coherence properties of motion data. Building upon this, our key insight is to bring two more coefficients, skewness and kurtosis, into the analysis of motion style. Specifically, we propose a novel Adaptive Statistics Fusor (AStF) which consists of a Style Disentanglement Module (SDM) and High-Order Multi-Statistics Attention (HOS-Attn). We trained our AStF in conjunction with a Motion Consistency Regularization (MCR) discriminator. Experimental results show that, by providing a more comprehensive model of the spatiotemporal statistical patterns inherent in dynamic styles, our proposed AStF outperforms state-of-the-art methods in motion style transfer. Our code and model are available at https://github.com/CHMimilanlan/AStF.
[76] MedSapiens: Taking a Pose to Rethink Medical Imaging Landmark Detection
Marawan Elbatel,Anbang Wang,Keyuan Liu,Kaouther Mouheb,Enrique Almar-Munoz,Lizhuo Lin,Yanqi Yang,Karim Lekadir,Xiaomeng Li
Main category: cs.CV
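TL;DR: Adapts Sapiens, a human-centric foundation model, to anatomical landmark detection in medical images; the resulting MedSapiens model, built with multi-dataset pretraining, sets a new state of the art on multiple datasets and performs well in few-shot settings.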
Details
Motivation: Anatomical landmark detection has traditionally relied on domain-specific models, while large-scale vision foundation models open new opportunities. The authors explore the potential of human pose estimation models in medical imaging, a direction that has been largely overlooked. Method: Starting from the human-centric foundation model Sapiens, a multi-dataset pretraining strategy adapts it to anatomical landmark detection in medical images, yielding MedSapiens, which is evaluated on multiple datasets and in few-shot settings. Result: In average success detection rate (SDR), MedSapiens improves on generalist models by up to 5.26% and on specialist models by up to 21.81%; in few-shot settings it improves on the state of the art by 2.69%. Conclusion: Human-centric foundation models provide strong priors for anatomical landmark detection; MedSapiens taps this potential and provides an efficient, transferable new baseline for medical image analysis. Abstract: This paper does not introduce a novel architecture; instead, it revisits a fundamental yet overlooked baseline: adapting human-centric foundation models for anatomical landmark detection in medical imaging. While landmark detection has traditionally relied on domain-specific models, the emergence of large-scale pre-trained vision models presents new opportunities. In this study, we investigate the adaptation of Sapiens, a human-centric foundation model designed for pose estimation, to medical imaging through multi-dataset pretraining, establishing a new state of the art across multiple datasets. Our proposed model, MedSapiens, demonstrates that human-centric foundation models, inherently optimized for spatial pose localization, provide strong priors for anatomical landmark detection, yet this potential has remained largely untapped. We benchmark MedSapiens against existing state-of-the-art models, achieving up to 5.26% improvement over generalist models and up to 21.81% improvement over specialist models in the average success detection rate (SDR). To further assess the adaptability of MedSapiens to novel downstream tasks with few annotations, we evaluate its performance in limited-data settings, achieving 2.69% improvement over the few-shot state of the art in SDR. Code and model weights are available at https://github.com/xmed-lab/MedSapiens.
[77] Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery
Claudio Giusti,Luca Guarnera,Sebastiano Battiato
Main category: cs.CV
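TL;DR: Proposes Proto-LeakNet, an interpretable attribution framework for AI-generated images and deepfakes that exploits signal-leak traces in the latent space of diffusion models to identify both known and unseen generators.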
Details
Motivation: As synthetic image and deepfake techniques advance, verifying image authenticity becomes increasingly important. Existing methods struggle with unseen generative models, so an attribution method that generalises to unknown generators without retraining is needed. Method: Proto-LeakNet operates in the latent space of diffusion models and re-simulates part of the forward diffusion process to expose generator-specific residual cues. A temporal attention encoder aggregates multi-step latent features, and a feature-weighted prototype head builds an interpretable embedding space; closed-set classification is combined with density-based open-set evaluation to analyse unseen generators. Result: Trained only on closed-set data, the method reaches a Macro AUC of 98.13%, surpasses state-of-the-art methods, remains robust under post-processing, and separates known from unseen generators effectively. Conclusion: Modelling signal-leak bias in latent space yields a reliable and interpretable solution for AI-generated image and deepfake detection, with good generalisation and practical prospects. Abstract: The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Operating in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability between known and unseen generators. These results demonstrate that modeling signal-leak bias in latent space enables reliable and interpretable AI-image and deepfake forensics. The code for the whole work will be available upon submission.
[78] DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification
Yujie Yang,Shuang Li,Jun Ye,Neng Dong,Fan Li,Huafeng Li
Main category: cs.CV
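TL;DR: Proposes DinoGRL, a DINOv2-driven gait representation learning framework whose SASGL model and PBMGE module strengthen spatiotemporal consistency and feature discriminability for cross-modal video person re-identification.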
Details
Motivation: Existing methods overlook gait features, which are modality-invariant and rich in temporal dynamics, and this limits their ability to model spatiotemporal consistency in cross-modal video matching. Method: Proposes the DinoGRL framework, comprising a Semantic-Aware Silhouette and Gait Learning (SASGL) model and a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which uses DINOv2 priors to jointly optimise gait and appearance features. Result: Significantly outperforms existing state-of-the-art methods on the HITSZ-VCM and BUPT datasets, confirming the effectiveness of the approach. Conclusion: By fusing DINOv2 priors with gait features, DinoGRL improves video-based visible-infrared person re-identification. Abstract: Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.
[79] FastGS: Training 3D Gaussian Splatting in 100 Seconds
Shiwei Ren,Tianci Wen,Yongchun Fang,Biao Lu
Main category: cs.CV
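TL;DR: Proposes FastGS, an acceleration framework that uses densification and pruning based on multi-view consistency to substantially speed up 3D Gaussian splatting training while maintaining rendering quality.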
Details
Motivation: Existing 3D Gaussian splatting acceleration methods cannot effectively control the number of Gaussians during training, leading to redundant computation and excessive time overhead. Method: Designs a Gaussian importance measure based on multi-view consistency and proposes a densification and pruning strategy that needs no budgeting mechanism. Result: Achieves significant training speedups on several datasets, for example 3.32x faster than DashGaussian on Mip-NeRF 360 and 15.45x faster than vanilla 3DGS on Deep Blending, and 2-7x acceleration across a range of tasks. Conclusion: FastGS is an efficient, general acceleration framework for 3D Gaussian splatting that greatly shortens training time while maintaining rendering quality. Abstract: The dominant 3D Gaussian splatting (3DGS) acceleration methods fail to properly regulate the number of Gaussians during training, causing redundant computational time overhead. In this paper, we propose FastGS, a novel, simple, and general acceleration framework that fully considers the importance of each Gaussian based on multi-view consistency, efficiently solving the trade-off between training time and rendering quality. We innovatively design a densification and pruning strategy based on multi-view consistency, dispensing with the budgeting mechanism. Extensive experiments on Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that our method significantly outperforms the state-of-the-art methods in training speed, achieving a 3.32$\times$ training acceleration and comparable rendering quality compared with DashGaussian on the Mip-NeRF 360 dataset and a 15.45$\times$ acceleration compared with vanilla 3DGS on the Deep Blending dataset. We demonstrate that FastGS exhibits strong generality, delivering 2-7$\times$ training acceleration across various tasks, including dynamic scene reconstruction, surface reconstruction, sparse-view reconstruction, large-scale reconstruction, and simultaneous localization and mapping. The project page is available at https://fastgs.github.io/
[80] Vision Foundation Models in Agriculture: Toward Domain-Specific Adaptation for Weed Herbicide Trials Assessment
Leire Benito-Del-Valle,Artzai Picón,Daniel Mugica,Manuel Ramos,Eva Portillo,Javier Romero,Carlos Javier Jimenez,Ramón Navarra-Mestre
Main category: cs.CV
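TL;DR: Presents a domain-specific vision foundation model for herbicide field trials, trained with self-supervised learning on a large agricultural dataset, that significantly improves plant species identification and herbicide damage classification, performs especially well under domain shift such as new environments and drone imagery, and greatly reduces annotation requirements.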
Details
Motivation: General-purpose vision foundation models struggle with the fine-grained distinctions between species and damage types in agriculture and cannot meet the high-precision recognition needs of herbicide field trials. Method: A general-purpose vision foundation model is adapted to the domain through self-supervised learning on a large, curated agricultural dataset, so that it learns rich, transferable representations suited to herbicide trial images. Result: The domain-specific model clearly outperforms the general-purpose model in species identification (F1 from 0.91 to 0.94) and damage classification (from 0.26 to 0.33); gains are larger under new environments and times; the advantage holds under domain shift such as drone imagery; segmentation accuracy is also higher in low-annotation regimes, and with only 20% of the labeled data the model surpasses the general-purpose one. Conclusion: Domain-specific foundation models generalise better and can substantially reduce manual annotation cost, providing a scalable, automated analysis solution for herbicide field trials. Abstract: Herbicide field trials require accurate identification of plant species and assessment of herbicide-induced damage across diverse environments. While general-purpose vision foundation models have shown promising results in complex visual domains, their performance can be limited in agriculture, where fine-grained distinctions between species and damage types are critical. In this work, we adapt a general-purpose vision foundation model to herbicide trial characterization. Trained using a self-supervised learning approach on a large, curated agricultural dataset, the model learns rich and transferable representations optimized for herbicide trial images. Our domain-specific model significantly outperforms the best general-purpose foundation model in both species identification (F1 score improvement from 0.91 to 0.94) and damage classification (from 0.26 to 0.33). Under unseen conditions (new locations and times), it achieves even greater gains (species identification from 0.56 to 0.66; damage classification from 0.17 to 0.27). In domain-shift scenarios, such as drone imagery, it maintains strong performance (species classification from 0.49 to 0.60). Additionally, we show that domain-specific pretraining enhances segmentation accuracy, particularly in low-annotation regimes. An annotation-efficiency analysis reveals that, under unseen conditions, the domain-specific model achieves 5.4% higher F1 score than the general-purpose model, while using 80% fewer labeled samples.
These results demonstrate the generalization capabilities of domain-specific foundation models and their potential to significantly reduce manual annotation efforts, offering a scalable and automated solution for herbicide trial analysis.
[81] Deep learning-based object detection of offshore platforms on Sentinel-1 Imagery and the impact of synthetic training data
Robin Spanier,Thorsten Hoeser,Claudia Kuenzer
Main category: cs.CV
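TL;DR: Trains YOLOv10 models on a combination of synthetic and real Sentinel-1 satellite imagery to detect offshore infrastructure, and shows that synthetic data improves model performance and geographic transferability.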
Details
Motivation: Samples of offshore infrastructure are scarce and unevenly distributed across types, shapes, and sizes, so existing detection models are limited by data imbalance; better generalisation and more representative data are needed. Method: YOLOv10 deep learning object detection models are trained on a combination of synthetic and real Sentinel-1 satellite images and tested in three regions not used for training (Gulf of Mexico, North Sea, Persian Gulf) to assess geographic transferability. Result: The F1 score rises from 0.85 to 0.90; in total 3,529 offshore platforms are detected, including 411 in the North Sea, 1,519 in the Gulf of Mexico, and 1,593 in the Persian Gulf, confirming good generalisation. Conclusion: Synthetic data effectively mitigates class imbalance and improves model performance, supports scalable global monitoring of offshore infrastructure, and highlights the potential of deep learning and balanced datasets in remote sensing. Abstract: The recent and ongoing expansion of marine infrastructure, including offshore wind farms, oil and gas platforms, artificial islands, and aquaculture facilities, highlights the need for effective monitoring systems. The development of robust models for offshore infrastructure detection relies on comprehensive, balanced datasets, but falls short when samples are scarce, particularly for underrepresented object classes, shapes, and sizes. By training deep learning-based YOLOv10 object detection models with a combination of synthetic and real Sentinel-1 satellite imagery acquired in the fourth quarter of 2023 from four regions (Caspian Sea, South China Sea, Gulf of Guinea, and Coast of Brazil), this study investigates the use of synthetic training data to enhance model performance. We evaluated this approach by applying the model to detect offshore platforms in three unseen regions (Gulf of Mexico, North Sea, Persian Gulf) and thereby assess geographic transferability. This region-holdout evaluation demonstrated that the model generalises beyond the training areas. In total, 3,529 offshore platforms were detected, including 411 in the North Sea, 1,519 in the Gulf of Mexico, and 1,593 in the Persian Gulf. The model achieved an F1 score of 0.85, which improved to 0.90 upon incorporating synthetic data. We analysed how synthetic data enhances the representation of unbalanced classes and overall model performance, taking a first step toward globally transferable detection of offshore infrastructure.
This study underscores the importance of balanced datasets and highlights synthetic data generation as an effective strategy to address common challenges in remote sensing, demonstrating the potential of deep learning for scalable, global offshore infrastructure monitoring.
[82] RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation
Xiangjun Zhang,Litong Gong,Yinglin Zheng,Yansong Liu,Wentao Jiang,Mingyi Xu,Biao Wang,Tiezheng Ge,Ming Zeng
Main category: cs.CV
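TL;DR: Proposes RISE-T2V, a framework that integrates prompt rephrasing and semantic feature extraction to improve how text-to-video models understand user intent and the quality of what they generate.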
Details
Motivation: Existing T2V models depend on pre-trained text encoders, which understand short prompts poorly and cannot rephrase prompts online to better match user intent, limiting scalability and usability. Method: Proposes RISE-T2V, which introduces a Rephrasing Adapter that merges prompt rephrasing and semantic feature extraction into one step, uses the LLM's hidden states during next-token prediction as the condition for video generation, and supports a variety of pre-trained LLMs and video diffusion models. Result: Experiments show that RISE-T2V applies to different video diffusion model architectures, significantly improves generation quality and semantic alignment, and supports a broader range of T2V tasks. Conclusion: By unifying prompt rephrasing and feature extraction, RISE-T2V strengthens T2V models' understanding of user intent and improves the quality and consistency of generated videos, with good generality and application prospects. Abstract: Most text-to-video (T2V) diffusion models depend on pre-trained text encoders for semantic alignment, yet they often fail to maintain video quality when provided with concise prompts rather than well-designed ones. The primary issue lies in their limited understanding of textual semantics. Moreover, these text encoders cannot rephrase prompts online to better align with user intentions, which limits both the scalability and usability of the models. To address these challenges, we introduce RISE-T2V, which uniquely integrates the processes of prompt rephrasing and semantic feature extraction into a single and seamless step instead of two separate steps. RISE-T2V is universal and can be applied to various pre-trained LLMs and video diffusion models (VDMs), significantly enhancing their capabilities for T2V tasks. We propose an innovative module called the Rephrasing Adapter, enabling diffusion models to utilize text hidden states during the next token prediction of the LLM as a condition for video generation. By employing a Rephrasing Adapter, the video generation model can implicitly rephrase basic prompts into more comprehensive representations that better match the user's intent. Furthermore, we leverage the powerful capabilities of LLMs to enable video generation models to accomplish a broader range of T2V tasks. Extensive experiments demonstrate that RISE-T2V is a versatile framework applicable to different video diffusion model architectures, significantly enhancing the ability of T2V models to generate high-quality videos that align with user intent.
Visual results are available on the webpage at https://rise-t2v.github.io.
[83] Submanifold Sparse Convolutional Networks for Automated 3D Segmentation of Kidneys and Kidney Tumours in Computed Tomography
Saúl Alonso-Monsalve,Leigh H. Whitehead,Adam Aurisano,Lorena Escudero Sanchez
Main category: cs.CV
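TL;DR: Proposes a two-stage method based on voxel sparsification and submanifold sparse convolutional networks for automated tumour segmentation in high-resolution 3D medical images, maintaining accuracy while significantly reducing computational cost.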
Details
Motivation: Conventional convolutional neural networks are computationally expensive and memory-hungry on high-resolution 3D medical images, which limits their clinical use. Method: A two-stage approach: voxel sparsification first, then segmentation with submanifold sparse convolutional networks, supporting high-resolution inputs and a native 3D model architecture. Result: On the KiTS23 kidney cancer CT dataset the method performs on par with the challenge winners, with Dice coefficients of 95.8% for kidneys + masses, 85.7% for tumours + cysts, and 80.3% for tumours alone; inference time falls by up to 60% and VRAM usage by up to 75%. Conclusion: The method maintains high accuracy while markedly improving computational efficiency, making it suitable for automated tumour segmentation of high-resolution 3D medical images in clinical settings. Abstract: The accurate delineation of tumours in radiological images like Computed Tomography is a very specialised and time-consuming task, and currently a bottleneck preventing quantitative analyses from being performed routinely in the clinical setting. For this reason, developing methods for the automated segmentation of tumours in medical imaging is of the utmost importance and has driven significant efforts in recent years. However, the impracticality of processing full 3D scans, given the large number of voxels to be analysed, usually requires downsampling such images or using patches thereof when applying traditional convolutional neural networks. To overcome this problem, in this paper we propose a new two-stage methodology that combines voxel sparsification with submanifold sparse convolutional networks. This method allows segmentations to be performed with high-resolution inputs and a native 3D model architecture, obtaining state-of-the-art accuracies while significantly reducing the computational resources needed in terms of GPU memory and time. We studied the deployment of this methodology in the context of Computed Tomography images of renal cancer patients from the KiTS23 challenge, and our method achieved results competitive with the challenge winners, with Dice similarity coefficients of 95.8% for kidneys + masses, 85.7% for tumours + cysts, and 80.3% for tumours alone.
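For readers unfamiliar with the metric quoted here, the sketch below computes a Dice similarity coefficient from two sets of foreground voxel coordinates, a representation that fits naturally with sparse volumes. The function name `dice` and the toy voxel sets are illustrative assumptions; this is not the KiTS23 evaluation code.

```python
def dice(pred_voxels, true_voxels):
    """Dice similarity: 2 * |A intersect B| / (|A| + |B|) over sets of (z, y, x) foreground voxels."""
    pred, true = set(pred_voxels), set(true_voxels)
    if not pred and not true:
        return 1.0  # both masks empty: treated as perfect agreement by convention
    return 2 * len(pred & true) / (len(pred) + len(true))


# Toy example: two of three predicted voxels overlap the reference mask.
predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
reference = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
print(f"Dice = {dice(predicted, reference):.3f}")
```

Storing only foreground coordinates keeps memory proportional to the number of labelled voxels rather than to the full volume, which mirrors the motivation for sparsification in the abstract.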
Crucially, our method also offers significant computational improvements, achieving up to a 60% reduction in inference time and up to a 75% reduction in VRAM usage compared to an equivalent dense architecture, across both CPU and various GPU cards tested.
[84] Comparative Study of CNN Architectures for Binary Classification of Horses and Motorcycles in the VOC 2008 Dataset
Muhammad Annas Shaikh,Hamza Zaman,Arbaz Asif
Main category: cs.CV
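TL;DR: Evaluates nine convolutional neural networks on binary classification of horses and motorcycles in the VOC 2008 dataset, with a focus on class imbalance. Using minority-class augmentation, the study compares modern architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer. ConvNeXt-Tiny performs best, with Average Precision of 95.53% for horses and 89.12% for motorcycles, and data augmentation significantly improves minority-class detection, especially for deeper networks.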
Details
Motivation: To address class imbalance in binary classification of horses and motorcycles on the VOC 2008 dataset, and to systematically evaluate how different CNN architectures perform on the task.
Method: Minority-class data augmentation is used to handle class imbalance, and nine CNN architectures (including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer) are compared across multiple performance metrics.
Result: ConvNeXt-Tiny performs best, with Average Precision of 95.53% for horses and 89.12% for motorcycles; data augmentation significantly improves minority-class detection, particularly for deeper models.
Conclusion: For imbalanced binary classification and detection tasks, choosing a suitable architecture (such as ConvNeXt-Tiny) and combining it with data augmentation can significantly improve performance; the study offers a reference for architecture and augmentation choices in similar tasks.
Abstract: This paper presents a comprehensive evaluation of nine convolutional neural network architectures for binary classification of horses and motorcycles in the VOC 2008 dataset. We address the significant class imbalance problem by implementing minority-class augmentation techniques. Our experiments compare modern architectures including ResNet-50, ConvNeXt-Tiny, DenseNet-121, and Vision Transformer across multiple performance metrics. Results demonstrate substantial performance variations, with ConvNeXt-Tiny achieving the highest Average Precision (AP) of 95.53% for horse detection and 89.12% for motorcycle detection. We observe that data augmentation significantly improves minority class detection, particularly benefiting deeper architectures. This study provides insights into architecture selection for imbalanced binary classification tasks and quantifies the impact of data augmentation strategies in mitigating class imbalance issues in object detection.
[85] Evaluating the Impact of Weather-Induced Sensor Occlusion on BEVFusion for 3D Object Detection
Sanjay Kumar,Tim Brophy,Eoin Martino Grua,Ganesh Sistu,Valentina Donzella,Ciaran Eising
Main category: cs.CV
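TL;DR: This paper studies how camera and LiDAR sensor occlusion affects 3D object detection with the BEVFusion architecture. The system relies more heavily on LiDAR: LiDAR-only performance drops sharply under heavy occlusion, whereas occluding the camera in fused mode has only a small effect. The results point to the need for more robust, occlusion-aware fusion methods.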
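The percentage drops quoted in this entry's abstract are relative changes in mAP, not differences in percentage points. A minimal sketch of that arithmetic follows, using the before/after mAP values stated in the abstract; the dictionary keys and the function name are illustrative labels, not names from the paper.

```python
def relative_drop(before, after):
    """Relative decrease in percent when a metric falls from `before` to `after`."""
    return 100.0 * (before - after) / before


# (mAP before occlusion, mAP after occlusion), as quoted in the abstract.
reported = {
    "camera only, moderate occlusion": (35.6, 20.9),
    "LiDAR only, heavy occlusion": (64.7, 34.1),
    "fused, camera occluded": (68.5, 65.7),
    "fused, LiDAR occluded": (68.5, 50.1),
}

drops = {name: round(relative_drop(b, a), 1) for name, (b, a) in reported.items()}
for name, drop in drops.items():
    print(f"{name}: {drop}% relative drop")
```

Recomputing from the rounded mAP values reproduces the quoted 41.3%, 47.3%, and 4.1%. The fused LiDAR-occluded case comes out at 26.9% rather than the quoted 26.8%, presumably because the published mAP values are themselves rounded.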
Details
Motivation: Although BEV fusion architectures perform well in multimodal perception, the effect of environmental occlusion (such as fog, haze, or physical obstructions) on 3D detection performance has not been studied in depth, so a systematic analysis of how different sensors behave under occlusion is needed.
Method: Using the BEVFusion architecture on the nuScenes dataset, the 3D detection performance of camera and LiDAR is evaluated under different occlusion levels, with mAP and NDS as metrics, in both single-modality and fused settings.
Result: Moderate camera occlusion reduces camera-only mAP by 41.3%; heavy LiDAR occlusion reduces LiDAR-only mAP by 47.3% and severely affects long-range detection; in fused mode, occluding the camera reduces mAP by only 4.1%, whereas occluding LiDAR causes a 26.8% drop, showing that the model relies more on LiDAR.
Conclusion: Current BEV fusion models depend heavily on LiDAR input and degrade markedly when LiDAR is occluded; future work should develop robust fusion methods that tolerate partial sensor failure or degradation, together with occlusion-aware evaluation.
Abstract: Accurate 3D object detection is essential for automated vehicles to navigate safely in complex real-world environments. Bird's Eye View (BEV) representations, which project multi-sensor data into a top-down spatial format, have emerged as a powerful approach for robust perception. Although BEV-based fusion architectures have demonstrated strong performance through multimodal integration, the effects of sensor occlusions, caused by environmental conditions such as fog, haze, or physical obstructions, on 3D detection accuracy remain underexplored. In this work, we investigate the impact of occlusions on both camera and Light Detection and Ranging (LiDAR) outputs using the BEVFusion architecture, evaluated on the nuScenes dataset. Detection performance is measured using mean Average Precision (mAP) and the nuScenes Detection Score (NDS). Our results show that moderate camera occlusions lead to a 41.3% drop in mAP (from 35.6% to 20.9%) when detection is based only on the camera. On the other hand, LiDAR sharply drops in performance only under heavy occlusion, with mAP falling by 47.3% (from 64.7% to 34.1%), with a severe impact on long-range detection. In fused settings, the effect depends on which sensor is occluded: occluding the camera leads to a minor 4.1% drop (from 68.5% to 65.7%), while occluding LiDAR results in a larger 26.8% drop (to 50.1%), revealing the model's stronger reliance on LiDAR for the task of 3D object detection.
Our results highlight the need for future research into occlusion-aware evaluation methods and improved sensor fusion techniques that can maintain detection accuracy in the presence of partial sensor failure or degradation due to adverse environmental conditions.
[86] A MATLAB tutorial on deep feature extraction combined with chemometrics for analytical applications
Puneet Mishra,Martijntje Vollebregt,Yizhou Ma,Maria Font-i-Furnols
Main category: cs.CV
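TL;DR: This tutorial gives step-by-step guidance to help analytical chemistry researchers use existing open-source deep learning models to extract spatial information from imaging data and combine it with other data, such as spectral information, to strengthen data analysis.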
Details
Motivation: Despite major advances of deep learning in image processing, its use in analytical chemistry remains limited by the lack of systematic, detailed implementation guidance. Traditional chemometric methods struggle to extract complex multiscale spatial information, so more effective tools and methods are needed.
Method: Rather than training deep learning models, the tutorial shows how to use off-the-shelf open-source deep learning models (with MATLAB code) to extract deep features from several imaging modalities and to fuse those features with other data sources such as spectra.
Result: MATLAB code examples that run across different imaging modalities are provided, helping readers extract and integrate spatial information on their own datasets and improving the understanding and prediction of complex chemical data.
Conclusion: Through a structured, hands-on tutorial, the work lowers the barrier to applying deep learning in analytical chemistry and promotes the adoption of advanced image processing techniques in the field.
Abstract: Background: In analytical chemistry, spatial information about materials is commonly captured through imaging techniques, such as traditional color cameras or advanced hyperspectral cameras and microscopes. However, efficiently extracting and analyzing this spatial information for exploratory and predictive purposes remains a challenge, especially when using traditional chemometric methods. Recent advances in deep learning and artificial intelligence have significantly enhanced image processing capabilities, enabling the extraction of multiscale deep features that are otherwise challenging to capture with conventional image processing techniques. Despite the wide availability of open-source deep learning models, adoption in analytical chemistry remains limited because of the absence of structured, step-by-step guidance for implementing these models. Results: This tutorial aims to bridge this gap by providing a step-by-step guide for applying deep learning approaches to extract spatial information from imaging data and integrating it with other data sources, such as spectral information. Importantly, the focus of this work is not on training deep learning models for image processing but on using existing open-source models to extract deep features from imaging data. Significance: The tutorial provides MATLAB code demonstrations, showcasing the processing of imaging data from various imaging modalities commonly encountered in analytical chemistry.
Readers must run the tutorial steps on their own datasets using the codes presented in this tutorial.
[87] Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Jingqi Tong,Yurong Mou,Hangcheng Li,Mingzhe Li,Yongzhuo Yang,Ming Zhang,Qiguang Chen,Tianyi Liang,Xiaomeng Hu,Yining Zheng,Xinchi Chen,Jun Zhao,Xuanjing Huang,Xipeng Qiu
Main category: cs.CV
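TL;DR: Proposes "Thinking with Video", a new paradigm that uses video generation models (such as Sora-2) to unify textual and visual reasoning, overcoming the separation of text and images and the limits of static images, with strong results on both vision and text tasks.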
Details
Motivation: The existing "Thinking with Text" and "Thinking with Images" paradigms cannot handle dynamic processes well and keep text and vision separate, which limits unified multimodal understanding and generation.
Method: Proposes the "Thinking with Video" paradigm, which uses video generation models to unify textual and visual reasoning along the temporal dimension, and builds the VideoThinkBench benchmark with two task categories: vision-centric and text-centric.
Result: Sora-2 matches or even surpasses SOTA VLMs on vision-centric tasks, and on text-centric tasks reaches 92% accuracy on MATH and 75.53% on MMMU; self-consistency and in-context learning further improve performance.
Conclusion: Video generation models have the potential to become unified multimodal understanding and generation models, and "Thinking with Video" is a promising unified multimodal reasoning paradigm.
Abstract: The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision as distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that video generation models are potential unified multimodal understanding and generation models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.
[88] Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA
Itbaan Safwan,Muhammad Annas Shaikh,Muhammad Haaris,Ramail Khan,Muhammad Atif Tahir
Main category: cs.CV
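TL;DR: Proposes a multi-task framework based on a LoRA-tuned Florence-2 model that performs visual question answering, explanation generation, and visual grounding simultaneously, substantially outperforming single-task baselines in the MediaEval Medico 2025 challenge.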
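For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update it applies to a frozen weight matrix: the adapted layer computes W x + (alpha / r) * B (A x), where only the small matrices A and B are trained. This is a generic, dependency-free illustration of the technique; the matrix sizes, values, rank, and scaling are made up, and nothing here reflects which Florence-2 layers or hyperparameters the authors actually used.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Forward pass of a linear layer with a LoRA adapter.

    W: frozen weights, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    Output: W x + (alpha / r) * B (A x)
    """
    r = len(A)                          # adapter rank
    base = matvec(W, x)                 # frozen path
    update = matvec(B, matvec(A, x))    # low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical toy layer: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]
B_zero = [[0.0], [0.0]]     # B starts at zero, so the adapter is initially a no-op
B_trained = [[0.5], [-1.0]]
x = [1.0, 2.0, 3.0]

y_initial = lora_forward(W, A, B_zero, x, alpha=2.0)
y_adapted = lora_forward(W, A, B_trained, x, alpha=2.0)
print(y_initial, y_adapted)
```

With rank r much smaller than the layer dimensions, A and B hold far fewer parameters than W, which is what makes fine-tuning a large model on several tasks at once comparatively cheap.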
Details
Motivation: To improve the accuracy and interpretability of medical visual question answering systems and to address the limits of single-task models in cross-modal understanding and visual grounding.
Method: Florence-2 is fine-tuned with LoRA on three curated datasets (Kvasir-VQA-x1, a synthetically enriched explanation dataset, and a text-to-region pair dataset) for joint multi-task learning.
Result: The approach substantially outperforms single-task baselines in both answer accuracy and visual localization, confirming the effectiveness of multi-task learning for medical VQA.
Conclusion: By jointly learning visual grounding, reasoning, and explanation, the proposed multi-task framework improves both the performance and the interpretability of medical image understanding.
Abstract: We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.
[89] BoRe-Depth: Self-supervised Monocular Depth Estimation with Boundary Refinement for Embedded Systems
Chang Liu,Juan Li,Sheng Zhang,Chang Liu,Jie Li,Xu Zhang
Main category: cs.CV
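TL;DR: Proposes BoRe-Depth, a lightweight monocular depth estimation model (only 8.7M parameters) that runs efficiently on embedded systems (50.7 FPS) and significantly improves boundary quality and overall performance.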
Details
Motivation: Existing monocular depth estimation methods suffer from poor depth estimation and blurred object boundaries on embedded systems, so an efficient and accurate solution is needed.
Method: An Enhanced Feature Adaptive Fusion Module (EFAF) is designed to strengthen boundary detail representation, and semantic knowledge is integrated into the encoder to improve object recognition and boundary perception.
Result: BoRe-Depth significantly outperforms previous lightweight models on multiple challenging datasets, reaches 50.7 FPS when deployed on NVIDIA Jetson Orin, and clearly improves boundary quality.
Conclusion: BoRe-Depth improves depth accuracy and boundary sharpness while keeping a low parameter count and high inference speed, making it suitable for 3D perception in resource-constrained unmanned systems.
Abstract: Depth estimation is one of the key technologies for realizing 3D perception in unmanned systems. Monocular depth estimation has been widely researched because of its low-cost advantage, but the existing methods face the challenges of poor depth estimation performance and blurred object boundaries on embedded systems. In this paper, we propose a novel monocular depth estimation model, BoRe-Depth, which contains only 8.7M parameters. It can accurately estimate depth maps on embedded systems and significantly improves boundary quality. Firstly, we design an Enhanced Feature Adaptive Fusion Module (EFAF) which adaptively fuses depth features to enhance boundary detail representation. Secondly, we integrate semantic knowledge into the encoder to improve the object recognition and boundary perception capabilities. Finally, BoRe-Depth is deployed on NVIDIA Jetson Orin, and runs efficiently at 50.7 FPS. We demonstrate that the proposed model significantly outperforms previous lightweight models on multiple challenging datasets, and we provide detailed ablation studies for the proposed methods. The code is available at https://github.com/liangxiansheng093/BoRe-Depth.
[90] DORAEMON: A Unified Library for Visual Object Modeling and Representation Learning at Scale
Ke Du,Yimin Peng,Chao Gao,Fan Zhou,Siqiao Xue
Main category: cs.CV
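TL;DR: DORAEMON is an open-source PyTorch library that unifies visual object modeling and representation learning across scales. It supports classification, retrieval, and metric learning, offers more than 1000 pretrained backbones, and exports to ONNX or HuggingFace with a single command, speeding the path from research to deployment.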
Details
Motivation: To consolidate the datasets, models, and training techniques used in visual recognition and representation learning, enabling efficient transfer between research experiments and real-world applications.
Method: A single YAML-driven workflow integrates multiple tasks (classification, retrieval, metric learning) and exposes a large set of pretrained models through a timm-compatible interface, together with modular losses, augmentations, and distributed-training utilities.
Result: Reference results are matched or exceeded on ImageNet-1K, MS-Celeb-1M, and Stanford Online Products, and models can be exported for deployment with one command.
Conclusion: DORAEMON provides a scalable, unified platform for visual representation learning that markedly improves research efficiency and ease of practical use.
Abstract: DORAEMON is an open-source PyTorch library that unifies visual object modeling and representation learning across diverse scales. A single YAML-driven workflow covers classification, retrieval and metric learning; more than 1000 pretrained backbones are exposed through a timm-compatible interface, together with modular losses, augmentations and distributed-training utilities. Reproducible recipes match or exceed reference results on ImageNet-1K, MS-Celeb-1M and Stanford Online Products, while one-command export to ONNX or HuggingFace bridges research and deployment. By consolidating datasets, models, and training techniques into one platform, DORAEMON offers a scalable foundation for rapid experimentation in visual recognition and representation learning, enabling efficient transfer of research advances to real-world applications. The repository is available at https://github.com/wuji3/DORAEMON.
[91] HideAndSeg: an AI-based tool with automated prompting for octopus segmentation in natural habitats
Alan de Aguiar,Michaella Pereira Andrade,Charles Morphy D. Santos,João Paulo Gois
Main category: cs.CV
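TL;DR: This paper presents HideAndSeg, a minimally supervised AI tool for automatically segmenting octopuses in videos from natural environments. It combines SAM2 with a custom-trained YOLOv11 detector and introduces two unsupervised evaluation metrics, achieving robust segmentation while reducing manual intervention.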
Details
Motivation: Octopuses camouflage themselves, change color rapidly, deform non-rigidly, and are frequently occluded, and underwater lighting and turbidity vary, all of which make analysis in natural habitats very difficult; existing methods lack large-scale annotated datasets and cannot segment automatically.
Method: HideAndSeg first uses user-provided point coordinates to generate initial segmentation masks with SAM2, and these masks are used to train a YOLOv11 detector; the system then feeds the bounding boxes output by YOLO to SAM2 as prompts, automating the full pipeline; two unsupervised metrics, DICE_t and NC_t, are proposed to evaluate and refine segmentation quality.
Result: Experiments show that HideAndSeg markedly reduces segmentation noise compared with the manually prompted approach, can re-identify and segment the octopus after complete occlusion, and is robust and temporally consistent in real natural scenes.
Conclusion: The method reduces reliance on manual annotation and provides an efficient, practical video analysis tool for behavioral studies of wild cephalopods.
Abstract: Analyzing octopuses in their natural habitats is challenging due to their camouflage capability, rapid changes in skin texture and color, non-rigid body deformations, and frequent occlusions, all of which are compounded by variable underwater lighting and turbidity. Addressing the lack of large-scale annotated datasets, this paper introduces HideAndSeg, a novel, minimally supervised AI-based tool for segmenting videos of octopuses. It establishes a quantitative baseline for this task. HideAndSeg integrates SAM2 with a custom-trained YOLOv11 object detector. First, the user provides point coordinates to generate the initial segmentation masks with SAM2. These masks serve as training data for the YOLO model. After that, our approach fully automates the pipeline by providing a bounding box prompt to SAM2, eliminating the need for further manual intervention. We introduce two unsupervised metrics - temporal consistency $DICE_t$ and new component count $NC_t$ - to quantitatively evaluate segmentation quality and guide mask refinement in the absence of ground-truth data, i.e., real-world information that serves to train, validate, and test AI models. Results show that HideAndSeg achieves satisfactory performance, reducing segmentation noise compared to the manually prompted approach. Our method can re-identify and segment the octopus even after periods of complete occlusion in natural environments, a scenario in which the manually prompted model fails.
By reducing the need for manual analysis in real-world scenarios, this work provides a practical tool that paves the way for more efficient behavioral studies of wild cephalopods.
[92] Solving Convex Partition Visual Jigsaw Puzzles
Yaniv Ohayon,Ofir Itzhak Shahar,Ohad Ben-Shahar
Main category: cs.CV
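TL;DR: This paper proposes an automatic solver for convex polygonal jigsaw puzzles that combines geometric and pictorial compatibility in a greedy algorithm, and releases the first benchmark dataset of such puzzles.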
Details
Motivation: Existing work focuses mainly on square jigsaw puzzles, which limits practical use, so solvers need to be extended to broader puzzle types such as convex polygonal puzzles.
Method: A greedy solver is designed using geometric and pictorial compatibility features, and the first dataset of convex partition puzzles is built for evaluation.
Result: Convex polygonal puzzles are solved automatically, several performance measures are reported, and a benchmark dataset is provided.
Conclusion: The method significantly expands the types of puzzles that can be handled computationally, laying a foundation for broader practical applications.
Abstract: Jigsaw puzzle solving requires the rearrangement of unordered pieces into their original pose in order to reconstruct a coherent whole, often an image, and is known to be an intractable problem. While the possible impact of automatic puzzle solvers can be disruptive in various application domains, most of the literature has focused on developing solvers for square jigsaw puzzles, severely limiting their practical use. In this work, we significantly expand the types of puzzles handled computationally, focusing on what is known as Convex Partitions, a major subset of polygonal puzzles whose pieces are convex. We utilize both geometrical and pictorial compatibilities, introduce a greedy solver, and report several performance measures next to the first benchmark dataset of such puzzles.
[93] V-Thinker: Interactive Thinking with Images
Runqi Qiao,Qiuna Tan,Minghan Yang,Guanting Dong,Peiqing Yang,Shiqiang Lang,Enhui Wan,Xiaowan Wang,Yida Xu,Lan Yang,Chong Sun,Chen Li,Honggang Zhang
Main category: cs.CV
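TL;DR: This paper presents V-Thinker, a general-purpose multimodal reasoning assistant that performs image-interactive reasoning through end-to-end reinforcement learning, and builds a Data Evolution Flywheel and a Visual Progressive Training Curriculum to improve the model on diverse, high-quality, and difficult tasks.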
Details
Motivation: Existing large multimodal models are limited in how deeply they combine image interaction with long-horizon reasoning, and are constrained by small visual tool spaces and task-specific workflow designs, which makes truly image-centric reasoning hard to achieve.
Method: V-Thinker has two core components: a Data Evolution Flywheel (which automatically synthesizes, evolves, and verifies interactive reasoning datasets) and a Visual Progressive Training Curriculum (which first aligns perception through point-level supervision, then integrates interactive reasoning through two-stage reinforcement learning). An expert-verified benchmark, VTBench, is also built.
Result: Experiments show that V-Thinker outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, with a clear advantage on image-interactive reasoning tasks.
Conclusion: V-Thinker offers an effective framework for advancing image-centric interactive reasoning and shows the potential of reinforcement learning for combining deep image interaction with complex reasoning in multimodal models.
Abstract: Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
[94] Landslide Hazard Mapping with Geospatial Foundation Models: Geographical Generalizability, Data Scarcity, and Band Adaptability
Wenwen Li,Sizhe Wang,Hyunho Lee,Chenyan Lu,Sujit Roy,Rahul Ramachandran,Chia-Yu Hsu
Main category: cs.CV
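TL;DR: This study presents a three-axis analytical framework (sensor, label, domain) built on the geospatial foundation model (GeoFM) Prithvi-EO-2.0 for landslide mapping, showing superior performance and generalization across sensors, across regions, and with few labeled samples.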
Details
Motivation: Conventional deep learning models perform poorly across different sensors and regions or when labeled data are limited, and so fall short of what rapid landslide response and risk prevention require.
Method: The geospatial foundation model Prithvi-EO-2.0, built on global pretraining, self-supervised learning, and adaptable fine-tuning, is used to construct an adaptation framework along three axes (sensor, label, and domain), and a series of experiments compares it with task-specific CNNs, vision transformers, and other GeoFMs.
Result: Prithvi-EO-2.0 outperforms models such as U-Net and Segformer under a range of conditions, showing robustness to spectral variation, high accuracy under label scarcity, and good generalization across geographic regions, though it also faces high computational cost and a shortage of usable labeled data.
Conclusion: Geospatial foundation models offer a more robust and scalable solution for landslide mapping and are an important step toward disaster risk reduction and environmental monitoring.
Abstract: Landslides cause severe damage to lives, infrastructure, and the environment, making accurate and timely mapping essential for disaster preparedness and response. However, conventional deep learning models often struggle when applied across different sensors, regions, or under conditions of limited training data. To address these challenges, we present a three-axis analytical framework of sensor, label, and domain for adapting geospatial foundation models (GeoFMs), focusing on Prithvi-EO-2.0 for landslide mapping. Through a series of experiments, we show that it consistently outperforms task-specific CNNs (U-Net, U-Net++), vision transformers (Segformer, SwinV2-B), and other GeoFMs (TerraMind, SatMAE). The model, built on global pretraining, self-supervision, and adaptable fine-tuning, proved resilient to spectral variation, maintained accuracy under label scarcity, and generalized more reliably across diverse datasets and geographic settings. Alongside these strengths, we also highlight remaining challenges such as computational cost and the limited availability of reusable AI-ready training data for landslide research. Overall, our study positions GeoFMs as a step toward more robust and scalable approaches for landslide risk reduction and environmental monitoring.
[95] THEval. Evaluation Framework for Talking Head Video Generation
Nabyl Quignon,Baptiste Chopin,Yaohui Wang,Antitza Dantcheva
Main category: cs.CV
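TL;DR: Proposes a new evaluation framework comprising 8 metrics covering quality, naturalness, and synchronization, for a more complete assessment of generated talking head videos.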
Details
Motivation: Existing evaluation metrics are limited, focusing mainly on video quality, lip synchronization, and user studies, and cannot adequately assess rapidly advancing video generation techniques.
Method: An evaluation framework spanning three dimensions (quality, naturalness, synchronization) is designed, with 8 metrics chosen for efficiency and alignment with human preferences, focusing on the fine-grained dynamics of head, mouth, and eyebrows as well as face quality.
Result: Experiments on 85,000 videos generated by 17 state-of-the-art models show that although many algorithms do well at lip synchronization, they still struggle with expressiveness and artifact-free detail.
Conclusion: The proposed benchmark framework helps measure progress in generative methods; code, dataset, and leaderboards will be released publicly and updated regularly to reflect progress in the field.
Abstract: Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation primarily relies on limited metrics, evaluating general video quality and lip synchronization, and on user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on these considerations, we focus on analyzing fine-grained dynamics of head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated based on a novel real dataset that we have curated in order to mitigate bias of training data. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset and leaderboards will be publicly released and regularly updated with new methods, in order to reflect progress in the field.
[96] Learning from Single Timestamps: Complexity Estimation in Laparoscopic Cholecystectomy
Dimitrios Anastasiou,Santiago Barbarisi,Lucy Culshaw,Jayna Patel,Evangelos B. Mazomenos,Imanol Luengo,Danail Stoyanov
Main category: cs.CV
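TL;DR: This paper presents STC-Net, a new framework for automatically assessing surgical complexity in laparoscopic cholecystectomy (LC) using the Parkland Grading Scale (PGS). It operates directly on full surgical videos under weak temporal supervision and achieves 62.11% accuracy and a 61.42% F1-score, clearly outperforming baseline methods.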
Details
Motivation: Accurate assessment of LC surgical complexity is essential for clinical decision-making, but existing methods mostly rely on static images or manually trimmed clips and are hard to apply to full surgical videos in realistic settings, so a method is needed that can automatically assess inflammation severity without prior annotation.
Method: STC-Net comprises localization, window proposal, and grading modules, adopts a joint temporal localization and grading strategy, and introduces a new loss combining hard and soft localization objectives with background-aware grading supervision, performing single-timestamp complexity estimation directly on full LC videos under weak temporal supervision.
Result: On a private dataset of 1,859 LC videos, STC-Net reaches 62.11% accuracy and a 61.42% F1-score, more than 10% above non-localized baselines, confirming the effectiveness of weak supervision for surgical complexity assessment.
Conclusion: STC-Net offers a scalable and effective approach to automated PGS-based surgical complexity assessment on full LC videos, with potential for post-operative analysis and surgical training.
Abstract: Purpose: Accurate assessment of surgical complexity is essential in Laparoscopic Cholecystectomy (LC), where severe inflammation is associated with longer operative times and increased risk of postoperative complications. The Parkland Grading Scale (PGS) provides a clinically validated framework for stratifying inflammation severity; however, its automation in surgical videos remains largely unexplored, particularly in realistic scenarios where complete videos must be analyzed without prior manual curation. Methods: In this work, we introduce STC-Net, a novel framework for Single-Timestamp-based Complexity estimation in LC via the PGS, designed to operate under weak temporal supervision. Unlike prior methods limited to static images or manually trimmed clips, STC-Net operates directly on full videos. It jointly performs temporal localization and grading through a localization, window proposal, and grading module. We introduce a novel loss formulation combining hard and soft localization objectives and background-aware grading supervision. Results: Evaluated on a private dataset of 1,859 LC videos, STC-Net achieves an accuracy of 62.11% and an F1-score of 61.42%, outperforming non-localized baselines by over 10% in both metrics and highlighting the effectiveness of weak supervision for surgical complexity assessment.
Conclusion: STC-Net demonstrates a scalable and effective approach for automated PGS-based surgical complexity estimation from full LC videos, making it promising for post-operative analysis and surgical training.
[97] UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction
Chen Shi,Shaoshuai Shi,Xiaoyang Lyu,Chunyang Liu,Kehua Sheng,Bo Zhang,Li Jiang
Main category: cs.CV
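TL;DR: UniSplat is a general feed-forward framework for 3D reconstruction of dynamic scenes in autonomous driving, achieving robust reconstruction through unified latent spatio-temporal fusion.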
Details
Motivation: Existing methods perform poorly with sparse, non-overlapping camera views and complex scene dynamics, and struggle to reconstruct dynamic scenes at high quality.
Method: A 3D latent scaffold is built that uses pretrained foundation models to capture geometric and semantic context; an efficient fusion mechanism integrates information across space and time inside the 3D scaffold; a dual-branch decoder generates dynamic-aware Gaussians, and a persistent memory of static Gaussians supports streaming scene completion.
Result: Experiments on real-world datasets show that UniSplat achieves state-of-the-art performance in novel view synthesis and still renders robustly and at high quality for viewpoints outside the original camera coverage.
Conclusion: Through unified latent spatio-temporal fusion and a persistent scene representation, UniSplat addresses 3D reconstruction under sparse views and dynamic scenes, with strong generalization and application potential.
Abstract: Feed-forward 3D reconstruction for autonomous driving has advanced rapidly, yet existing methods struggle with the joint challenges of sparse, non-overlapping camera views and complex scene dynamics. We present UniSplat, a general feed-forward framework that learns robust dynamic scene reconstruction through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent scaffold, a structured representation that captures geometric and semantic scene context by leveraging pretrained foundation models. To effectively integrate information across spatial views and temporal frames, we introduce an efficient fusion mechanism that operates directly within the 3D scaffold, enabling consistent spatio-temporal alignment. To ensure complete and detailed reconstructions, we design a dual-branch decoder that generates dynamic-aware Gaussians from the fused scaffold by combining point-anchored refinement with voxel-based generation, and maintain a persistent memory of static Gaussians to enable streaming scene completion beyond current camera coverage. Extensive experiments on real-world datasets demonstrate that UniSplat achieves state-of-the-art performance in novel view synthesis, while providing robust and high-quality renderings even for viewpoints outside the original camera coverage.
[98] PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning
Yicheng Xiao,Yu Chen,Haoxuan Ma,Jiale Hong,Caorui Li,Lingxiang Wu,Haiyun Guo,Jinqiao Wang
Main category: cs.CV
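TL;DR: This paper presents PixCLIP, a new framework that handles both visual prompts and long textual descriptions to improve CLIP's fine-grained image-text alignment. The authors build the LongGRIT dataset and use a three-branch pixel-text alignment learning framework, advancing pixel-level interaction and long-text handling.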
Details
Motivation: Existing CLIP models are limited by the length limit of the text encoder and struggle with long descriptions that carry fine-grained information; and although visual prompts can strengthen local perception, there is no mechanism for optimizing them jointly with fine-grained text. A method is therefore needed that improves fine-grained processing on both the visual and the textual side.
Method: (1) An automated annotation pipeline generates pixel-level localized, long-form descriptions, producing the LongGRIT dataset of about 1.5 million samples; (2) CLIP's original text encoder is replaced with a large language model (LLM); (3) a three-branch pixel-text alignment learning framework aligns image regions with textual descriptions at arbitrary granularity.
Result: PixCLIP shows clear advantages in pixel-level image-text interaction and long-text handling, and experiments show state-of-the-art performance on several fine-grained vision-language tasks.
Conclusion: By jointly increasing the granularity of visual and textual processing, PixCLIP improves image-text alignment and offers a viable technical path for next-generation multimodal models.
Abstract: While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works adopt the strategy of explicitly increasing the granularity of visual information processing, e.g., incorporating visual prompts to guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve the model's fine-grained vision-language alignment. However, the inherent token length limitation of CLIP's text encoder fundamentally prevents CLIP from processing the more granular textual information embedded in long text sequences. To synergistically leverage the advantages of enhancing both visual and textual content processing granularity, we propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Utilizing this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples.
Secondly, we replace CLIP's original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework, facilitating fine-grained alignment between image regions and corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP showcases breakthroughs in pixel-level interaction and handling long-form texts, achieving state-of-the-art performance.
[99] Building Trust in Virtual Immunohistochemistry: Automated Assessment of Image Quality
Tushar Kataria,Shikha Dubey,Mary Bronner,Jolanta Jedrzkiewicz,Ben J. Brintz,Shireen Y. Elhabian,Beatrice S. Knudsen
Main category: cs.CV
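TL;DR: Proposes an automated, accuracy-grounded framework for assessing the quality of virtual immunohistochemistry (IHC) stained images using pixel-level stain accuracy metrics (Dice, IoU, Hausdorff distance). Conventional image fidelity metrics correlate poorly with actual stain accuracy and pathologist assessment, paired models perform better, and whole-slide evaluation reveals performance declines that patch-level evaluation misses.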
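Dice and IoU, two of the stain accuracy metrics named in this entry, both measure overlap between two binary masks, here the IHC-positive pixels in the real and the virtual stain. The following is a minimal sketch on tiny hand-written masks. The mask values are invented for illustration, and the color deconvolution step that would produce such masks from actual slides is not shown.

```python
def dice_and_iou(mask_a, mask_b):
    """Return (Dice, IoU) for two equally sized binary masks given as nested lists."""
    a = [bool(v) for row in mask_a for v in row]
    b = [bool(v) for row in mask_b for v in row]
    intersection = sum(x and y for x, y in zip(a, b))
    size_a, size_b = sum(a), sum(b)
    union = size_a + size_b - intersection
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0, 1.0
    dice = 2 * intersection / (size_a + size_b)
    iou = intersection / union
    return dice, iou


# Hypothetical 2x3 masks: 1 marks a stain-positive pixel.
real_mask = [[1, 1, 0],
             [0, 1, 0]]
virtual_mask = [[1, 0, 0],
                [0, 1, 1]]

dice, iou = dice_and_iou(real_mask, virtual_mask)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")
```

Unlike FID, PSNR, or SSIM, these overlap scores are sensitive to whether the positive pixels land in the right place, which is the property the entry argues matters for judging virtual stains.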
Details
Motivation: Existing texture- and distribution-based image quality metrics do not accurately reflect the accuracy of virtual IHC staining, and there is no automated, reproducible evaluation method that avoids manual annotation.
Method: Color deconvolution is used to generate masks of brown-stained pixels for real and virtual IHC; Dice, IoU, and Hausdorff distance are computed to assess stain accuracy; and 16 paired or unpaired image translation models are compared at patch and whole-slide level.
Result: Conventional metrics (FID, PSNR, SSIM) correlate poorly with stain accuracy and pathologist assessment; paired models (such as PyramidPix2Pix and AdaptiveNCE) achieve the highest stain accuracy; whole-slide evaluation reveals performance declines that patch evaluation cannot detect.
Conclusion: The proposed framework provides a reproducible, accuracy-based quality assessment for virtual IHC models, a key step toward their routine use in pathology.
Abstract: Deep learning models can generate virtual immunohistochemistry (IHC) stains from hematoxylin and eosin (H&E) images, offering a scalable and low-cost alternative to laboratory IHC. However, reliable evaluation of image quality remains a challenge as current texture- and distribution-based metrics quantify image fidelity rather than the accuracy of IHC staining. Here, we introduce an automated and accuracy-grounded framework to determine image quality across sixteen paired or unpaired image translation models. Using color deconvolution, we generate masks of pixels stained brown (i.e., IHC-positive) as predicted by each virtual IHC model. We use the segmented masks of real and virtual IHC to compute stain accuracy metrics (Dice, IoU, Hausdorff distance) that directly quantify correct pixel-level labeling without needing expert manual annotations. Our results demonstrate that conventional image fidelity metrics, including Frechet Inception Distance (FID), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM), correlate poorly with stain accuracy and pathologist assessment. Paired models such as PyramidPix2Pix and AdaptiveNCE achieve the highest stain accuracy, whereas unpaired diffusion- and GAN-based models are less reliable in providing accurate IHC-positive pixel labels. Moreover, whole-slide images (WSI) reveal performance declines that are invisible in patch-based evaluations, emphasizing the need for WSI-level benchmarks.
Together, this framework defines a reproducible approach for assessing the quality of virtual IHC models, a critical step to accelerate translation towards routine use by pathologists.
[100] NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment
Kylie Cancilla,Alexander Moore,Amar Saini,Carmen Carrano
Main category: cs.CV
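TL;DR: Proposes a no-reference, opinion-unaware streaming video quality assessment model that trains a temporal-aware convolutional network on synthetically degraded data to predict full-reference metrics directly, outperforming an image-based baseline and BRISQUE.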
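The training targets in this entry are full-reference metrics, which can be computed during training because the clean frame is known for every synthetically degraded frame. As a minimal illustration of one such target, the sketch below computes PSNR for a pair of tiny frames given as flat lists of pixel values. The pixel values are made up, an 8-bit range (maximum 255) is assumed, and how the paper actually builds its targets is not specified here.

```python
import math


def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    squared_errors = [(r - d) ** 2 for r, d in zip(reference, degraded)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel frames; two pixels differ by 5 grey levels.
clean_frame = [0, 255, 128, 64]
degraded_frame = [0, 250, 128, 69]

target = psnr(clean_frame, degraded_frame)
print(f"PSNR target = {target:.2f} dB")
```

At inference time no clean frame is available, so a score like this cannot be computed directly; the point of the approach described in the entry is that the network learns to predict it from the degraded video alone.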
Details
Motivation: Existing video quality assessment methods rely on reference videos or human ratings, and most no-reference methods ignore critical temporal information, which makes them hard to apply to scalable video quality assessment in real-world settings. Method: Synthetic degradations are applied to the DAVIS dataset to train a temporal-aware convolutional architecture that predicts full-reference metrics such as LPIPS, PSNR, and SSIM in a streaming fashion, with no reference video needed at inference. Result: The method outperforms the image-based baseline across multiple degradation types and correlates more strongly with full-reference metrics than BRISQUE does, validating the effectiveness of temporal modeling. Conclusion: The proposed streaming, no-reference, opinion-unaware VQA model uses temporal modeling to improve the scalability and performance of video quality assessment in real-world vision systems. Abstract: Video quality assessment (VQA) is vital for computer vision tasks, but existing approaches face major limitations: full-reference (FR) metrics require clean reference videos, and most no-reference (NR) models depend on training on costly human opinion labels. Moreover, most opinion-unaware NR methods are image-based, ignoring temporal context critical for video object detection. In this work, we present a scalable, streaming-based VQA model that is both no-reference and opinion-unaware. Our model leverages synthetic degradations of the DAVIS dataset, training a temporal-aware convolutional architecture to predict FR metrics (LPIPS, PSNR, SSIM) directly from degraded video, without references at inference. We show that our streaming approach outperforms our own image-based baseline by generalizing across diverse degradations, underscoring the value of temporal modeling for scalable VQA in real-world vision systems. Additionally, we demonstrate that our model achieves higher correlation with full-reference metrics compared to BRISQUE, a widely used opinion-aware image quality assessment baseline, validating the effectiveness of our temporal, opinion-unaware approach.
[101] Polarization-resolved imaging improves eye tracking
Mantas Žurauskas,Tom Bu,Sanaz Alali,Beyza Kalkanli,Derek Shi,Fernando Alamos,Gauresh Pandit,Christopher Mei,Ali Behrooz,Ramin Mirjalili,Dave Stronks,Alexander Fix,Dmitri Model
Main category: cs.CV
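TL;DR: This paper presents a polarization-enabled eye tracking (PET) system based on polarization-resolved near-infrared imaging; pairing a polarization-filter-array camera with a linearly polarized near-infrared illuminator improves the accuracy and robustness of eye tracking.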
Details
Motivation: Conventional intensity images offer too few features for eye tracking, and performance is limited under challenges such as eyelid occlusion, eye-relief changes, and pupil-size variation, so a new optical contrast mechanism is needed to improve tracking. Method: The PET system pairs a polarization-filter-array camera with a linearly polarized near-infrared illuminator and exploits differences in the polarization state of light reflected from the cornea and sclera to extract more trackable features; convolutional neural network models are trained on the resulting data to estimate gaze. Result: On data from 346 participants, the PET system reduced the median 95th-percentile absolute gaze error by 10-16% relative to intensity baselines under a range of realistic perturbations. Conclusion: Polarization imaging effectively improves eye tracking performance and is a simple, robust sensing modality that could be used in future wearable devices. Abstract: Polarization-resolved near-infrared imaging adds a useful optical contrast mechanism to eye tracking by measuring the polarization state of light reflected by ocular tissues in addition to its intensity. In this paper we demonstrate how this contrast can be used to enable eye tracking. Specifically, we demonstrate that a polarization-enabled eye tracking (PET) system composed of a polarization-filter-array camera paired with a linearly polarized near-infrared illuminator can reveal trackable features across the sclera and gaze-informative patterns on the cornea, largely absent in intensity-only images. Across a cohort of 346 participants, convolutional neural network based machine learning models trained on data from PET reduced the median 95th-percentile absolute gaze error by 10-16% relative to capacity-matched intensity baselines under nominal conditions and in the presence of eyelid occlusions, eye-relief changes, and pupil-size variation. These results link light-tissue polarization effects to practical gains in human-computer interaction and position PET as a simple, robust sensing modality for future wearable devices.
[102] Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
Ellis Brown,Jihan Yang,Shusheng Yang,Rob Fergus,Saining Xie
Main category: cs.CV
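TL;DR: This paper proposes a diagnostic and debiasing framework for evaluating visual understanding in multimodal large language models, finds that many vision benchmarks contain non-visual biases, and improves benchmarks through test-set stress-testing and iterative bias pruning.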
Details
Motivation: Models can bypass visual understanding on existing multimodal benchmarks by exploiting linguistic priors or superficial patterns, which leads to overestimating their visual abilities, so more robust diagnostics are needed to identify and mitigate these biases. Method: The Test-set Stress-Test (TsT) fine-tunes a large language model via k-fold cross-validation on text-only inputs to measure shortcut performance, complemented by a Random Forest diagnostic for interpretable auditing; Iterative Bias Pruning (IBP) then removes high-bias samples. Result: Pervasive non-visual biases are found on four benchmarks (VSI-Bench, CV-Bench, MMMU, and VideoMME), and VSI-Bench-Debiased is constructed, which reduces non-visual solvability and widens the vision-blind performance gap. Conclusion: Benchmark designers should proactively "attack" their own test sets to find weaknesses; the proposed framework helps build more reliable multimodal benchmarks that genuinely depend on visual understanding. Abstract: Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set" -- probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via $k$-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score $s(x)$. We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks -- VSI-Bench, CV-Bench, MMMU, and VideoMME -- we uncover pervasive non-visual biases.
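The two components described above can be sketched in a few lines. This is a toy illustration only: the "blind" model here is a per-question answer frequency table standing in for the fine-tuned LLM or Random Forest, and the names `tst_bias_scores` and `iterative_bias_pruning`, the interleaved fold assignment, the threshold, and the stopping rule are all assumptions made for the example, not the paper's procedure.

```python
from collections import Counter, defaultdict


def tst_bias_scores(samples, k=2):
    """samples: list of (text_feature, answer). Returns one held-out bias score per sample."""
    scores = [0.0] * len(samples)
    for fold in range(k):
        held_out = [i for i in range(len(samples)) if i % k == fold]
        train = [i for i in range(len(samples)) if i % k != fold]
        # "Train" on text-only inputs: count answers seen for each text feature.
        counts = defaultdict(Counter)
        for i in train:
            feat, ans = samples[i]
            counts[feat][ans] += 1
        # Score each held-out sample by the probability the blind model gives its true answer.
        for i in held_out:
            feat, ans = samples[i]
            seen = sum(counts[feat].values())
            scores[i] = counts[feat][ans] / seen if seen else 0.0
    return scores


def iterative_bias_pruning(samples, threshold=0.9, k=2, max_rounds=10):
    """Repeatedly re-score and drop samples whose bias score reaches the threshold."""
    kept = list(samples)
    for _ in range(max_rounds):
        scores = tst_bias_scores(kept, k)
        survivors = [s for s, sc in zip(kept, scores) if sc < threshold]
        if len(survivors) == len(kept):  # nothing removed: stop
            break
        kept = survivors
    return kept


# Toy test set: "how many chairs" is always answered "2", so text alone solves it.
samples = [
    ("how many chairs", "2"), ("how many chairs", "2"),
    ("how many chairs", "2"), ("how many chairs", "2"),
    ("which is closer", "left"), ("which is closer", "right"),
]
scores = tst_bias_scores(samples)
debiased = iterative_bias_pruning(samples)
print(scores)
print(debiased)
```

Because every score comes from a fold that did not see the sample during fitting, a high score indicates a pattern shared across the test set rather than memorization of that one item.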
As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
[103] SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Ellis Brown,Arijit Ray,Ranjay Krishna,Ross Girshick,Rob Fergus,Saining Xie
Main category: cs.CV
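TL;DR: This paper presents SIMS-V, a systematic data-generation framework that uses the privileged information of 3D simulators to generate spatially rich video training data, improving the real-world spatial reasoning of multimodal language models.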
Details
Motivation: Existing multimodal language models perform poorly at spatial reasoning across time and space, and spatial training on real-world video is bottlenecked by the difficulty of obtaining diverse, precisely annotated data. Method: The SIMS-V framework uses 3D simulators to generate spatially rich video data, and systematic ablations study how question types, mixes, and scales affect real-world transfer. Result: A 7B-parameter video LLM fine-tuned on only 25K simulated examples outperforms the larger 72B baseline and is competitive with proprietary models on real-world spatial reasoning benchmarks. Conclusion: With a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking), SIMS-V achieves efficient training and strong generalization, substantially improving embodied and real-world spatial tasks while maintaining general video understanding performance. Abstract: Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
[104] Cambrian-S: Towards Spatial Supersensing in Video
Shusheng Yang,Jihan Yang,Pinzhi Huang,Ellis Brown,Zihao Yang,Yue Yu,Shengbang Tong,Zihan Zheng,Yifan Xu,Muhan Wang,Daohan Lu,Rob Fergus,Yann LeCun,Li Fei-Fei,Saining Xie
Main category: cs.CV
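TL;DR: This paper proposes a "supersensing" paradigm that goes beyond linguistic-only understanding and frames spatial supersensing as four stages: semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. The authors build the VSI-SUPER benchmark (with VSR and VSC tasks) to drive progress and show the limits of data scaling by training Cambrian-S at scale. They then propose "predictive sensing": a self-supervised next-latent-frame predictor that uses prediction error to drive memory and event segmentation, substantially outperforming existing baselines on VSI-SUPER.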
Details
Motivation: Current multimodal systems mostly rely on task-driven designs and brute-force long context, and lack comprehensive evaluation and modeling of genuine spatial cognition. The authors argue that true multimodal intelligence requires continual perception, memory, spatial reasoning, and prediction, so a new paradigm and benchmark are needed to advance this direction. Method: Propose a four-stage framework for spatial supersensing; build the two-part VSI-SUPER benchmark (visual spatial recall, VSR, and continual visual spatial counting, VSC); train Cambrian-S to test the effect of data scaling; and design a predictive sensing model based on self-supervised next-latent-frame prediction that uses prediction error for memory updates and event segmentation. Result: Cambrian-S improves by 30% (absolute) on VSI-Bench but remains limited on VSI-SUPER, indicating that scale alone is insufficient; the proposed predictive sensing approach substantially outperforms leading proprietary baselines on VSI-SUPER, supporting the importance of prediction for spatial supersensing. Conclusion: True multimodal intelligence requires a shift from passive reaction to active prediction: models must not only see but also anticipate, select, and organize experience. Future work should focus on systems with internal world models rather than on brute-force context expansion. Abstract: We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation.
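The idea of using surprise to segment events can be illustrated with a small sketch. Everything concrete here is an assumption for the example: the predictor is a trivial stand-in that guesses the next latent equals the current one (the paper's predictor is learned), surprise is measured as Euclidean distance, and the function name `segment_by_surprise` and the fixed threshold are hypothetical.

```python
def segment_by_surprise(latents, threshold):
    """Split a sequence of latent vectors into events, cutting at high-surprise frames."""
    if not latents:
        return []
    events, current = [], [0]
    for t in range(1, len(latents)):
        # Stand-in predictor (assumption): the next latent equals the current one.
        predicted = latents[t - 1]
        # Surprise = prediction error, here the Euclidean distance to the observed latent.
        surprise = sum((p - x) ** 2 for p, x in zip(predicted, latents[t])) ** 0.5
        if surprise > threshold:
            events.append(current)  # close the running event at the surprising frame
            current = []
        current.append(t)
    events.append(current)
    return events


# Toy latents: two stable scenes with an abrupt change between frames 1 and 2.
latents = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
events = segment_by_surprise(latents, threshold=1.0)
print(events)
```

Each returned list holds the frame indices of one event, so a memory module could summarize or compress frames per event instead of keeping every frame in context.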
On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
[105] InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation
Jinlai Liu,Jian Han,Bin Yan,Hui Wu,Fengda Zhu,Xing Wang,Yi Jiang,Bingyue Peng,Zehuan Yuan
Main category: cs.CV
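TL;DR: InfinityStar is a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis; it performs well across a range of generation tasks and generates 720p video about 10x faster than leading diffusion-based methods.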
Details
Motivation: Existing video generation models are limited in efficiency and resolution, and there is no efficient discrete autoregressive model that handles spatial and temporal dependencies in a unified way. Method: InfinityStar takes a purely discrete spacetime autoregressive approach that jointly models spatial and temporal dependencies within a single architecture and supports tasks such as text-to-image, text-to-video, and image-to-video generation. Result: It scores 83.74 on VBench, outperforming existing autoregressive models by large margins and even surpassing some diffusion models (e.g., HunyuanVideo), and it generates a 5s, 720p video quickly without extra optimizations. Conclusion: To the authors' knowledge, InfinityStar is the first discrete autoregressive model able to generate industrial-level 720p video; its efficient, high-quality generation advances the use of autoregressive methods for video. Abstract: We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long interactive video synthesis via straightforward temporal autoregression. Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing some diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10x faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.
[106] Tracking and Understanding Object Transformations
Yihong Sun,Xinyu Yang,Jennifer J. Sun,Bharath Hariharan
Main category: cs.CV
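TL;DR: This paper introduces the Track Any State task, which aims to keep tracking a target through changes in its form while detecting and describing state changes. The authors build a new dataset, VOST-TAS, and propose TubeletGraph, a zero-shot method that uses semantic and spatial priors to recover objects lost after transformation and builds a graph describing how states evolve, achieving leading tracking performance and stronger semantic understanding under complex object transformations.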
Details
Motivation: Existing tracking methods tend to lose the target when an object undergoes a significant change in appearance (e.g., being cut, or metamorphosis), and they lack an understanding of how object states evolve, which limits their ability to model real-world dynamics. Method: TubeletGraph is a zero-shot system that first identifies track segments that may have been overlooked and decides whether to merge them based on semantic and spatial priors; it then reasons about the added tracks and generates a state graph describing each state transition. Result: On the newly proposed VOST-TAS dataset, TubeletGraph achieves state-of-the-art performance in tracking objects through transformations, and shows strong capabilities in temporal grounding and semantic reasoning about complex transformations. Conclusion: TubeletGraph handles the tracking challenges posed by object state transformations, improving tracking robustness and, through the state graph, strengthening the understanding of how objects evolve, which advances deeper modeling of object behavior in dynamic scenes. Abstract: Real-world objects frequently undergo state transformations. From an apple being cut into pieces to a butterfly emerging from its cocoon, tracking through these changes is important for understanding real-world objects and dynamics. However, existing methods often lose track of the target object after transformation, due to significant changes in object appearance. To address this limitation, we introduce the task of Track Any State: tracking objects through transformations while detecting and describing state changes, accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states are evolving over time. TubeletGraph first identifies potentially overlooked tracks, and determines whether they should be integrated based on semantic and proximity priors. Then, it reasons about the added tracks and generates a state graph describing each observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations, while demonstrating deeper understanding of object transformations and promising capabilities in temporal grounding and semantic reasoning for complex object transformations. Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io.
[107] Carousel: A High-Resolution Dataset for Multi-Target Automatic Image Cropping
Rafe Loya,Andrew Hamara,Benjamin Estell,Benjamin Kilpatrick,Andrew C. Freeman
Main category: cs.CV
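TL;DR: This paper addresses the problem of producing multiple distinct, aesthetically appealing crops in automatic image cropping, introduces a new dataset aimed at social media applications, and evaluates several single-crop models combined with an image segmentation algorithm.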