cs.CL

[1] PHANTOM RECALL: When Familiar Puzzles Fool Smart Models

Souradeep Mukhopadhyay,Rishabh Baral,Nimeesh Mahajan,Samhitha Harish,Aswin RRV,Mihir Parmar,Mutsumi Nakamura,Chitta Baral

Main category: cs.CL

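TL;DR: This paper introduces the PHANTOM RECALL benchmark for evaluating how well large language models reason on logic puzzles and on surface-level modifications of them. Models perform well on the original puzzles but degrade markedly on the modified versions, mainly because of "phantom recall" and over-elaboration.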

Details

Motivation: To investigate whether large language models can genuinely reason from first principles or merely rely on memorized templates when answering logic questions.

Method: Builds the PHANTOM RECALL benchmark, comprising 25 classic logic puzzles and 149 variants that preserve the reasoning structure but change surface details; evaluates 11 leading LLMs; and proposes an automated logical-equivalence judge, an error taxonomy, and a prompting-based mitigation framework.

Result: Models are near-perfect on the original puzzles but drop sharply on the perturbed versions. Phantom recall is widespread: models confidently output memorized answers or faulty reasoning chains that no longer apply.

Conclusion: Current LLMs lack the ability to re-reason when the context changes, exposing a gap between linguistic fluency and genuine logical understanding.

Abstract: Large language models (LLMs) such as GPT, Gemini, and Claude often appear adept at solving classic logic puzzles--but how much genuine reasoning underlies their answers? Recent evidence suggests that these models frequently rely on memorized templates rather than reasoning from first principles. When puzzles are slightly modified, their performance collapses, revealing a striking fragility. In particular, we asked: Have LLMs addressed these issues? To what extent? How about perturbations to other puzzles? Is there a general way of reformulating the prompt so that the models do better? To examine these things systematically, we introduce PHANTOM RECALL, a benchmark comprising 25 well-known logic puzzles and 149 carefully designed perturbations that preserve reasoning structure but alter superficial details and solutions. We evaluate eleven leading LLMs and identify a recurring failure mode--phantom recall--where models confidently reproduce memorized solutions or spurious rationales that no longer fit the altered scenario. To probe and mitigate this issue, we contribute three tools: (i) an automated logical-equivalence judge to detect reasoning mismatches, (ii) a taxonomy of fine-grained reasoning error categories, and (iii) a prompting-based mitigation framework guided by these categories. Despite near-perfect accuracy on unmodified puzzles, models significantly underperform humans on perturbed ones, exhibiting both phantom recall and over-elaboration. Our findings reveal a crucial limitation: LLMs often fail to re-reason when contextual cues shift--highlighting the gap between linguistic fluency and logical understanding.
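The failure mode described above can be made concrete with a small labeling rule. The sketch below is illustrative only: the `Trial` record, the `classify` function, and the exact-match comparison are assumptions for this example, whereas the paper relies on an automated logical-equivalence judge whose details are not given here.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical record for one model response to a perturbed puzzle."""
    original_answer: str   # gold answer of the unmodified puzzle
    perturbed_answer: str  # gold answer of the perturbed variant
    model_answer: str      # what the model answered for the perturbed variant


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def classify(trial: Trial) -> str:
    """Label a response as 'correct', 'phantom_recall', or 'other_error'.

    Assumption: exact string match after normalization stands in for a
    real logical-equivalence check.
    """
    answer = normalize(trial.model_answer)
    if answer == normalize(trial.perturbed_answer):
        return "correct"
    if answer == normalize(trial.original_answer):
        # The model reproduced the answer to the familiar puzzle.
        return "phantom_recall"
    return "other_error"
```

A response is flagged as phantom recall only when it matches the answer to the original puzzle while missing the answer to the perturbed one, which is why the check against the perturbed gold answer comes first.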

[2] R-WoM: Retrieval-augmented World Model For Computer-use Agents

Kai Mei,Jiang Guo,Shuaichen Chang,Mingwen Dong,Dongkyu Lee,Xing Niu,Jiarong Jiang

Main category: cs.CL

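TL;DR: This paper studies the potential of large language models (LLMs) as world models for decision-making in digital environments, finding that they predict short-term states well but degrade rapidly in long-horizon planning. The authors propose a Retrieval-augmented World Model (R-WoM) that improves simulation accuracy by bringing in up-to-date knowledge from external tutorials, and it substantially outperforms baselines on long-horizon tasks.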

Details

Motivation: LLMs can simulate future states to support decision-making, but they hallucinate and depend on static knowledge, so errors accumulate over long simulations. Their suitability as world models therefore needs systematic evaluation, and their modeling ability needs improvement.

Method: Evaluates LLMs' future state prediction and reward estimation through three tasks (next-state identification, full-procedure planning alignment, and milestone transition recognition), and proposes the Retrieval-augmented World Model (R-WoM), which augments LLM simulation with knowledge from external tutorials.

Result: LLMs capture immediate state transitions effectively, but performance drops significantly in full-procedure planning. R-WoM improves over baselines by up to 25.3% on OSWorld and 18.1% on WebArena, with particular gains in long-horizon simulation.

Conclusion: LLMs show promise for short-term state prediction but cannot reliably model long-horizon environment dynamics. Grounding simulation in external knowledge, as R-WoM does, mitigates this and improves the accuracy and practicality of the world model.

Abstract: Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models--future state prediction and reward estimation--through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.

[3] LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance

Patrick Haller,Mark Ibrahim,Polina Kirichenko,Levent Sagun,Samuel J. Bell

Main category: cs.CL

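TL;DR: This study examines how robust LLMs' internal knowledge representations are to surface-level input changes such as typos or rephrasing. As inputs drift away from the training distribution, the model's ability to separate true from false statements drops significantly, indicating that its knowledge representations are brittle and tied to surface form, which limits generalization.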

Details

Motivation: For LLMs to be reliable, they need robust knowledge that generalizes across diverse settings, yet prior work shows that LLM performance is overly sensitive to small input changes. This paper asks whether that brittleness stems from unstable internal knowledge representations.

Method: Building on the finding that LLM representations encode statement truthfulness (true and false statements are separable), the study applies semantics-preserving perturbations such as typos and rephrasing to push inputs out-of-distribution (OOD), and measures how representation separability degrades across four LLM families, five datasets, and three knowledge probing methods.

Result: As the surface form of the input diverges from the training data, the separability of true and false statements in the internal representations collapses. Models do well only when inputs resemble the pre-training data, and their judgments depend heavily on the exact wording.

Conclusion: LLMs may learn shallow, non-robust knowledge representations, which limits generalization. The finding offers an explanation for brittle benchmark performance, poses a fundamental challenge to the utility of truthfulness probes, and calls for more research on the robustness of knowledge representations.

Abstract: For Large Language Models (LLMs) to be reliable, they must learn robust knowledge that can be generally applied in diverse settings -- often unlike those seen during training. Yet, extensive research has shown that LLM performance can be brittle, with models exhibiting excessive sensitivity to trivial input variations. In this work, we explore whether this brittleness is a direct result of unstable internal knowledge representations. To explore this question, we build on previous work showing that LLM representations encode statement truthfulness -- i.e., true, factual statements can be easily separated from false, inaccurate ones. Specifically, we test the robustness of learned knowledge by evaluating representation separability on samples that have undergone superficial transformations to drive them out-of-distribution (OOD), such as typos or reformulations. By applying semantically-preserving perturbations, we study how separability degrades as statements become more OOD, across four LLM families, five evaluation datasets, and three knowledge probing methods. Our results reveal that internal representations of statement truthfulness collapse as the samples' presentations become less similar to those seen during pre-training. While LLMs can often distinguish between true and false statements when they closely resemble the pre-training data, this ability is highly dependent on the statement's exact surface form. These findings offer a possible explanation for brittle benchmark performance: LLMs may learn shallow, non-robust knowledge representations that allow for only limited generalizability. Our work presents a fundamental challenge for the utility of truthfulness probes, and more broadly, calls for further research on improving the robustness of learned knowledge representations.
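One kind of superficial transformation mentioned above is a typo. The sketch below shows a simple way to inject typos at a controllable rate so that a statement can be pushed progressively further from its clean form. The function name `swap_typos`, the adjacent-letter swap, and the `rate` parameter are assumptions for illustration; the paper's actual perturbation procedure is not specified here.

```python
import random


def swap_typos(text: str, rate: float, seed: int = 0) -> str:
    """Swap adjacent letters with probability `rate` per eligible position.

    Only letter pairs are swapped, so spaces and punctuation stay in place
    and every word keeps its length and its set of characters.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter moves at most one position
        else:
            i += 1
    return "".join(chars)
```

Raising `rate` from 0 toward 1 yields increasingly distorted versions of the same statement, which is the kind of graded shift one would need in order to plot separability against distance from the original wording.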

[4] LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

Armel Zebaze,Rachel Bawden,Benoît Sagot

Main category: cs.CL

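TL;DR: This paper studies the effect of intermediate "thinking" tokens when large reasoning models (LRMs) perform machine translation (MT). Plain chain-of-thought (CoT) generation does not improve translation quality, but intermediate tokens built by combining modular prompting strategies do, which indicates that the presence of translation attempts is what makes intermediate tokens effective.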

Details

Motivation: To explore the potential of large reasoning models for machine translation, given their success on mathematics and coding tasks, and to understand how they perform on language pairs of different resource levels and what role intermediate reasoning steps play.

Method: Analyzes the effect of generating intermediate tokens (such as chain of thought) on MT across multiple language pairs and setups, and compares standard fine-tuning against fine-tuning on synthetic CoT and against a combination of modular prompting strategies.

Result: Ordinary "thinking" tokens do not improve LRM translation performance. Fine-tuning on synthetic CoT modeled on human translators' practices does not outperform standard input-output fine-tuning. Intermediate tokens produced by combining modular, translation-specific prompting strategies do yield improvements.

Conclusion: The usefulness of intermediate tokens in MT fine-tuning depends on whether they contain actual translation attempts. Using a teacher model to refine target translations or to expand parallel corpora is more effective than distilling the teacher's reasoning.

Abstract: Large reasoning models (LRMs) have led to new possibilities in terms of problem-solving, through the devising of a natural language thought process prior to answering a query. While their capabilities are well known across mathematics and coding tasks, their impact on the task of machine translation (MT) remains underexplored. In this work, we explore the benefits of the generation of intermediate tokens when performing MT across multiple language pairs of different levels of resourcedness and multiple setups. We find that "thinking tokens" do not help LRMs better perform MT. This result generalizes to models fine-tuned to reason before translating using distilled chain of thought (CoT) inspired by human translators' practices. Specifically, fine-tuning a model with synthetic CoT explanations detailing how to translate step-by-step does not outperform standard input-output fine-tuning. However, constructing the intermediate tokens by combining the outputs of modular translation-specific prompting strategies results in improvements. Our findings underscore that the contribution of intermediate tokens during fine-tuning highly depends on the presence of translation attempts within them. More broadly, our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling their CoT explanations into "thinking" MT models.

[5] Discrepancy Detection at the Data Level: Toward Consistent Multilingual Question Answering

Lorena Calvo-Bartolomé,Valérie Aldana,Karla Cantarero,Alonso Madroñal de Mesa,Jerónimo Arenas-García,Jordan Boyd-Graber

Main category: cs.CL

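TL;DR: Proposes MIND, a user-in-the-loop fact-checking pipeline for multilingual question answering systems that detects factual and cultural discrepancies, and validates it on bilingual QA systems in the maternal and infant health domain and in other domains.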

Details

Motivation: To ensure factual consistency across languages in multilingual QA systems while accounting for cultural variation in subjective answers.

Method: Designs the MIND fact-checking pipeline, which uses user involvement to identify divergent answers to culturally sensitive questions; evaluates it on a bilingual QA system; and releases an annotated bilingual dataset.

Result: MIND reliably identifies factual and cultural inconsistencies in multilingual QA knowledge bases and generalizes well across the maternal and infant health domain and other domains.

Conclusion: MIND helps build multilingual QA systems that are more culturally aware and factually consistent.

Abstract: Multilingual question answering (QA) systems must ensure factual consistency across languages, especially for objective queries such as What is jaundice?, while also accounting for cultural variation in subjective responses. We propose MIND, a user-in-the-loop fact-checking pipeline to detect factual and cultural discrepancies in multilingual QA knowledge bases. MIND highlights divergent answers to culturally sensitive questions (e.g., Who assists in childbirth?) that vary by region and context. We evaluate MIND on a bilingual QA system in the maternal and infant health domain and release a dataset of bilingual questions annotated for factual and cultural inconsistencies. We further test MIND on datasets from other domains to assess generalization. In all cases, MIND reliably identifies inconsistencies, supporting the development of more culturally aware and factually consistent QA systems.

[6] TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition

Yupei Li,Philipp Borchert,Gerasimos Lampouras

Main category: cs.CL

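TL;DR: This paper proposes TopoAlign, a framework that decomposes code repositories and reassembles them into data structurally aligned with formal mathematical statements for training Math LLMs, significantly improving performance on several benchmarks.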

Details

Motivation: Current Math LLMs are limited by the lack of large-scale corpora pairing informal and formal mathematical statements, and training on natural-language-to-code data transfers poorly to formal mathematical reasoning.

Method: Proposes TopoAlign, which decomposes code into docstrings, main functions, and dependency functions, then reassembles them into analogues that structurally resemble formal statements, producing structurally aligned training data without human annotation.

Result: On the minif2f, Putnam, and ProofNet benchmarks, TopoAlign substantially improves DeepSeek-Math (+17.77% on BEq@10, +68.82% on typecheck@10) and yields small gains for Herald (+0.12% on BEq@10, +1.09% on typecheck@10).

Conclusion: TopoAlign makes existing code resources usable as training data for Math LLMs and improves even specialized models, supporting the value of structurally aligned code data for autoformalisation.

Abstract: Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.

[7] GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences

Priyanka Dey,Daniele Rosa,Wenqing Zheng,Daniel Barcklow,Jieyu Zhao,Emilio Ferrara

Main category: cs.CL

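TL;DR: GRAVITY is a framework for generating synthetic, profile-grounded preference data that reduces the reliance of LLM personalization on human annotation. By integrating cultural and psychological frameworks, it significantly improves personalized content generation in multicultural settings.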

Details

Motivation: Existing LLM personalization methods depend on costly human feedback or interaction logs, which scale poorly and neglect deeper user attributes. A scalable method is needed that models users' interests, values, and personality traits without heavy human annotation.

Method: Proposes GRAVITY, which combines psychological and cultural frameworks (Hofstede's cultural dimensions, Schwartz's theory of basic values, the World Values Survey, and the Big Five OCEAN personality model) to generate profile-grounded synthetic preference pairs that guide personalized content generation.

Result: Evaluated on data from 400 Amazon users, GRAVITY achieves an average gain of over 4% in preference score over baselines on book description generation. In user studies its outputs are preferred over 86% of the time, and it performs consistently across cultures.

Conclusion: Profile-grounded synthetic data captures user variation effectively and reduces reliance on costly annotation, offering a scalable path for LLM personalization.

Abstract: Personalization in LLMs often relies on costly human feedback or interaction logs, limiting scalability and neglecting deeper user attributes. To reduce the reliance on human annotations, we introduce GRAVITY (Generative Response with Aligned Values, Interests, and Traits of You), a framework for generating synthetic, profile-grounded preference data that captures users' interests, values, beliefs, and personality traits. By integrating demographic, cultural, and psychological frameworks -- including Hofstede's cultural dimensions, Schwartz's basic values, the World Values Survey, and Big Five OCEAN traits -- GRAVITY synthesizes preference pairs to guide personalized content generation. We evaluate GRAVITY on book descriptions for 400 Amazon users, comparing it to prompt-based conditioning, standard fine-tuning, and naive synthetic pair generation. Profile-grounded synthetic data consistently improves generation, especially across multiple cultures (USA, Brazil, Japan, India), achieving over 4% higher preference gains across baselines, with user studies showing that GRAVITY outputs are preferred over 86% of the time. Our results show that scenario-grounded synthetic data can capture richer user variation, reduce reliance on costly annotation, and produce more engaging, user-centered content, offering a scalable path for LLM personalization.

cs.CV

[8] Enhancing the Quality of 3D Lunar Maps Using JAXA's Kaguya Imagery

Yumi Iwashita,Haakon Moe,Yang Cheng,Adnan Ansar,Georgios Georgakis,Adrian Stoica,Kazuto Nakashima,Ryo Kurazume,Jim Torresen

Main category: cs.CV

TL;DR: Proposes a method to reduce residual noise in disparity maps caused by compressed images, thereby improving the quality of 3D lunar maps.

Details

Motivation: Kaguya TC images suffer from elevation inaccuracies caused by stereo matching errors and JPEG compression artifacts, which degrade the quality of 3D lunar maps, especially in dark regions.

Method: Analyzes the compression behavior of Kaguya TC imagery, identifies systematic disparity noise patterns, and proposes a method to reduce residual noise in disparity maps.

Result: Experiments show that the method effectively reduces elevation noise and improves the accuracy of the 3D maps.

Conclusion: The method improves the safety and reliability of lunar terrain data, supporting future long-distance lunar missions.

Abstract: As global efforts to explore the Moon intensify, the need for high-quality 3D lunar maps becomes increasingly critical, particularly for long-distance missions such as NASA's Endurance mission concept, in which a rover aims to traverse 2,000 km across the South Pole-Aitken basin. Kaguya TC (Terrain Camera) images, though globally available at 10 m/pixel, suffer from altitude inaccuracies caused by stereo matching errors and JPEG-based compression artifacts. This paper presents a method to improve the quality of 3D maps generated from Kaguya TC images, focusing on mitigating the effects of compression-induced noise in disparity maps. We analyze the compression behavior of Kaguya TC imagery, and identify systematic disparity noise patterns, especially in darker regions. In this paper, we propose an approach to enhance 3D map quality by reducing residual noise in disparity images derived from compressed images. Our experimental results show that the proposed approach effectively reduces elevation noise, enhancing the safety and reliability of terrain data for future lunar missions.

[9] Data or Language Supervision: What Makes CLIP Better than DINO?

Yiming Liu,Yuhui Zhang,Dhruba Ghosh,Ludwig Schmidt,Serena Yeung-Levy

Main category: cs.CV

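TL;DR: Compares CLIP and DINO as vision encoders in vision-language models. CLIP is better on text-intensive tasks while DINO is slightly better on vision-centric ones, mainly because CLIP captures high-level semantics.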

Details

Motivation: To determine whether CLIP's advantage over self-supervised models comes from its language supervision or from its larger training data.

Method: Pre-trains CLIP and DINO with the same architecture, dataset, and training configuration, then runs embedding analysis and evaluates both on 20 VQA benchmarks.

Result: CLIP captures high-level semantics and performs better on text-intensive tasks; DINO is more sensitive to low-level features and is slightly better on vision-centric tasks.

Conclusion: CLIP's advantage stems mainly from the high-level semantic modeling that language supervision provides, not merely from more training data, which has implications for vision encoder design and VLM performance.

Abstract: CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the same architecture, dataset, and training configuration -- achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.

[10] MammoDINO: Anatomically Aware Self-Supervision for Mammographic Images

Sicheng Zhou,Lei Wu,Cao Xiao,Parminder Bhatia,Taha Kass-Hout

Main category: cs.CV

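TL;DR: MammoDINO is a new self-supervised learning framework for mammography. Using tissue-aware data augmentation and cross-slice contrastive learning, and pretrained on 1.4 million images, it achieves state-of-the-art performance on multiple breast cancer screening tasks.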

Details

Motivation: Self-supervised learning has succeeded in general vision but is underused in medical imaging because of limited data and domain bias. Mammography analysis in particular needs an effective annotation-free pretraining method.

Method: Proposes MammoDINO, which introduces a breast-tissue-aware data augmentation sampler and a cross-slice contrastive learning objective that uses 3D DBT structure to guide 2D pretraining.

Result: Strong results on five benchmark datasets, with state-of-the-art performance on multiple breast cancer screening tasks and good generalization.

Conclusion: MammoDINO offers a scalable, annotation-free foundation model for mammography analysis that could advance multipurpose computer-aided diagnosis tools, reduce radiologists' workload, and improve screening efficiency.

Abstract: Self-supervised learning (SSL) has transformed vision encoder training in general domains but remains underutilized in medical imaging due to limited data and domain specific biases. We present MammoDINO, a novel SSL framework for mammography, pretrained on 1.4 million mammographic images. To capture clinically meaningful features, we introduce a breast tissue aware data augmentation sampler for both image-level and patch-level supervision and a cross-slice contrastive learning objective that leverages 3D digital breast tomosynthesis (DBT) structure into 2D pretraining. MammoDINO achieves state-of-the-art performance on multiple breast cancer screening tasks and generalizes well across five benchmark datasets. It offers a scalable, annotation-free foundation for multipurpose computer-aided diagnosis (CAD) tools for mammogram, helping reduce radiologists' workload and improve diagnostic efficiency in breast cancer screening.

[11] Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis

Blessing Agyei Kyem,Neema Jakisa Owor,Andrews Danyo,Joshua Kofi Asamoah,Eugene Denteh,Tanner Muturi,Anthony Dontoh,Yaw Adu-Gyamfi,Armstrong Aboah

Main category: cs.CV

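TL;DR: Proposes a dual-model framework that exploits the complementary strengths of VideoLLaMA and Qwen2.5-VL, optimizing video captioning and visual question answering separately to improve traffic safety analysis.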

Details

Motivation: To capture fine-grained behavioral patterns more effectively and generate comprehensive safety descriptions, while addressing the task interference that affects existing multi-task learning methods.

Method: Uses a dual-model architecture that separates video captioning from visual question answering, training VideoLLaMA and Qwen2.5-VL on one task each to exploit their respective strengths in temporal reasoning and visual understanding.

Result: On the WTS dataset, the method achieves an S2 score of 45.7572 in the 2025 AI City Challenge Track 2, ranking 10th. VideoLLaMA reaches a CIDEr score of 1.1001 and Qwen2.5-VL a VQA accuracy of 60.80%. Ablations show that separate training improves VQA accuracy by 8.6% over joint training while maintaining captioning quality.

Conclusion: Separate training effectively improves multi-task video understanding by letting each model play to its strengths, providing a more reliable solution for traffic safety analysis.

Abstract: Traffic safety analysis requires complex video understanding to capture fine-grained behavioral patterns and generate comprehensive descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization to address this issue. The core insight behind our approach is that separating training for captioning and visual question answering (VQA) tasks minimizes task interference and allows each model to specialize more effectively. Experimental results demonstrate that VideoLLaMA is particularly effective in temporal reasoning, achieving a CIDEr score of 1.1001, while Qwen2.5-VL excels in visual understanding with a VQA accuracy of 60.80%. Through extensive experiments on the WTS dataset, our method achieves an S2 score of 45.7572 in the 2025 AI City Challenge Track 2, placing 10th on the challenge leaderboard. Ablation studies validate that our separate training strategy outperforms joint training by 8.6% in VQA accuracy while maintaining captioning quality.

[12] PanoTPS-Net: Panoramic Room Layout Estimation via Thin Plate Spline Transformation

Hatem Ibrahem,Ahmed Salem,Qinmin Vivian Hu,Guanghui Wang

Main category: cs.CV

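TL;DR: This paper proposes PanoTPS-Net, a new model for estimating the 3D layout of a room from a single panorama image. The model combines a convolutional neural network (CNN) with a thin plate spline (TPS) spatial transformation and works in two stages: the CNN first extracts high-level features and predicts the TPS parameters, then the TPS transformation warps a reference layout into the target layout. The method handles both cuboid and non-cuboid layouts and performs well on several public datasets, with 3DIoU values of 85.49, 86.16, 81.76, and 91.98.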

Details

Motivation: Accurate 3D room layout estimation is important in computer vision, with broad applications in robotics, augmented reality, and interior design. Existing methods generalize poorly to non-cuboid layouts, so a robust method that adapts to varied room structures is needed.

Method: Proposes PanoTPS-Net with a two-stage architecture. In the first stage, a CNN extracts features from the input panorama and predicts the TPS transformation parameters. In the second stage, a TPS spatial transformation layer uses those parameters to deform a standard reference layout into the actual room layout. The design relies on the ability of TPS to model deformations and suits the characteristics of panorama images.

Result: On four public datasets (PanoContext, Stanford-2D3D, Matterport3DLayout, and ZInD), the 3DIoU scores are 85.49, 86.16, 81.76, and 91.98, respectively, outperforming existing methods. The model is accurate and robust on both cuboid and non-cuboid layouts, and TPS proves compatible with panorama images.

Conclusion: By combining a CNN with a TPS spatial transformation, PanoTPS-Net estimates diverse room layouts accurately and generalizes substantially better to complex non-cuboid scenes, offering an effective and general solution for 3D layout estimation from panoramas.

Abstract: Accurately estimating the 3D layout of rooms is a crucial task in computer vision, with potential applications in robotics, augmented reality, and interior design. This paper proposes a novel model, PanoTPS-Net, to estimate room layout from a single panorama image. Leveraging a Convolutional Neural Network (CNN) and incorporating a Thin Plate Spline (TPS) spatial transformation, the architecture of PanoTPS-Net is divided into two stages: First, a convolutional neural network extracts the high-level features from the input images, allowing the network to learn the spatial parameters of the TPS transformation. Second, the TPS spatial transformation layer is generated to warp a reference layout to the required layout based on the predicted parameters. This unique combination empowers the model to properly predict room layouts while also generalizing effectively to both cuboid and non-cuboid layouts. Extensive experiments on publicly available datasets and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed method. The results underscore the model's accuracy in room layout estimation and emphasize the compatibility between the TPS transformation and panorama images. The robustness of the model in handling both cuboid and non-cuboid room layout estimation is evident with a 3DIoU value of 85.49, 86.16, 81.76, and 91.98 on PanoContext, Stanford-2D3D, Matterport3DLayout, and ZInD datasets, respectively. The source code is available at: https://github.com/HatemHosam/PanoTPS_Net.
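To make the second stage more concrete, the sketch below shows a plain 2D thin plate spline warp in NumPy. It is an illustration under assumptions, not the PanoTPS-Net layer: here the TPS parameters are fitted from known control-point correspondences with the textbook linear system, whereas in the paper a CNN predicts them. The function names `tps_kernel`, `fit_tps`, and `apply_tps` are hypothetical.

```python
import numpy as np


def tps_kernel(r: np.ndarray) -> np.ndarray:
    """Radial basis U(r) = r^2 * log(r), with U(0) defined as 0."""
    out = np.zeros_like(r, dtype=float)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out


def fit_tps(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Solve for TPS parameters mapping control points `src` (n, 2) to `dst` (n, 2).

    Returns an (n + 3, 2) array: n kernel weights followed by 3 affine terms.
    Assumes the control points are distinct and not all collinear.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    dists = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=-1)
    K = tps_kernel(dists)                      # (n, n) kernel matrix
    P = np.hstack([np.ones((n, 1)), src])      # (n, 3) affine basis [1, x, y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T                            # side conditions on the weights
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    return np.linalg.solve(A, b)


def apply_tps(params: np.ndarray, src: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Warp arbitrary `points` (m, 2) with parameters fitted on control points `src`."""
    src = np.asarray(src, dtype=float)
    points = np.asarray(points, dtype=float)
    n = src.shape[0]
    dists = np.linalg.norm(points[:, None, :] - src[None, :, :], axis=-1)
    U = tps_kernel(dists)                      # (m, n)
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return U @ params[:n] + P @ params[n:]
```

In this picture, warping a reference layout amounts to calling `apply_tps` on the reference layout's boundary points; a learned model would replace `fit_tps` by regressing the parameters from image features.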

[13] Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning

Tanner Muturi,Blessing Agyei Kyem,Joshua Kofi Asamoah,Neema Jakisa Owor,Richard Dyzinela,Andrews Danyo,Yaw Adu-Gyamfi,Armstrong Aboah

Main category: cs.CV

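TL;DR: Proposes a framework that strengthens spatial understanding by embedding bounding box coordinates in the input prompt; it placed fourth in Track 3 of the 2025 AI City Challenge.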

Details

Motivation: Existing models generalize poorly in large-scale 3D environments because of scene clutter, occlusion, and the lack of explicit spatial grounding, so the spatial reasoning ability of vision-language systems needs to improve.

Method: Embeds mask dimensions (bounding box coordinates) in the input prompt, fine-tunes with task-specific supervision on four question categories, and appends normalized answers during training to improve consistency with the evaluation system.

Result: Achieves a final score of 73.0606 on the Physical AI Spatial Intelligence Warehouse dataset, placing 4th on the public leaderboard.

Conclusion: Structured prompt enrichment and targeted optimization effectively improve spatial reasoning in real-world industrial environments.

Abstract: Spatial reasoning in large-scale 3D environments such as warehouses remains a significant challenge for vision-language systems due to scene clutter, occlusions, and the need for precise spatial understanding. Existing models often struggle with generalization in such settings, as they rely heavily on local appearance and lack explicit spatial grounding. In this work, we introduce a dedicated spatial reasoning framework for the Physical AI Spatial Intelligence Warehouse dataset introduced in the Track 3 2025 AI City Challenge. Our approach enhances spatial comprehension by embedding mask dimensions in the form of bounding box coordinates directly into the input prompts, enabling the model to reason over object geometry and layout. We fine-tune the framework across four question categories namely: Distance Estimation, Object Counting, Multi-choice Grounding, and Spatial Relation Inference using task-specific supervision. To further improve consistency with the evaluation system, normalized answers are appended to the GPT response within the training set. Our comprehensive pipeline achieves a final score of 73.0606, placing 4th overall on the public leaderboard. These results demonstrate the effectiveness of structured prompt enrichment and targeted optimization in advancing spatial reasoning for real-world industrial environments.
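As a rough illustration of embedding bounding box coordinates in a prompt, the sketch below appends one line per region to the question. The prompt layout, the `enrich_prompt` function, and the region names are all assumptions made for this example; the template the authors actually used is not given in the abstract.

```python
def enrich_prompt(question: str, boxes: dict) -> str:
    """Append each region's bounding box (x_min, y_min, x_max, y_max) and
    its width x height to the question text.

    The output format is hypothetical and chosen only for readability.
    """
    lines = [question.strip(), "Regions:"]
    for name, (x0, y0, x1, y1) in boxes.items():
        width, height = x1 - x0, y1 - y0
        lines.append(f"- {name}: box=({x0}, {y0}, {x1}, {y1}), size={width}x{height}")
    return "\n".join(lines)
```

For example, `enrich_prompt("Which pallet is closer to the shelf?", {"pallet_1": (10, 20, 110, 80)})` yields the question followed by a `Regions:` block, giving the model explicit geometry to reason over instead of relying on appearance alone.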

[14] Evaluating the Explainability of Vision Transformers in Medical Imaging

Leili Barekatain,Ben Glocker

Main category: cs.CV

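TL;DR: This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies on medical image classification tasks, finding that DINO combined with Grad-CAM gives the most reliable and well-localized explanations.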

Details

Motivation: Vision Transformers perform strongly in medical imaging, but their complex attention mechanisms are hard to interpret, which limits clinical trust and adoption. The effectiveness of explanation methods across ViT models therefore needs to be evaluated.

Method: Applies two visualization methods, Gradient Attention Rollout and Grad-CAM, to four models (ViT, DeiT, DINO, and Swin Transformer) and analyzes them quantitatively and qualitatively on peripheral blood cell classification and breast ultrasound image classification.

Result: DINO combined with Grad-CAM produces the most faithful and localized explanations across datasets. Grad-CAM heatmaps are more class-discriminative and spatially precise, and even in misclassified cases they highlight clinically relevant morphological features.

Conclusion: DINO pre-training paired with Grad-CAM substantially improves the explainability of Vision Transformers in medical imaging and supports their trustworthy deployment in critical clinical diagnostic workflows.

Abstract: Understanding model decisions is crucial in medical imaging, where interpretability directly impacts clinical trust and adoption. Vision Transformers (ViTs) have demonstrated state-of-the-art performance in diagnostic imaging; however, their complex attention mechanisms pose challenges to explainability. This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies - ViT, DeiT, DINO, and Swin Transformer - using Gradient Attention Rollout and Grad-CAM. We conduct both quantitative and qualitative analyses on two medical imaging tasks: peripheral blood cell classification and breast ultrasound image classification. Our findings indicate that DINO combined with Grad-CAM offers the most faithful and localized explanations across datasets. Grad-CAM consistently produces class-discriminative and spatially precise heatmaps, while Gradient Attention Rollout yields more scattered activations. Even in misclassification cases, DINO with Grad-CAM highlights clinically relevant morphological features that appear to have misled the model. By improving model transparency, this research supports the reliable and explainable integration of ViTs into critical medical diagnostic workflows.
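For readers unfamiliar with rollout-style explanations, the sketch below implements plain attention rollout in NumPy. Note the substitution: the study uses Gradient Attention Rollout, which additionally weights attention by gradients, while this example shows only the simpler gradient-free version. The input layout (one array of shape heads x tokens x tokens per layer) and the function name are assumptions.

```python
import numpy as np


def attention_rollout(attentions: list) -> np.ndarray:
    """Combine per-layer attention maps into one token-to-token relevance map.

    `attentions` holds one array per layer, each of shape
    (heads, tokens, tokens) with rows that sum to 1.
    """
    tokens = attentions[0].shape[-1]
    rollout = np.eye(tokens)
    for layer_attention in attentions:
        fused = layer_attention.mean(axis=0)                # average over heads
        fused = 0.5 * fused + 0.5 * np.eye(tokens)          # model the residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)   # renormalize each row
        rollout = fused @ rollout                           # propagate through this layer
    return rollout
```

In a ViT with a class token at index 0, the row `rollout[0, 1:]` reshaped to the patch grid would give the heatmap; whether a class token exists and where it sits depends on the model.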

[15] APGNet: Adaptive Prior-Guided for Underwater Camouflaged Object Detection

Xinxin Huang,Han Sun,Junmin Cai,Ningzhong Liu,Huiyu Zhou

Main category: cs.CV

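TL;DR: This paper proposes APGNet, an adaptive prior-guided network that improves the accuracy and robustness of underwater camouflaged object detection. By combining a multi-scale receptive field module with a prior-guided mechanism, it outperforms 15 existing methods on public datasets.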

Details

Motivation: Underwater image degradation (such as low contrast and color distortion) and the natural camouflage of marine organisms make existing camouflaged object detection methods ineffective, and traditional enhancement techniques and methods designed for terrestrial scenes adapt poorly to underwater environments.

Method: Proposes APGNet, which uses a Siamese architecture with a new prior-guided mechanism; applies the MSRCR algorithm for data augmentation to mitigate image degradation; designs an Extended Receptive Field (ERF) module and a Multi-Scale Progressive Decoder (MPD) to capture contextual information; and introduces adaptive fusion of position and boundary priors, using spatial attention and deformable convolution to refine localization and contours.

Result: Experiments on two public MAS datasets show that APGNet outperforms 15 state-of-the-art methods on several widely used evaluation metrics.

Conclusion: APGNet copes effectively with underwater image degradation and camouflage, significantly improving underwater camouflaged object detection.

Abstract: Detecting camouflaged objects in underwater environments is crucial for marine ecological research and resource exploration. However, existing methods face two key challenges: underwater image degradation, including low contrast and color distortion, and the natural camouflage of marine organisms. Traditional image enhancement techniques struggle to restore critical features in degraded images, while camouflaged object detection (COD) methods developed for terrestrial scenes often fail to adapt to underwater environments due to the lack of consideration for underwater optical characteristics. To address these issues, we propose APGNet, an Adaptive Prior-Guided Network, which integrates a Siamese architecture with a novel prior-guided mechanism to enhance robustness and detection accuracy. First, we employ the Multi-Scale Retinex with Color Restoration (MSRCR) algorithm for data augmentation, generating illumination-invariant images to mitigate degradation effects. Second, we design an Extended Receptive Field (ERF) module combined with a Multi-Scale Progressive Decoder (MPD) to capture multi-scale contextual information and refine feature representations. Furthermore, we propose an adaptive prior-guided mechanism that hierarchically fuses position and boundary priors by embedding spatial attention in high-level features for coarse localization and using deformable convolution to refine contours in low-level features. Extensive experimental results on two public MAS datasets demonstrate that our proposed method APGNet outperforms 15 state-of-the-art methods under widely used evaluation metrics.

[16] VIDMP3: Video Editing by Representing Motion with Pose and Position Priors

Sandeep Mishra,Oindrila Saha,Alan C. Bovik

Main category: cs.CV

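TL;DR: This paper proposes VidMP3, a new method that uses pose and position priors to learn a generalized motion representation from a source video, enabling motion-preserving video editing while allowing flexible structural and semantic changes.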

Details

Motivation: Existing diffusion-based video editing methods do well at structure-preserving edits but suffer from temporal inconsistency, subject identity drift, and the need for human intervention when the structure changes. A method is needed that preserves motion while staying flexible in structure and semantics.

Method: VidMP3 uses pose and position priors to model the motion in the source video. By learning a generalized motion representation, it keeps the original motion when generating a new video while allowing the object's structure and semantics to be replaced.

Result: Experiments show that VidMP3 outperforms existing methods in both qualitative and quantitative evaluations, maintaining temporal consistency and subject identity while reducing the need for human intervention.

Conclusion: VidMP3 provides an effective solution for motion-preserving video editing, significantly improving editing flexibility and generation quality in scenarios where structure and semantics change.

Abstract: Motion-preserved video editing is crucial for creators, particularly in scenarios that demand flexibility in both the structure and semantics of swapped objects. Despite its potential, this area remains underexplored. Existing diffusion-based editing methods excel in structure-preserving tasks, using dense guidance signals to ensure content integrity. While some recent methods attempt to address structure-variable editing, they often suffer from issues such as temporal inconsistency, subject identity drift, and the need for human intervention. To address these challenges, we introduce VidMP3, a novel approach that leverages pose and position priors to learn a generalized motion representation from source videos. Our method enables the generation of new videos that maintain the original motion while allowing for structural and semantic flexibility. Both qualitative and quantitative evaluations demonstrate the superiority of our approach over existing methods. The code will be made publicly available at https://github.com/sandeep-sm/VidMP3.

[17] A Review on Domain Adaption and Generative Adversarial Networks(GANs)

Aashish Dhawan,Divyanshu Mudgal

Main category: cs.CV

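TL;DR: This paper discusses how, in computer vision tasks such as image classification where labeled data is scarce and costly to obtain, domain adaptation can be used to apply a model trained on one dataset to a different but related domain, overcoming the data shortage.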

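Details

Motivation: High-quality labeled data is scarce, especially where extensive manual annotation is required, which challenges traditional supervised learning. Methods that transfer knowledge across domains are therefore needed to improve model generalization.

Method: Discusses domain adaptation techniques that transfer a model trained on a source domain (such as paintings) to a target domain (such as real photographs), reducing the dependence on labeled target-domain data, possibly using strategies such as feature alignment, adversarial training, or unsupervised learning.

Result: Shows the potential of domain adaptation to ease data scarcity, achieving results on the target domain comparable to benchmark methods without large amounts of labeled data.

Conclusion: Domain adaptation is an effective strategy for dealing with insufficient labeled data. It improves generalization across different but related domains and has significant practical value.

Abstract: The major challenge in today's computer vision scenario is the availability of good quality labeled data. In a field of study like image classification, where data is of utmost importance, we need to find more reliable methods which can overcome the scarcity of data to produce results comparable to previous benchmark results. In most cases, obtaining labeled data is very difficult because of the high cost of human labor and in some cases impossible. The purpose of this paper is to discuss Domain Adaptation and various methods to implement it. The main idea is to use a model trained on a particular dataset to predict on data from a different domain of the same kind, for example - a model trained on paintings of airplanes predicting on real images of airplanes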