cs.CL

[1] DeepResearch-Slice: Bridging the Retrieval-Utilization Gap via Explicit Text Slicing

Shuo Lu,Yinuo Xu,Jianjie Cheng,Lingxiao He,Meng Wang,Jian Liang

Main category: cs.CL

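TL;DR: Proposes DeepResearch-Slice, a framework that predicts precise span indices to apply a deterministic hard filter before reasoning, bridging the retrieval-utilization gap.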

Details

Motivation: Existing models struggle to use gold evidence in noisy environments even after retrieving it, leaving a gap between retrieval and utilization.

Method: DeepResearch-Slice, a neuro-symbolic framework, predicts precise span indices and applies a deterministic hard filter instead of relying on implicit attention.

Result: Evaluations on six benchmarks show substantial robustness gains; applied to frozen backbones, the method yields a 73% relative improvement, from 19.1% to 33.0%.

Conclusion: Open-ended research needs explicit grounding mechanisms to handle noise and use evidence effectively.

Abstract: Deep Research agents predominantly optimize search policies to maximize retrieval probability. However, we identify a critical bottleneck: the retrieval-utilization gap, where models fail to use gold evidence even after it is retrieved, due to context blindness in noisy environments. To bridge this gap, we propose DeepResearch-Slice, a simple yet effective neuro-symbolic framework. Unlike implicit attention, our approach predicts precise span indices to perform a deterministic hard filter before reasoning. Extensive evaluations across six benchmarks show substantial robustness gains. Applying our method to frozen backbones yields a 73 percent relative improvement, from 19.1 percent to 33.0 percent, effectively mitigating noise without requiring parameter updates to the reasoning model. These results highlight the need for explicit grounding mechanisms in open-ended research.
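The hard-filter idea in this entry can be illustrated with a minimal sketch. It assumes the retrieved context has already been split into sentences and that some span predictor (not shown, and not specified in the abstract) returns inclusive `(start, end)` index pairs. The function name `slice_context` and the sample data are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def slice_context(sentences: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Keep only the sentences covered by inclusive (start, end) spans."""
    keep = set()
    for start, end in spans:
        # Clamp to the valid range so an out-of-bounds prediction cannot crash the filter.
        lo, hi = max(start, 0), min(end, len(sentences) - 1)
        keep.update(range(lo, hi + 1))
    # Deterministic: the same spans always yield the same filtered context.
    return [sentences[i] for i in sorted(keep)]


retrieved = [
    "Ad: buy cheap flights today.",
    "The Eiffel Tower was completed in 1889.",
    "It is located in Paris.",
    "Subscribe to our newsletter.",
]
predicted_spans = [(1, 2)]  # hypothetical output of a span predictor
filtered = slice_context(retrieved, predicted_spans)

# Relative improvement reported in the abstract: (33.0 - 19.1) / 19.1
relative_gain = (33.0 - 19.1) / 19.1
```

Only `filtered` would be passed on to the reasoning model. Unlike soft attention, everything outside the predicted spans is removed outright, which is why the reasoning model needs no parameter updates.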

[2] Internal Reasoning vs. External Control: A Thermodynamic Analysis of Sycophancy in Large Language Models

Edward Y. Chang

Main category: cs.CL

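TL;DR: Investigates whether sycophancy in large language models can be mitigated by internal reasoning or requires external regulation. Comparing internal reasoning (CoT) against an external structural mechanism (RCA), the study finds that the internal approach causes performance collapse in weak models and still leaves an 11.4% gap in strong models, whereas RCA eliminates sycophancy (0.0%). A proposed thermodynamic hierarchy holds that hybrid systems reach optimal resonance only when capabilities are matched and strong, and otherwise fall into dissonance and entropy. The conclusion stresses that external structural constraints are necessary for safety.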

Details

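Motivation: Large language models often sacrifice correctness to please users, that is, they exhibit sycophancy. The paper asks whether this can be resolved by the model's own reasoning or requires intervention by an external mechanism to keep outputs accurate and safe.

Method: Using the adversarial dataset CAP-GSM8K (N=500), the study evaluates how well internal reasoning (Chain-of-Thought, CoT) and an external structural mechanism (Recursive Cognitive Architecture, RCA) suppress sycophancy on GPT-3.5, GPT-4o, and GPT-5.1, and proposes a thermodynamic hierarchy model to analyze system efficiency.

Result: Internal reasoning triggers performance collapse in weak models (the Prioritization Paradox) and still leaves an 11.4% final output gap in frontier models, whereas RCA reduces sycophancy to 0.0% across all model tiers. The thermodynamic analysis shows that a system reaches Resonance only when internal and external capabilities are matched and strong; otherwise it falls into Dissonance and Entropy.

Conclusion: Internal reasoning alone cannot sufficiently eliminate sycophancy in large models; external structural constraints are necessary to guarantee safety.

Abstract: Large Language Models frequently exhibit sycophancy, prioritizing user agreeableness over correctness. We investigate whether this requires external regulation or can be mitigated by internal reasoning alone. Using CAP-GSM8K (N=500), an adversarial dataset, we evaluate internal (CoT) versus external (RCA) mechanisms across GPT-3.5, GPT-4o, and GPT-5.1. Our results reveal the structural limits of internal reasoning: it causes performance collapse in weak models (the Prioritization Paradox) and leaves an 11.4% final output gap in frontier models. In contrast, RCA structurally eliminates sycophancy (0.0%) across all tiers. We synthesize these findings into a thermodynamic hierarchy: hybrid systems achieve Resonance (optimal efficiency) only when capabilities are matched and strong, while weak or mismatched pairs succumb to Dissonance and Entropy. This confirms that external structural constraints are strictly necessary to guarantee safety.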

[3] Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

Kai Hu,Abhinav Aggarwal,Mehran Khodabandeh,David Zhang,Eric Hsin,Li Chen,Ankit Jain,Matt Fredrikson,Akash Bharadwaj

Main category: cs.CL

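TL;DR: Jailbreak-Zero proposes a new policy-based approach to LLM safety evaluation: an attack model generates a large volume of diverse adversarial prompts and is fine-tuned on a preference dataset, achieving Pareto optimality across policy coverage, attack diversity, and prompt fidelity. It significantly raises jailbreak success rates against models such as GPT-4o and Claude 3.5, with highly readable prompts and little need for human intervention.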

Details

Motivation: Traditional example-based LLM safety evaluation is limited by the number and diversity of samples and struggles to uncover a model's safety vulnerabilities comprehensively, so a more efficient and scalable evaluation paradigm is needed.

Method: The Jailbreak-Zero framework uses an attack LLM to generate a large volume of diverse adversarial prompts, builds a preference dataset, and fine-tunes the attack model on it, optimizing three objectives: policy coverage, attack strategy diversity, and prompt fidelity.

Result: Achieves significantly higher attack success rates on both open-source and proprietary models (such as GPT-4o and Claude 3.5); the generated prompts are readable and realistic, and little human intervention is required.

Conclusion: By moving from an example-driven to a policy-driven approach, Jailbreak-Zero offers a more scalable and comprehensive solution for LLM safety evaluation, helping to identify and mitigate the safety risks of large models more effectively.

Abstract: This paper introduces Jailbreak-Zero, a novel red teaming methodology that shifts the paradigm of Large Language Model (LLM) safety evaluation from a constrained example-based approach to a more expansive and effective policy-based framework. By leveraging an attack LLM to generate a high volume of diverse adversarial prompts and then fine-tuning this attack model with a preference dataset, Jailbreak-Zero achieves Pareto optimality across the crucial objectives of policy coverage, attack strategy diversity, and prompt fidelity to real user inputs. The empirical evidence demonstrates the superiority of this method, showcasing significantly higher attack success rates against both open-source and proprietary models like GPT-4o and Claude 3.5 when compared to existing state-of-the-art techniques. Crucially, Jailbreak-Zero accomplishes this while producing human-readable and effective adversarial prompts with minimal need for human intervention, thereby presenting a more scalable and comprehensive solution for identifying and mitigating the safety vulnerabilities of LLMs.

[4] Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support

Alif Munim,Jun Ma,Omar Ibrahim,Alhusain Abdalla,Shuolin Yin,Leo Chen,Bo Wang

Main category: cs.CL

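TL;DR: The study evaluates two open-source large language models that can run on-device (gpt-oss-20b and gpt-oss-120b) on several clinical tasks. Their performance matches or exceeds larger open-source models and some proprietary models, and fine-tuning markedly improves diagnostic accuracy to near GPT-5 level, showing the potential of local LLMs for privacy-preserving clinical decision support in resource-constrained settings.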

Details

Motivation: Address the privacy risks, dependence on cloud infrastructure, and oversized open-source models that hinder local deployment of LLMs in clinical applications, and explore efficient, secure clinical decision support for resource-constrained environments.

Method: Benchmarks gpt-oss-20b and gpt-oss-120b on three representative clinical tasks (general disease diagnosis, ophthalmology diagnosis and management, and simulated expert grading), compares them with GPT-5, o4-mini, and DeepSeek-R1, and fine-tunes gpt-oss-20b on general diagnostic data to assess adaptability.

Result: Despite their smaller size, the gpt-oss models match or exceed DeepSeek-R1 and o4-mini; after fine-tuning, the diagnostic accuracy of gpt-oss-20b improves markedly and approaches that of GPT-5.

Conclusion: Lightweight open-source LLMs that run on-device, combined with fine-tuning, can deliver high-quality clinical decision support while protecting data privacy, offering a feasible path to wider use of LLMs in routine clinical practice.

Abstract: Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but often require large model sizes that limit their use in resource-constrained clinical settings. Here, we benchmark two on-device LLMs, gpt-oss-20b and gpt-oss-120b, across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5 and o4-mini) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b on general diagnostic data. Across tasks, gpt-oss models achieve performance comparable to or exceeding DeepSeek-R1 and o4-mini despite being substantially smaller. In addition, fine-tuning remarkably improves the diagnostic accuracy of gpt-oss-20b, enabling it to approach the performance of GPT-5. These findings highlight the potential of on-device LLMs to deliver accurate, adaptable, and privacy-preserving clinical decision support, offering a practical pathway for broader integration of LLMs into routine clinical practice.

[5] OpenAI GPT-5 System Card

Aaditya Singh,Adam Fry,Adam Perelman,Adam Tart,Adi Ganesh,Ahmed El-Kishky,Aidan McLaughlin,Aiden Low,AJ Ostrow,Akhila Ananthram,Akshay Nathan,Alan Luo,Alec Helyar,Aleksander Madry,Aleksandr Efremov,Aleksandra Spyra,Alex Baker-Whitcomb,Alex Beutel,Alex Karpenko,Alex Makelov,Alex Neitz,Alex Wei,Alexandra Barr,Alexandre Kirchmeyer,Alexey Ivanov,Alexi Christakis,Alistair Gillespie,Allison Tam,Ally Bennett,Alvin Wan,Alyssa Huang,Amy McDonald Sandjideh,Amy Yang,Ananya Kumar,Andre Saraiva,Andrea Vallone,Andrei Gheorghe,Andres Garcia Garcia,Andrew Braunstein,Andrew Liu,Andrew Schmidt,Andrey Mereskin,Andrey Mishchenko,Andy Applebaum,Andy Rogerson,Ann Rajan,Annie Wei,Anoop Kotha,Anubha Srivastava,Anushree Agrawal,Arun Vijayvergiya,Ashley Tyra,Ashvin Nair,Avi Nayak,Ben Eggers,Bessie Ji,Beth Hoover,Bill Chen,Blair Chen,Boaz Barak,Borys Minaiev,Botao Hao,Bowen Baker,Brad Lightcap,Brandon McKinzie,Brandon Wang,Brendan Quinn,Brian Fioca,Brian Hsu,Brian Yang,Brian Yu,Brian Zhang,Brittany Brenner,Callie Riggins Zetino,Cameron Raymond,Camillo Lugaresi,Carolina Paz,Cary Hudson,Cedric Whitney,Chak Li,Charles Chen,Charlotte Cole,Chelsea Voss,Chen Ding,Chen Shen,Chengdu Huang,Chris Colby,Chris Hallacy,Chris Koch,Chris Lu,Christina Kaplan,Christina Kim,CJ Minott-Henriques,Cliff Frey,Cody Yu,Coley Czarnecki,Colin Reid,Colin Wei,Cory Decareaux,Cristina Scheau,Cyril Zhang,Cyrus Forbes,Da Tang,Dakota Goldberg,Dan Roberts,Dana Palmie,Daniel Kappler,Daniel Levine,Daniel Wright,Dave Leo,David Lin,David Robinson,Declan Grabb,Derek Chen,Derek Lim,Derek Salama,Dibya Bhattacharjee,Dimitris Tsipras,Dinghua Li,Dingli Yu,DJ Strouse,Drew Williams,Dylan Hunn,Ed Bayes,Edwin Arbus,Ekin Akyurek,Elaine Ya Le,Elana Widmann,Eli Yani,Elizabeth Proehl,Enis Sert,Enoch Cheung,Eri Schwartz,Eric Han,Eric Jiang,Eric Mitchell,Eric Sigler,Eric Wallace,Erik Ritter,Erin Kavanaugh,Evan Mays,Evgenii Nikishin,Fangyuan Li,Felipe Petroski Such,Filipe de Avila Belbute Peres,Filippo Raso,Florent Bekerman,Foivos 
Tsimpourlas,Fotis Chantzis,Francis Song,Francis Zhang,Gaby Raila,Garrett McGrath,Gary Briggs,Gary Yang,Giambattista Parascandolo,Gildas Chabot,Grace Kim,Grace Zhao,Gregory Valiant,Guillaume Leclerc,Hadi Salman,Hanson Wang,Hao Sheng,Haoming Jiang,Haoyu Wang,Haozhun Jin,Harshit Sikchi,Heather Schmidt,Henry Aspegren,Honglin Chen,Huida Qiu,Hunter Lightman,Ian Covert,Ian Kivlichan,Ian Silber,Ian Sohl,Ibrahim Hammoud,Ignasi Clavera,Ikai Lan,Ilge Akkaya,Ilya Kostrikov,Irina Kofman,Isak Etinger,Ishaan Singal,Jackie Hehir,Jacob Huh,Jacqueline Pan,Jake Wilczynski,Jakub Pachocki,James Lee,James Quinn,Jamie Kiros,Janvi Kalra,Jasmyn Samaroo,Jason Wang,Jason Wolfe,Jay Chen,Jay Wang,Jean Harb,Jeffrey Han,Jeffrey Wang,Jennifer Zhao,Jeremy Chen,Jerene Yang,Jerry Tworek,Jesse Chand,Jessica Landon,Jessica Liang,Ji Lin,Jiancheng Liu,Jianfeng Wang,Jie Tang,Jihan Yin,Joanne Jang,Joel Morris,Joey Flynn,Johannes Ferstad,Johannes Heidecke,John Fishbein,John Hallman,Jonah Grant,Jonathan Chien,Jonathan Gordon,Jongsoo Park,Jordan Liss,Jos Kraaijeveld,Joseph Guay,Joseph Mo,Josh Lawson,Josh McGrath,Joshua Vendrow,Joy Jiao,Julian Lee,Julie Steele,Julie Wang,Junhua Mao,Kai Chen,Kai Hayashi,Kai Xiao,Kamyar Salahi,Kan Wu,Karan Sekhri,Karan Sharma,Karan Singhal,Karen Li,Kenny Nguyen,Keren Gu-Lemberg,Kevin King,Kevin Liu,Kevin Stone,Kevin Yu,Kristen Ying,Kristian Georgiev,Kristie Lim,Kushal Tirumala,Kyle Miller,Lama Ahmad,Larry Lv,Laura Clare,Laurance Fauconnet,Lauren Itow,Lauren Yang,Laurentia Romaniuk,Leah Anise,Lee Byron,Leher Pathak,Leon Maksin,Leyan Lo,Leyton Ho,Li Jing,Liang Wu,Liang Xiong,Lien Mamitsuka,Lin Yang,Lindsay McCallum,Lindsey Held,Liz Bourgeois,Logan Engstrom,Lorenz Kuhn,Louis Feuvrier,Lu Zhang,Lucas Switzer,Lukas Kondraciuk,Lukasz Kaiser,Manas Joglekar,Mandeep Singh,Mandip Shah,Manuka Stratta,Marcus Williams,Mark Chen,Mark Sun,Marselus Cayton,Martin Li,Marvin Zhang,Marwan Aljubeh,Matt Nichols,Matthew Haines,Max Schwarzer,Mayank Gupta,Meghan Shah,Melody Huang,Meng Dong,Mengqing 
Wang,Mia Glaese,Micah Carroll,Michael Lampe,Michael Malek,Michael Sharman,Michael Zhang,Michele Wang,Michelle Pokrass,Mihai Florian,Mikhail Pavlov,Miles Wang,Ming Chen,Mingxuan Wang,Minnia Feng,Mo Bavarian,Molly Lin,Moose Abdool,Mostafa Rohaninejad,Nacho Soto,Natalie Staudacher,Natan LaFontaine,Nathan Marwell,Nelson Liu,Nick Preston,Nick Turley,Nicklas Ansman,Nicole Blades,Nikil Pancha,Nikita Mikhaylin,Niko Felix,Nikunj Handa,Nishant Rai,Nitish Keskar,Noam Brown,Ofir Nachum,Oleg Boiko,Oleg Murk,Olivia Watkins,Oona Gleeson,Pamela Mishkin,Patryk Lesiewicz,Paul Baltescu,Pavel Belov,Peter Zhokhov,Philip Pronin,Phillip Guo,Phoebe Thacker,Qi Liu,Qiming Yuan,Qinghua Liu,Rachel Dias,Rachel Puckett,Rahul Arora,Ravi Teja Mullapudi,Raz Gaon,Reah Miyara,Rennie Song,Rishabh Aggarwal,RJ Marsan,Robel Yemiru,Robert Xiong,Rohan Kshirsagar,Rohan Nuttall,Roman Tsiupa,Ronen Eldan,Rose Wang,Roshan James,Roy Ziv,Rui Shu,Ruslan Nigmatullin,Saachi Jain,Saam Talaie,Sam Altman,Sam Arnesen,Sam Toizer,Sam Toyer,Samuel Miserendino,Sandhini Agarwal,Sarah Yoo,Savannah Heon,Scott Ethersmith,Sean Grove,Sean Taylor,Sebastien Bubeck,Sever Banesiu,Shaokyi Amdo,Shengjia Zhao,Sherwin Wu,Shibani Santurkar,Shiyu Zhao,Shraman Ray Chaudhuri,Shreyas Krishnaswamy,Shuaiqi,Xia,Shuyang Cheng,Shyamal Anadkat,Simón Posada Fishman,Simon Tobin,Siyuan Fu,Somay Jain,Song Mei,Sonya Egoian,Spencer Kim,Spug Golden,SQ Mah,Steph Lin,Stephen Imm,Steve Sharpe,Steve Yadlowsky,Sulman Choudhry,Sungwon Eum,Suvansh Sanjeev,Tabarak Khan,Tal Stramer,Tao Wang,Tao Xin,Tarun Gogineni,Taya Christianson,Ted Sanders,Tejal Patwardhan,Thomas Degry,Thomas Shadwell,Tianfu Fu,Tianshi Gao,Timur Garipov,Tina Sriskandarajah,Toki Sherbakov,Tomer Kaftan,Tomo Hiratsuka,Tongzhou Wang,Tony Song,Tony Zhao,Troy Peterson,Val Kharitonov,Victoria Chernova,Vineet Kosaraju,Vishal Kuo,Vitchyr Pong,Vivek Verma,Vlad Petrov,Wanning Jiang,Weixing Zhang,Wenda Zhou,Wenlei Xie,Wenting Zhan,Wes McCabe,Will DePue,Will Ellsworth,Wulfie Bain,Wyatt Thompson,Xiangning 
Chen,Xiangyu Qi,Xin Xiang,Xinwei Shi,Yann Dubois,Yaodong Yu,Yara Khakbaz,Yifan Wu,Yilei Qian,Yin Tat Lee,Yinbo Chen,Yizhen Zhang,Yizhong Xiong,Yonglong Tian,Young Cha,Yu Bai,Yu Yang,Yuan Yuan,Yuanzhi Li,Yufeng Zhang,Yuguang Yang,Yujia Jin,Yun Jiang,Yunyun Wang,Yushi Wang,Yutian Liu,Zach Stubenvoll,Zehao Dou,Zheng Wu,Zhigang Wang

Main category: cs.CL

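TL;DR: GPT-5 is a unified system comprising a fast-responding main model and a deeper reasoning model for complex problems, dispatched by a real-time router, with notable gains in safety, usefulness, and hallucination reduction.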

Details

Motivation: Improve the usefulness and safety of large models in real-world settings, manage the trade-off between efficiency and accuracy on complex tasks, and guard against potential biological and chemical risks.

Method: A two-model architecture (gpt-5-main and gpt-5-thinking) with a routing mechanism that is trained continuously on real signals to refine model selection; introduces the safe-completions safety approach and applies precautionary safeguards to models with high-risk capabilities.

Result: GPT-5 outperforms previous models on benchmarks, responds faster, hallucinates less, and does better at instruction following and at writing, coding, and health use cases; the router automatically selects a suitable model based on conversation type; some models are designated high-risk and the associated safeguards are activated.

Conclusion: Through system-level innovation GPT-5 improves both performance and safety, an important step toward more efficient, reliable, and responsible AI.

Abstract: This is the system card published alongside the OpenAI GPT-5 launch, August 2025. GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say 'think hard about this' in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. This system card focuses primarily on gpt-5-thinking and gpt-5-main, while evaluations for other models are available in the appendix. The GPT-5 system not only outperforms previous models on benchmarks and answers questions more quickly, but -- more importantly -- is more useful for real-world queries. We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, and have leveled up GPT-5's performance in three of ChatGPT's most common uses: writing, coding, and health. All of the GPT-5 models additionally feature safe-completions, our latest approach to safety training to prevent disallowed content. Similarly to ChatGPT agent, we have decided to treat gpt-5-thinking as High capability in the Biological and Chemical domain under our Preparedness Framework, activating the associated safeguards. While we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm -- our defined threshold for High capability -- we have chosen to take a precautionary approach.

[6] WRAVAL -- WRiting Assist eVALuation

Gabriel Benedict,Matthew Butler,Naved Merchant,Eetu Salama-Laine

Main category: cs.CL

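TL;DR: Proposes a new evaluation framework for measuring the practical capabilities of small language models (SLMs) on non-reasoning tasks such as tone modification, compensating for the bias of traditional evaluations toward large models.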

Details

Motivation: Existing language model evaluation focuses heavily on reasoning and problem solving, so the strengths of small language models in real industrial applications are underestimated. An evaluation closer to practical use is needed.

Method: A framework combining data generation, prompt tuning, and LLM-based evaluation, using task-specific fine-tuning to assess small language models on tasks without predefined datasets.

Result: The framework shows that small language models have considerable potential on common industrial tasks such as tone adjustment, performing better than general-purpose evaluation metrics alone would suggest.

Conclusion: The work provides an effective evaluation tool and practical guidance for applying small language models in edge and private computing scenarios.

Abstract: The emergence of Large Language Models (LLMs) has shifted language model evaluation toward reasoning and problem-solving tasks as measures of general intelligence. Small Language Models (SLMs) -- defined here as models under 10B parameters -- typically score 3-4 times lower than LLMs on these metrics. However, we demonstrate that these evaluations fail to capture SLMs' effectiveness in common industrial applications, such as tone modification tasks (e.g., funny, serious, professional). We propose an evaluation framework specifically designed to highlight SLMs' capabilities in non-reasoning tasks where predefined evaluation datasets don't exist. Our framework combines novel approaches in data generation, prompt-tuning, and LLM-based evaluation to demonstrate the potential of task-specific finetuning. This work provides practitioners with tools to effectively benchmark both SLMs and LLMs for practical applications, particularly in edge and private computing scenarios. Our implementation is available at: https://github.com/amazon-science/wraval.

[7] The Instruction Gap: LLMs get lost in Following Instruction

Vishesh Tripathi,Uday Allu,Biddwan Ahmed

Main category: cs.CL

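TL;DR: Evaluates the instruction-following ability of 13 leading large language models in enterprise RAG scenarios, finds large differences between models, identifies the "instruction gap" as a key challenge, and offers insights for real-world deployment and benchmarking.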

Details

Motivation: Enterprises deploying large language models face inconsistent instruction following, which undermines reliability in practice, so a systematic evaluation of instruction following in realistic scenarios is needed.

Method: Systematically tests 13 leading large language models using sample data and enterprise-grade evaluation protocols, assessing instruction compliance, response accuracy, and overall performance in retrieval-augmented generation (RAG) scenarios.

Result: Models do well on general tasks but differ markedly in adhering precisely to custom instructions; Claude-Sonnet-4 and GPT-5 perform best, revealing the "instruction gap".

Conclusion: Instruction following is a key bottleneck for enterprise deployment of current large language models and needs targeted optimization; the study provides practical benchmarks and direction for model selection and model development.

Abstract: Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, yet their deployment in enterprise environments reveals a critical limitation: inconsistent adherence to custom instructions. This study presents a comprehensive evaluation of 13 leading LLMs across instruction compliance, response accuracy, and performance metrics in real-world RAG (Retrieval-Augmented Generation) scenarios. Through systematic testing with samples and enterprise-grade evaluation protocols, we demonstrate that instruction following varies dramatically across models, with Claude-Sonnet-4 and GPT-5 achieving the highest results. Our findings reveal the "instruction gap" - a fundamental challenge where models excel at general tasks but struggle with precise instruction adherence required for enterprise deployment. This work provides practical insights for organizations deploying LLM-powered solutions and establishes benchmarks for instruction-following capabilities across major model families.

[8] Advances and Challenges in Semantic Textual Similarity: A Comprehensive Survey

Lokendra Kumar,Neelesh S. Upadhye,Kannan Piedy

Main category: cs.CL

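TL;DR: Surveys six areas of progress in semantic textual similarity (STS) since 2021: transformer-based models, contrastive learning, domain-specific methods, multi-modal methods, graph-based approaches, and knowledge-enhanced techniques. It summarizes current advances and challenges and offers guidance for future research.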

Details

Motivation: With STS advancing rapidly, a systematic review of recent progress is needed to help researchers understand the state of the field, identify challenges, and see where it is heading.

Method: A survey that categorizes and analyzes representative recent STS work along six key directions, covering model architectures, training methods, and application scenarios.

Result: Summarizes the performance of advanced models such as FarSSiBERT, DeBERTa-v3, and AspectCSE and the effectiveness of domain-specific models such as CXR-BERT and Financial-STS, and notes how multi-modal, graph-based, and knowledge-integrated methods improve semantic representation.

Conclusion: STS has advanced on several fronts but still faces challenges in generalization, data dependence, and cross-domain adaptation; future work is expected to move toward more efficient pre-training, multi-modal fusion, and knowledge-guided semantic modeling.

Abstract: Semantic Textual Similarity (STS) research has expanded rapidly since 2021, driven by advances in transformer architectures, contrastive learning, and domain-specific techniques. This survey reviews progress across six key areas: transformer-based models, contrastive learning, domain-focused solutions, multi-modal methods, graph-based approaches, and knowledge-enhanced techniques. Recent transformer models such as FarSSiBERT and DeBERTa-v3 have achieved remarkable accuracy, while contrastive methods like AspectCSE have established new benchmarks. Domain-adapted models, including CXR-BERT for medical texts and Financial-STS for finance, demonstrate how STS can be effectively customized for specialized fields. Moreover, multi-modal, graph-based, and knowledge-integrated models further enhance semantic understanding and representation. By organizing and analyzing these developments, the survey provides valuable insights into current methods, practical applications, and remaining challenges. It aims to guide researchers and practitioners alike in navigating rapid advancements, highlighting emerging trends and future opportunities in the evolving field of STS.

[9] Less is more: Not all samples are effective for evaluation

Wentang Song,Jinqiang Li,Kele Huang,Junhui Lin,Shengxiang Wu,Zhongshi Xie

Main category: cs.CL

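TL;DR: Proposes a test-set compression framework that needs no historical model performance data. Using domain-adapted embeddings and task-aware clustering, it substantially reduces redundancy while preserving benchmark integrity, cutting evaluation cost by over 90%.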

Details

Motivation: Existing test-set compression methods depend on correctness labels from historical models and are hard to apply in cold-start scenarios (new tasks, new domains, or new models). A compression method that does not rely on prior performance data is needed.

Method: First fine-tunes a base LLM on a small amount of domain data to capture domain semantics, then generates high-level semantic embeddings from the text content alone; performs task-aware clustering in this embedding space and introduces a dataset X-ray mechanism that analyzes cluster geometry to adjust compression intensity dynamically.

Result: Experiments on professional-domain datasets, including a 3GPP communications benchmark, show that the method effectively identifies and removes redundant samples, reducing evaluation cost by over 90% while maintaining high fidelity to the full benchmark.

Conclusion: The history-free framework suits cold-start scenarios, achieving efficient, high-fidelity test-set compression without historical evaluation data and providing a scalable, economical solution for evaluating LLMs in vertical domains.

Abstract: The versatility of Large Language Models (LLMs) in vertical domains has spurred the development of numerous specialized evaluation benchmarks. However, these benchmarks often suffer from significant semantic redundancy and impose high computational costs during evaluation. Existing compression methods, such as tinyBenchmarks, depend critically on correctness labels from multiple historical models evaluated on the full test set, making them inapplicable in cold-start scenarios, such as the introduction of a new task, domain, or model with no prior evaluation history. To address this limitation, we propose a history-free test set compression framework that requires no prior model performance data. Our method begins by fine-tuning a base LLM on a small amount of domain-specific data to internalize task-relevant semantics. It then generates high-level semantic embeddings for all original test samples using only their raw textual content. In this domain-adapted embedding space, we perform task-aware clustering and introduce a novel dataset X-ray mechanism that analyzes cluster geometry to dynamically calibrate the compression intensity based on the intrinsic redundancy of the benchmark. Experiments on professional-domain datasets, notably a large-scale 3GPP communications benchmark, demonstrate that our approach effectively identifies and removes redundant samples, reducing evaluation cost by over 90% while preserving high fidelity to the full benchmark.
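To make the embedding-based redundancy removal concrete, here is a deliberately simplified sketch. It is not the paper's task-aware clustering or its X-ray mechanism: it uses a greedy "leader" pass over precomputed embeddings with a fixed cosine threshold, standing in for the compression intensity the paper calibrates dynamically. The function names, toy vectors, and the threshold value are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def compress(embeddings: List[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Return indices of representative samples.

    A sample is kept only if it is less similar than `threshold` to every
    representative chosen so far; otherwise it is treated as redundant.
    """
    reps: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[r]) < threshold for r in reps):
            reps.append(i)
    return reps


# Toy 2-D "embeddings": samples 0/1 and 2/3 are near-duplicates.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [0.7, 0.7]]
kept = compress(toy)
```

Lowering `threshold` compresses more aggressively. In the paper's setting the embeddings would come from the domain-adapted LLM rather than hand-written vectors.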

[10] GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators

Naseem Machlovi,Maryam Saleki,Ruhul Amin,Mohamed Rahouti,Shawqi Al-Maliki,Junaid Qadir,Mohamed M. Abdallah,Ala Al-Fuqaha

Main category: cs.CL

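TL;DR: Presents GuardEval, a multi-perspective benchmark dataset with 106 fine-grained categories, and GemmaGuard (GGuard), a model fine-tuned from Gemma3-12B, to improve LLM content moderation. Experiments show that GGuard significantly outperforms leading moderation models on complex and borderline cases, demonstrating the importance of diverse, representative data for safety, fairness, and robustness.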

Details

Motivation: Existing large language models perform poorly on subtle, subjective content such as implicit offensiveness, gender and racial bias, and jailbreak prompts, and are susceptible to training-data bias, leading to inconsistent moderation and ethical problems. A finer-grained, multi-perspective safety evaluation is therefore needed.

Method: Builds GuardEval, a unified multi-perspective benchmark dataset covering human emotions, offensive language, bias, and broader safety concerns, and fine-tunes Gemma3-12B on it with QLoRA to obtain GemmaGuard (GGuard), which is used to assess content moderation with fine-grained labels.

Result: GGuard reaches a macro F1 score of 0.832 on GuardEval, significantly outperforming OpenAI Moderator (0.64) and Llama Guard (0.61), with stronger robustness and fairness on borderline cases and complex contexts.

Conclusion: Multi-perspective, human-centered safety benchmarks are critical for reducing bias and inconsistency in content moderation; GuardEval and GGuard together show that high-quality, diverse data improves the safety and reliability of moderation systems.

Abstract: As large language models (LLMs) become deeply embedded in daily life, the urgent need for safer moderation systems, distinguishing naive from harmful requests while upholding appropriate censorship boundaries, has never been greater. While existing LLMs can detect harmful or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial bias, and broader safety concerns. We also present GemmaGuard (GGuard), a QLoRA fine-tuned version of Gemma3-12B trained on GuardEval, to assess content moderation with fine-grained labels. Our evaluation shows that GGuard achieves a macro F1 score of 0.832, substantially outperforming leading moderation models, including OpenAI Moderator (0.64) and Llama Guard (0.61). We show that multi-perspective, human-centered safety benchmarks are critical for reducing biased and inconsistent moderation decisions. GuardEval and GGuard together demonstrate that diverse, representative data materially improve safety, fairness, and robustness on complex, borderline cases.
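The headline numbers in this entry are macro F1 scores, which average per-class F1 without weighting by class frequency, so rare categories count as much as common ones. The sketch below computes the metric from scratch on made-up labels; the label names and predictions are hypothetical and do not come from GuardEval.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["safe", "safe", "hate", "bias", "bias", "hate"]
pred = ["safe", "hate", "hate", "bias", "safe", "hate"]
score = macro_f1(gold, pred)  # per-class F1: safe 0.5, hate 0.8, bias 2/3
```

With 106 fine-grained categories, a model that ignores rare classes would be penalized heavily under this metric, which is presumably why it suits a benchmark of this shape.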

[11] LLM_annotate: A Python package for annotating and analyzing fiction characters

Hannes Rosenbusch

Main category: cs.CL

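TL;DR: LLM_annotate is an LLM-based Python package that standardizes the workflow for analyzing the personality traits of fiction characters. It supports text chunking, character behavior annotation, name disambiguation, and quality assessment, provides a human-in-the-loop interface for validating result quality, and works with a variety of LLMs.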

Details

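Motivation: Improve the efficiency and reproducibility of personality analysis of fictional characters, addressing the inconsistency of existing methods on long texts and the lack of a standardized workflow.

Method: Develops the Python package LLM_annotate, which integrates text preprocessing, LLM annotation, character name disambiguation, quality scoring, and character-level statistics and embedding computation, and validates annotation and inference quality through a graphical human-in-the-loop interface.

Result: Demonstrated on The Simpsons Movie and the novel Pride and Prejudice, yielding efficient and reproducible character analyses.

Conclusion: LLM_annotate offers a flexible, standardized, and extensible solution for character analysis in literature, film, and related fields, supports any LLM, and advances methods in computational humanities research.

Abstract: LLM_annotate is a Python package for analyzing the personality of fiction characters with large language models. It standardizes workflows for annotating character behaviors in full texts (e.g., books and movie scripts), inferring character traits, and validating annotation/inference quality via a human-in-the-loop GUI. The package includes functions for text chunking, LLM-based annotation, character name disambiguation, quality scoring, and computation of character-level statistics and embeddings. Researchers can use any LLM, commercial, open-source, or custom, within LLM_annotate. Through tutorial examples using The Simpsons Movie and the novel Pride and Prejudice, I demonstrate the usage of the package for efficient and reproducible character analyses.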

[12] Topic Segmentation Using Generative Language Models

Pierre Mackenzie,Maya Shah,Patrick Frenett

Main category: cs.CL

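TL;DR: Proposes a recursive and overlapping prompting strategy for topic segmentation with generative large language models (LLMs) and supports the boundary similarity evaluation metric. Experiments show that LLMs can outperform existing methods on this task, though open issues remain.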

Details

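Motivation: Existing topic segmentation methods rely on semantic similarity between sentences and make no use of long-range dependencies or large-scale knowledge, both of which large language models offer, so exploring LLMs for topic segmentation is worthwhile.

Method: Proposes an overlapping and recursive prompting strategy based on sentence enumeration to identify topic boundaries with generative LLMs, and adopts boundary similarity as the evaluation metric.

Result: LLM-based methods segment topics more effectively than traditional methods but still have problems with consistency and stability.

Conclusion: LLMs show promise for topic segmentation, but further work on reliability is needed before they can be used in practice.

Abstract: Topic segmentation using generative Large Language Models (LLMs) remains relatively unexplored. Previous methods use semantic similarity between sentences, but such models lack the long range dependencies and vast knowledge found in LLMs. In this work, we propose an overlapping and recursive prompting strategy using sentence enumeration. We also support the adoption of the boundary similarity evaluation metric. Results show that LLMs can be more effective segmenters than existing methods, but issues remain to be solved before they can be relied upon for topic segmentation.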
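One plausible reading of "overlapping prompting with sentence enumeration" is sketched below: the document is cut into overlapping windows and every sentence is prefixed with its global index, so an LLM could answer with boundary indices that remain comparable across windows. The abstract does not give window sizes or prompt wording, so `enumerated_windows`, `window`, and `overlap` are assumptions, and the recursive step and the LLM call itself are left out.

```python
from typing import List, Tuple


def enumerated_windows(
    sentences: List[str], window: int = 4, overlap: int = 1
) -> List[Tuple[int, str]]:
    """Split sentences into overlapping windows of numbered lines.

    Returns (start_index, text) pairs; each line looks like "[i] sentence",
    where i is the sentence's index in the whole document.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    out: List[Tuple[int, str]] = []
    start = 0
    while start < len(sentences):
        end = min(start + window, len(sentences))
        lines = [f"[{i}] {sentences[i]}" for i in range(start, end)]
        out.append((start, "\n".join(lines)))
        if end == len(sentences):
            break  # the last window reached the end of the document
        start += step
    return out


doc = ["Cats purr.", "Cats nap.", "Dogs bark.", "Dogs fetch.", "Rain falls.", "Snow melts."]
windows = enumerated_windows(doc, window=4, overlap=1)
```

Because neighbouring windows share `overlap` sentences, a boundary that falls near the edge of one window is also seen with more context in the next.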

[13] Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64

Bugra Kilictas,Faruk Alpay

Main category: cs.CL

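TL;DR: Proposes a software architecture called the "Virtual Tensor Core" that optimizes large language model inference on ARM64 through direct memory mapping and hand-tuned SIMD kernels, achieving efficient cache utilization and zero-copy loading, with stable throughput above 60 tokens per second on an M2 chip.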

Details

Motivation: Break through the "Memory Wall" bottleneck in deploying large language models on edge devices and reduce the overhead that conventional runtimes incur from high-level abstractions, dynamic dispatch, and unaligned memory access.

Method: Designs and implements mmap-based direct memory mapping, hand-optimized NEON SIMD kernels, and a Tensor Virtualization Layout (TVL) that guarantees 100% cache line utilization, with a zero-copy loader that eliminates initialization latency.

Result: Achieves stable throughput above 60 tokens/second on an M2 chip with a 110M-parameter model and meets the 200 ms psycholinguistic latency threshold.

Conclusion: Provides a fully open, portable, and deterministic reference implementation for studying memory bottlenecks on general-purpose ARM hardware, without relying on proprietary hardware accelerators.

Abstract: The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall": the bottleneck where data movement latency outstrips arithmetic throughput. Standard inference runtimes often incur significant overhead through high-level abstractions, dynamic dispatch, and unaligned memory access patterns. In this work, we present a novel "Virtual Tensor Core" architecture implemented in software, optimized specifically for ARM64 microarchitectures (Apple Silicon). By bypassing standard library containers in favor of direct memory mapping (mmap) and implementing hand-tuned NEON SIMD kernels, we achieve a form of "Software-Defined Direct Memory Access (DMA)." Our proposed Tensor Virtualization Layout (TVL) guarantees 100% cache line utilization for weight matrices, while our zero-copy loader eliminates initialization latency. Experimental results on a 110M parameter model demonstrate a stable throughput of >60 tokens/second on M2 hardware. While proprietary hardware accelerators (e.g., Apple AMX) can achieve higher peak throughput, our architecture provides a fully open, portable, and deterministic reference implementation for studying the memory bottleneck on general-purpose ARM silicon, meeting the 200ms psycholinguistic latency threshold without opaque dependencies.
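The zero-copy loading idea can be shown in miniature with Python's standard `mmap` module: the weights file is mapped into the address space and viewed as float32 values without first being read into a separate buffer. This is only an illustration of the mmap concept; the paper's loader, its TVL layout, and its NEON kernels are native code and are not reproduced here. The file contents and flat float32 layout below are stand-ins.

```python
import mmap
import os
import tempfile
from array import array

# Write a tiny stand-in "weights" file of raw float32 values (hypothetical layout).
weights = array("f", [0.5, -1.0, 2.0, 4.0])
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(weights.tobytes())
    path = tmp.name

with open(path, "rb") as f:
    # Map the whole file read-only; pages are faulted in lazily by the OS.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    raw = memoryview(mm)    # byte view over the mapping, no copy
    view = raw.cast("f")    # reinterpret the same bytes as float32
    first = view[0]
    total = sum(view)
    # Views must be released before the mapping can be closed.
    view.release()
    raw.release()
    mm.close()

os.remove(path)
```

The only per-element copies happen when values are read into Python floats. A native kernel would operate on the mapped pages directly, which is the property the abstract's "zero-copy loader" relies on.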

[14] A path to natural language through tokenisation and transformers

David S. Berman,Alexander G. Stapleton

Main category: cs.CL

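TL;DR: Studies how byte-pair encoding (BPE), widely used in modern transformer models, changes corpus statistics, finding that recursive application of BPE drives token frequencies toward a Zipfian power law and induces a characteristic growth pattern in empirical entropy.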

Details

Motivation: Understand how BPE tokenization affects the statistical regularities of natural language, particularly its relationship to Zipf's and Heaps' laws.

Method: Analyzes the information entropy of various corpora under a Zipfian distribution assumption, derives a closed-form expression for the slot entropy expectation value, trains language models at varying BPE depths, and applies attention-based diagnostics.

Result: BPE is not only a compression mechanism but also a statistical transform that reconstructs key informational properties of natural language; as BPE depth increases, model predictive entropies agree more closely with Zipf-derived predictions and local token dependencies weaken, approaching a weakly dependent (near-IID) regime.

Conclusion: Recursive BPE shapes the statistical structure of language toward the typical regularities of natural language, which may improve model learning efficiency and generalization.

Abstract: Natural languages exhibit striking regularities in their statistical structure, including notably the emergence of Zipf's and Heaps' laws. Despite this, it remains broadly unclear how these properties relate to the modern tokenisation schemes used in contemporary transformer models. In this note, we analyse the information content (as measured by the Shannon entropy) of various corpora under the assumption of a Zipfian frequency distribution, and derive a closed-form expression for the slot entropy expectation value. We then empirically investigate how byte-pair encoding (BPE) transforms corpus statistics, showing that recursive applications of BPE drive token frequencies toward a Zipfian power law while inducing a characteristic growth pattern in empirical entropy. Utilizing the ability of transformers to learn context-dependent token probability distributions, we train language models on corpora tokenised at varying BPE depths, revealing that the model predictive entropies increasingly agree with Zipf-derived predictions as the BPE depth increases. Attention-based diagnostics further indicate that deeper tokenisation reduces local token dependencies, bringing the empirical distribution closer to the weakly dependent (near IID) regime. Together, these results clarify how BPE acts not only as a compression mechanism but also as a statistical transform that reconstructs key informational properties of natural language.
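For readers unfamiliar with the two ingredients of this analysis, the sketch below shows one BPE merge step and the empirical Shannon entropy of the resulting unigram token distribution. It is a simplification under stated assumptions: it merges over a flat token sequence (production BPE usually works on word-frequency tables), the toy string is invented, and it says nothing about the paper's closed-form slot entropy.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(tokens: List[str]) -> float:
    """Empirical unigram entropy in bits: H = -sum(p * log2(p))."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def bpe_merge_once(tokens: List[str]) -> List[str]:
    """Merge every occurrence of the most frequent adjacent pair into one token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return list(tokens)
    (a, b), _ = pairs.most_common(1)[0]
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)  # the pair becomes a single new token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


chars = list("abababcd")
h_before = shannon_entropy(chars)
after = bpe_merge_once(chars)  # the most frequent pair here is ("a", "b")
h_after = shannon_entropy(after)
```

Calling `bpe_merge_once` repeatedly corresponds to increasing "BPE depth". Tracking `shannon_entropy` and the rank-frequency curve after each merge is the kind of corpus-level measurement the abstract describes, although the direction of the entropy change on a toy string like this one is not representative of real corpora.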

[15] Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models

Zhibo Hu,Chen Wang,Yanfeng Shu,Hye-young Paik,Liming Zhu

Main category: cs.CL

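TL;DR: Studies how metaphors affect the reasoning pathways of large language models, finding a strong causal relationship between metaphors in training data and the degree of cross-domain misalignment in model reasoning, and verifying the effect by intervening on metaphor use at different training stages.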

Details

Motivation: Human decision-making is influenced by metaphors, and LLM training data contain many metaphors, so the paper asks whether metaphors affect model reasoning and the misalignment that results.

Method: Introduces metaphor interventions in the pre-training, fine-tuning, and re-alignment phases, analyzes the resulting changes in cross-domain misalignment, monitors the activation of global and local latent features, and designs a misaligned-content detector.

Result: Finds a significant causal relationship between metaphors and misalignment degree; metaphors affect the activation patterns of latent features during reasoning; a detector built on these features identifies misaligned content with high accuracy.

Conclusion: Metaphors play a key role in reasoning misalignment in large language models; controlling metaphors in training data helps mitigate cross-domain misalignment, and monitoring latent features enables effective detection of misaligned content.

Abstract: Earlier research has shown that metaphors influence human decision-making, which raises the question of whether metaphors also influence the reasoning pathways of large language models (LLMs), considering their training data contain a large number of metaphors. In this work, we investigate the problem in the scope of the emergent misalignment problem where LLMs can generalize patterns learned from misaligned content in one domain to another domain. We discover a strong causal relationship between metaphors in training data and the misalignment degree of LLMs' reasoning content. With interventions using metaphors in pre-training, fine-tuning and re-alignment phases, models' cross-domain misalignment degrees change significantly. As we delve deeper into the causes behind this phenomenon, we observe that there is a connection between metaphors and the activation of global and local latent features of large reasoning models. By monitoring these latent features, we design a detector that predicts misaligned content with high accuracy.

[16] Breaking the Assistant Mold: Modeling Behavioral Variation in LLM Based Procedural Character Generation

Maan Qraitem,Kate Saenko,Bryan A. Plummer

Main category: cs.CL

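TL;DR: This paper proposes PersonaWeaver, a framework that decouples a character's world-building from its behavioral-building to counter the positive moral bias and helpful-assistant bias of existing procedural character generation, producing virtual characters with greater diversity and dramatic tension.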

Details

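Motivation: Existing large-scale character generation methods exhibit a positive moral bias and a helpful-assistant response bias, which make character behavior predictable and undramatic and limit the diversity and realism of characters in virtual worlds. Method: Proposes PersonaWeaver, which splits character construction into world-building (e.g., roles, demographics) and behavioral-building (e.g., moral stances, interactional styles) and models them separately to obtain diverse reactions; prompt engineering and controlled generation strategies introduce second-order diversity (e.g., tone, length, punctuation). Result: Generated characters show a wider range of moral stances and interaction styles and are less inclined to agree by default or answer directly, with improved behavioral and stylistic diversity along several dimensions and greater complexity and narrative potential. Conclusion: PersonaWeaver mitigates the alignment biases introduced by maximum likelihood training and assistant fine-tuning and offers a new route to character diversity in procedural content generation, supporting more expressive and dramatic virtual worlds. Abstract: Procedural content generation has enabled vast virtual worlds through levels, maps, and quests, but large-scale character generation remains underexplored. We identify two alignment-induced biases in existing methods: a positive moral bias, where characters uniformly adopt agreeable stances (e.g. always saying lying is bad), and a helpful assistant bias, where characters invariably answer questions directly (e.g. never refusing or deflecting). While such tendencies suit instruction-following systems, they suppress dramatic tension and yield predictable characters, stemming from maximum likelihood training and assistant fine-tuning. To address this, we introduce PersonaWeaver, a framework that disentangles world-building (roles, demographics) from behavioral-building (moral stances, interactional styles), yielding characters with more diverse reactions and moral stances, as well as second-order diversity in stylistic markers like length, tone, and punctuation. Code: https://github.com/mqraitem/Persona-Weaver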

[17] Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms

Ruihan Zhang,Jun Sun

Main category: cs.CL

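TL;DR: Proposes Disclaimer Injection, a method that protects data from being learned by large language models by injecting alignment-triggering disclaimers, providing data protection in a black-box setting.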

Details

Motivation: Large language models may be trained on proprietary or personal data without authorization, so data needs protection against model learning in a realistic black-box setting. Method: Designs disclaimers that trigger the model's alignment mechanisms and inserts them into the text, so that alignment-related layers stay persistently activated during training and task learning is suppressed. Result: Models fine-tuned on protected data show substantial and systematic performance degradation, indicating that the method prevents the model from learning the original content. Conclusion: This is the first use of alignment behaviour as a data protection lever, yielding a practical way to restrict data learnability at LLM scale without access to or modification of the training pipeline. Abstract: Large language models (LLMs) are increasingly trained on massive, heterogeneous text corpora, raising serious concerns about the unauthorised use of proprietary or personal data during model training. In this work, we address the problem of data protection against unwanted model learning in a realistic black-box setting. We propose Disclaimer Injection, a novel data-level defence that renders text unlearnable to LLMs. Rather than relying on model-side controls or explicit data removal, our approach exploits the models' own alignment mechanisms by injecting carefully designed alignment-triggering disclaimers that prevent effective learning. Through layer-wise analysis, we find that fine-tuning on such protected data induces persistent activation of alignment-related layers, causing alignment constraints to override task learning even on common inputs. Consequently, models trained on such data exhibit substantial and systematic performance degradation compared to standard fine-tuning. Our results identify alignment behaviour as a previously unexplored lever for data protection and, to our knowledge, present the first practical method for restricting data learnability at LLM scale without requiring access to or modification of the training pipeline.

[18] Tigrinya Number Verbalization: Rules, Algorithm, and Implementation

Fitsum Gaim,Issayas Tesfamariam

Main category: cs.CL

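TL;DR: This paper systematically formalizes the rules for verbalizing cardinal and ordinal numbers in Tigrinya, filling a gap in computational resources for the language. It provides an open-source number-to-word algorithm and shows that current large language models handle Tigrinya numbers poorly.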

Details

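Motivation: Tigrinya lacks systematic computational resources for number verbalization, and existing large language models perform poorly on the task, so explicit rule documentation and tooling are needed. Method: Documents the canonical rules for expressing numbers in Tigrinya (including the conjunction system, scale words, and special cases for dates, times, and currency), then designs and implements a formal number-to-word algorithm and releases it as open source. Result: Delivers a formal system for Tigrinya number verbalization with an open-source implementation; the evaluation shows that frontier large language models make significant errors on this task. Conclusion: The work provides a foundational resource for language modeling, speech synthesis, and accessibility applications in Tigrinya and underscores the need for explicit rule modeling in low-resource languages. Abstract: We present a systematic formalization of Tigrinya cardinal and ordinal number verbalization, addressing a gap in computational resources for the language. This work documents the canonical rules governing the expression of numerical values in spoken Tigrinya, including the conjunction system, scale words, and special cases for dates, times, and currency. We provide a formal algorithm for number-to-word conversion and release an open-source implementation. Evaluation of frontier large language models (LLMs) reveals significant gaps in their ability to accurately verbalize Tigrinya numbers, underscoring the need for explicit rule documentation. This work serves language modeling, speech synthesis, and accessibility applications targeting Tigrinya-speaking communities.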

[19] Implicit Graph, Explicit Retrieval: Towards Efficient and Interpretable Long-horizon Memory for Large Language Models

Xin Zhang,Kailai Yang,Hao Li,Chenyue Li,Qiyu Wei,Sophia Ananiadou

Main category: cs.CL

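TL;DR: Proposes LatentGraphMem, a memory framework that combines implicit graph memory with explicit subgraph retrieval to improve the question-answering performance of large language models over long contexts.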

Details

Motivation: Existing memory systems face a trade-off between efficiency and interpretability on long contexts: explicit structured memories are interpretable but brittle, while implicit latent memories are efficient and stable but hard to inspect. Method: Designs LatentGraphMem, which stores graph-structured memory in latent space for efficiency and stability and exposes a task-specific subgraph retrieval interface that returns a compact symbolic subgraph for reasoning and human inspection; training uses an explicit graph view with a frozen reasoner for question-answering supervision, and inference retrieves in latent space and externalizes only the retrieved subgraph. Result: On long-horizon benchmarks across multiple model scales, LatentGraphMem consistently outperforms representative explicit-graph and latent-memory baselines, while supporting parameter-efficient adaptation and flexible scaling to larger reasoners without introducing large symbolic artifacts. Conclusion: LatentGraphMem balances interpretability, efficiency, and stability in memory systems and suits long-context question answering with sparse, dispersed evidence. Abstract: Long-horizon applications increasingly require large language models (LLMs) to answer queries when relevant evidence is sparse and dispersed across very long contexts. Existing memory systems largely follow two paradigms: explicit structured memories offer interpretability but often become brittle under long-context overload, while latent memory mechanisms are efficient and stable yet difficult to inspect. We propose LatentGraphMem, a memory framework that combines implicit graph memory with explicit subgraph retrieval. LatentGraphMem stores a graph-structured memory in latent space for stability and efficiency, and exposes a task-specific subgraph retrieval interface that returns a compact symbolic subgraph under a fixed budget for downstream reasoning and human inspection. During training, an explicit graph view is materialized to interface with a frozen reasoner for question-answering supervision. At inference time, retrieval is performed in latent space and only the retrieved subgraph is externalized. Experiments on long-horizon benchmarks across multiple model scales show that LatentGraphMem consistently outperforms representative explicit-graph and latent-memory baselines, while enabling parameter-efficient adaptation and flexible scaling to larger reasoners without introducing large symbolic artifacts.

[20] PCoA: A New Benchmark for Medical Aspect-Based Summarization With Phrase-Level Context Attribution

Bohao Chu,Sameh Frihat,Tabea M. G. Pakull,Hendrik Damm,Meijie Li,Ula Muhabbek,Georg Lodde,Norbert Fuhr

Main category: cs.CL

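TL;DR: Introduces PCoA, an expert-annotated benchmark for medical aspect-based summarization with phrase-level context attribution, together with a fine-grained, decoupled evaluation framework, and validates its reliability for assessing the quality of generated summaries.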

Details

Motivation: Verifying system-generated summaries is difficult, especially in high-stakes medical domains, and requires precise attribution to the source context. Method: Builds PCoA, an expert-annotated dataset in which each aspect-based summary is annotated with its supporting sentences and contributory phrases, and proposes a decoupled evaluation framework that separately assesses the quality of summaries, citations, and contributory phrases. Result: Experiments confirm the quality and consistency of the PCoA dataset and benchmark several large language models on the task; identifying relevant sentences and contributory phrases improves summary quality. Conclusion: PCoA provides a reliable benchmark for generating and evaluating summaries with phrase-level context attribution, and fine-grained attribution performed before summarization helps improve summary quality. Abstract: Verifying system-generated summaries remains challenging, as effective verification requires precise attribution to the source context, which is especially crucial in high-stakes medical domains. To address this challenge, we introduce PCoA, an expert-annotated benchmark for medical aspect-based summarization with phrase-level context attribution. PCoA aligns each aspect-based summary with its supporting contextual sentences and contributory phrases within them. We further propose a fine-grained, decoupled evaluation framework that independently assesses the quality of generated summaries, citations, and contributory phrases. Through extensive experiments, we validate the quality and consistency of the PCoA dataset and benchmark several large language models on the proposed task. Experimental results demonstrate that PCoA provides a reliable benchmark for evaluating system-generated summaries with phrase-level context attribution. Furthermore, comparative experiments show that explicitly identifying relevant sentences and contributory phrases before summarization can improve overall quality. The data and code are available at https://github.com/chubohao/PCoA.

[21] Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models

Sasha Ronaghi,Chloe Stanwyck,Asad Aali,Amir Ronaghi,Miguel Fuentes,Tina Hernandez-Boussard,Emily Alsentzer

Main category: cs.CL

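TL;DR: Proposes Cross-Architecture Proxy Tuning (CAPT), a training-free model-ensembling method for adapting general-domain large models to the clinical domain. CAPT supports models with different vocabularies, injects clinically relevant information through contrastive decoding, and clearly outperforms existing methods on six clinical tasks.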

Details

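Motivation: To avoid repeating costly clinical-domain continued pretraining and fine-tuning for every new model, and to adapt the latest general-domain models to clinical use efficiently and cheaply. Method: Proposes the CAPT framework, which uses an existing clinical model as a "proxy" and injects clinical knowledge into a new general-domain model at inference time via contrastive decoding, supporting collaboration across architectures and differing vocabularies. Result: On six clinical classification and generation tasks, CAPT outperforms the general-domain and clinical models used individually as well as state-of-the-art ensembling methods such as UniTE and the original proxy tuning (on average +17.6% over UniTE and +41.4% over proxy tuning); physician case studies confirm improved accuracy and specificity of clinical language. Conclusion: CAPT offers an efficient, training-free way to use legacy clinical models to unlock the clinical capability of new general-domain models, addressing knowledge transfer and compatibility across model generations, with practical deployment value. Abstract: Adapting language models to the clinical domain through continued pretraining and fine-tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6% over UniTE, +41.4% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity.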
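
For orientation, the sketch below shows the logit arithmetic of plain proxy tuning, the baseline CAPT is compared against, under the simplifying assumption that all three models share one vocabulary. CAPT itself targets disjoint vocabularies, and the abstract does not describe how it aligns them, so this is not the CAPT algorithm; the function names and the `alpha` scaling factor are illustrative assumptions.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(
    base_logits: list[float],
    expert_logits: list[float],
    anti_expert_logits: list[float],
    alpha: float = 1.0,
) -> list[float]:
    """Shift the large model's next-token logits by the (expert - anti-expert) offset.

    Assumes all three logit vectors index the same vocabulary, which CAPT
    explicitly does not require; `alpha` is a hypothetical scaling knob.
    """
    combined = [
        b + alpha * (e - a)
        for b, e, a in zip(base_logits, expert_logits, anti_expert_logits)
    ]
    return softmax(combined)


if __name__ == "__main__":
    base = [2.0, 1.0, 0.5]          # new general-domain model
    expert = [0.5, 2.5, 0.2]        # tuned (e.g. clinical) small model
    anti_expert = [0.5, 0.5, 0.2]   # untuned counterpart of the small model
    print(proxy_tuned_distribution(base, expert, anti_expert))
```

When the expert and anti-expert agree, the offset vanishes and the base model's distribution is returned unchanged, which is the property that lets the large model keep its own reasoning and fluency on tokens where domain tuning made no difference.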

[22] The Critical Role of Aspects in Measuring Document Similarity

Eftekhar Hossain,Tarnika Hazra,Ahatesham Bhuiyan,Santu Karmaker

Main category: cs.CL

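TL;DR: ASPECTSIM is a new document similarity framework that conditions similarity judgments on an explicitly specified aspect and is more effective than the traditional holistic approach. On a newly built benchmark of 26K aspect-document pairs, ASPECTSIM implemented with direct GPT-4o prompting achieves substantially higher human-machine agreement than traditional methods. A large-scale meta-evaluation of open-source models shows that a two-stage refinement strategy improves smaller models, but they remain far below GPT-4o.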

Details

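Motivation: Traditional document similarity is measured holistically, ignoring how different aspects affect similarity, which lowers agreement with human judgments. The paper aims for a more interpretable framework, closer to human cognition, that improves document similarity modeling by introducing explicit aspects. Method: Proposes the ASPECTSIM framework, which computes document similarity conditioned on an explicitly given aspect. Experiments use a newly constructed benchmark of 26K aspect-document pairs; the initial version is implemented with direct GPT-4o prompting, followed by a large-scale meta-evaluation of 16 open-source LLMs and 9 embedding models and a two-stage refinement strategy to improve smaller models. Result: GPT-4o-based ASPECTSIM achieves about 80% higher human-machine agreement than the traditional holistic approach; directly prompting smaller LLMs works poorly (only 20-30% agreement), and two-stage refinement improves this by about 140%, yet agreement still remains well below GPT-4o. Conclusion: Explicitly accounting for aspects substantially improves document similarity measurement and standard practice should be revisited; optimization strategies help smaller models, but they still lag behind large proprietary models in capturing aspect-conditioned similarity. Abstract: We introduce ASPECTSIM, a simple and interpretable framework that requires conditioning document similarity on an explicitly specified aspect, in contrast to the traditional holistic approach to measuring document similarity. Experimenting with a newly constructed benchmark of 26K aspect-document pairs, we found that ASPECTSIM, when implemented with direct GPT-4o prompting, achieves substantially higher human-machine agreement (approximately 80% higher) than holistic similarity without explicit aspects. These findings underscore the importance of explicitly accounting for aspects when measuring document similarity and highlight the need to revise standard practice. Next, we conducted a large-scale meta-evaluation using 16 smaller open-source LLMs and 9 embedding models with a focus on making ASPECTSIM accessible and reproducible. While directly prompting LLMs to produce ASPECTSIM scores turned out to be ineffective (20-30% human-machine agreement), a simple two-stage refinement improved their agreement by approximately 140%. Nevertheless, agreement remains well below that of GPT-4o-based models, indicating that smaller open-source LLMs still lag behind large proprietary models in capturing aspect-conditioned similarity.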

[23] Grading Scale Impact on LLM-as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale

Weiyue Li,Minda Zhao,Weixuan Dong,Jiahui Cai,Yuze Wei,Michael Pocress,Yi Li,Wanyan Yuan,Xiaoyue Wang,Ruoyu Hou,Kaiyuan Lou,Wenqi Zeng,Yutong Yang,Yilun Du,Mengyu Wang

Main category: cs.CL

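TL;DR: Compares the agreement between human raters and large language model (LLM) raters and finds that the grading scale significantly affects human-LLM agreement. The 0-5 scale gives the strongest alignment when aggregated over tasks, and the study reveals systematic differences across gender groups, underscoring the importance of scale design and subgroup diagnostics.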

Details

Motivation: To examine how the grading scale affects agreement when LLMs serve as automated evaluators, addressing prior work's neglect of the scale itself. Method: Collects human and LLM ratings on three grading scales across six benchmarks and uses intraclass correlation coefficients (ICC) to measure absolute agreement, analyzing how the choice of scale affects human-LLM agreement and within-group reliability. Result: LLM ratings on subjective tasks are clearly affected by the scale; human-LLM alignment is best on the 0-5 scale; pooled reliability can mask heterogeneity across tasks and subgroups (such as gender). Conclusion: Grading-scale design significantly affects LLM-as-a-judge performance and should be combined with subgroup diagnostics to improve the fairness and reliability of evaluation protocols. Abstract: Large language models (LLMs) are increasingly used as automated evaluators, yet prior works demonstrate that these LLM judges often lack consistency in scoring when the prompt is altered. However, the effect of the grading scale itself remains underexplored. We study the LLM-as-a-judge problem by comparing two kinds of raters: humans and LLMs. We collect ratings from both groups on three scales and across six benchmarks that include objective, open-ended subjective, and mixed tasks. Using intraclass correlation coefficients (ICC) to measure absolute agreement, we find that LLM judgments are not perfectly consistent across scales on subjective benchmarks, and that the choice of scale substantially shifts human-LLM agreement, even when within-group panel reliability is high. Aggregated over tasks, the grading scale of 0-5 yields the strongest human-LLM alignment. We further demonstrate that pooled reliability can mask benchmark heterogeneity and reveal systematic subgroup differences in alignment across gender groups, strengthening the importance of scale design and sub-level diagnostics as essential components of LLM-as-a-judge protocols.
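
To make "absolute agreement" concrete, here is a minimal sketch of the two-way, single-rater, absolute-agreement ICC, often written ICC(A,1) or ICC(2,1), computed from ANOVA mean squares. The abstract does not say which ICC variant the authors use, so this choice and the toy ratings are assumptions.

```python
def icc_absolute_agreement(ratings: list[list[float]]) -> float:
    """ICC(A,1) for an n-items by k-raters matrix with no missing values.

    ICC = (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)
    where MSR, MSC, MSE are the row (item), column (rater), and residual
    mean squares of a two-way ANOVA without replication.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


if __name__ == "__main__":
    # Columns: [human score, LLM score]; values are made up for illustration.
    identical = [[1, 1], [3, 3], [5, 5]]
    offset_by_one = [[1, 2], [3, 4], [5, 6]]
    print(icc_absolute_agreement(identical))      # perfect agreement
    print(icc_absolute_agreement(offset_by_one))  # consistent but shifted
```

Unlike a plain correlation, the absolute-agreement form penalizes a rater who is systematically one point higher, which is why a change of grading scale can move human-LLM agreement even when the rank ordering of items stays the same.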

[24] Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks

Atsuki Yamaguchi,Maggie Mi,Nikolaos Aletras

Main category: cs.CL

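TL;DR: Proposes L2T, a pre-training framework that combines language learning tasks with standard next-token prediction to improve the linguistic competence of language models while remaining competitive on general reasoning tasks.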

Details

Motivation: Existing language models are pre-trained on raw text to generate sequences token by token; this supports learning world knowledge and reasoning but does not explicitly optimize for linguistic competence. Method: Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs that provide explicit linguistic stimulation, and pre-trains language models on a mixture of raw text and L2T data. Result: Overall performance on linguistic competence benchmarks improves and competence is acquired faster, while performance on general reasoning tasks stays competitive. Conclusion: The L2T framework improves the linguistic competence of language models without hurting performance on other tasks. Abstract: Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but also accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.

[25] Prompting Underestimates LLM Capability for Time Series Classification

Dan Schumacher,Erfan Nourbakhsh,Rocky Slavin,Anthony Rios

Main category: cs.CL

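TL;DR: Prompting underestimates how well large language models classify time series; linear probing shows that their internal representations in fact contain rich temporal information.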

Details

Motivation: To examine the limits of prompt-based evaluation of LLM time series classification and to determine whether the models internally encode meaningful temporal structure. Method: Compares zero-shot prompt outputs with linear probes over the same internal representations and performs layer-wise analyses of how the information is distributed and evolves within the model. Result: Zero-shot prompting performs near chance, whereas linear probes raise average F1 from 0.15-0.26 to 0.61-0.67, on par with specialized time series models; discriminative information emerges in early transformer layers and is amplified by visual and multimodal inputs. Conclusion: Current prompt-based evaluation underestimates LLMs' understanding of time series, revealing a systematic mismatch between representation and evaluation. Abstract: Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure. We show that this conclusion reflects limitations of prompt-based generation rather than the model's representational capacity by directly comparing prompt outputs with linear probes over the same internal representations. While zero-shot prompting performs near chance, linear probes improve average F1 from 0.15-0.26 to 0.61-0.67, often matching or exceeding specialized time series models. Layer-wise analyses further show that class-discriminative time series information emerges in early transformer layers and is amplified by visual and multimodal inputs. Together, these results demonstrate a systematic mismatch between what LLMs internally represent and what prompt-based evaluation reveals, leading current evaluations to underestimate their time series understanding.

[26] EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

Mingyang Wei,Dehai Min,Zewen Liu,Yuzhang Xie,Guanchen Wu,Carl Yang,Max S. Y. Lau,Qi He,Lu Cheng,Wei Jin

Main category: cs.CL

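TL;DR: EpiQAL is a new epidemiological question-answering benchmark for evaluating evidence-grounded epidemiological reasoning, with three subsets that test factual recall, multi-step inference, and conclusion reconstruction.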

Details

Motivation: Existing medical question-answering benchmarks focus on clinical knowledge or patient-level reasoning and lack a systematic evaluation of population-level epidemiological reasoning. Method: Builds the EpiQAL benchmark, with three subsets drawn from open-access literature, constructed using an expert-designed taxonomy, multi-model verification, and retrieval-based difficulty control. Result: Experiments on ten open models show that current large language models perform poorly on epidemiological reasoning, with multi-step inference the most challenging; model rankings shift across subsets and scale is not the sole determinant; Chain-of-Thought prompting helps multi-step inference but yields mixed results elsewhere. Conclusion: EpiQAL provides a fine-grained diagnostic tool for evaluating evidence grounding, inferential reasoning, and conclusion reconstruction in epidemiological reasoning. Abstract: Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The subsets respectively evaluate text-grounded factual recall, multi-step inference linking document evidence with epidemiological principles, and conclusion reconstruction with the Discussion section withheld. Construction combines expert-designed taxonomy guidance, multi-model verification, and retrieval-based difficulty control. Experiments on ten open models reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence grounding, inferential reasoning, and conclusion reconstruction.

[27] SegNSP: Revisiting Next Sentence Prediction for Linear Text Segmentation

José Isidro,Filipe Cunha,Purificação Silvano,Alípio Jorge,Nuno Guimarães,Sérgio Nunes,Ricardo Campos

Main category: cs.CL

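TL;DR: This paper proposes SegNSP, which treats linear text segmentation as a next sentence prediction (NSP) task. With a label-agnostic NSP approach, an improved loss function, and a negative sampling strategy, it detects topic boundaries without explicit topic annotations.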

Details

Motivation: Linear text segmentation remains challenging in natural language processing because topic boundaries are hard to define, discourse structure varies, and local coherence must be balanced against global context, which holds back downstream tasks such as summarization and information retrieval. Method: Proposes the SegNSP model, a label-agnostic next sentence prediction framework combined with a segmentation-aware loss and harder negative sampling to better model discourse continuity, without auxiliary topic-classification supervision. Result: Achieves a B-$F_1$ of 0.79 on CitiLink-Minutes and 0.65 on WikiSection, 0.17 absolute points above the strongest reproducible baseline, TopSeg. Conclusion: Modeling sentence-to-sentence continuity improves text segmentation quality; SegNSP performs strongly and robustly on both datasets and supports a range of downstream NLP applications. Abstract: Linear text segmentation is a long-standing problem in natural language processing (NLP), focused on dividing continuous text into coherent and semantically meaningful units. Despite its importance, the task remains challenging due to the complexity of defining topic boundaries, the variability in discourse structure, and the need to balance local coherence with global context. These difficulties hinder downstream applications such as summarization, information retrieval, and question answering. In this work, we introduce SegNSP, framing linear text segmentation as a next sentence prediction (NSP) task. Although NSP has largely been abandoned in modern pre-training, its explicit modeling of sentence-to-sentence continuity makes it a natural fit for detecting topic boundaries. We propose a label-agnostic NSP approach, which predicts whether the next sentence continues the current topic without requiring explicit topic labels, and enhance it with a segmentation-aware loss combined with harder negative sampling to better capture discourse continuity. Unlike recent proposals that leverage NSP alongside auxiliary topic classification, our approach avoids task-specific supervision. We evaluate our model against established baselines on two datasets, CitiLink-Minutes, for which we establish the first segmentation benchmark, and WikiSection. On CitiLink-Minutes, SegNSP achieves a B-$F_1$ of 0.79, closely aligning with human-annotated topic transitions, while on WikiSection it attains a B-$F_1$ of 0.65, outperforming the strongest reproducible baseline, TopSeg, by 0.17 absolute points. These results demonstrate competitive and robust performance, highlighting the effectiveness of modeling sentence-to-sentence continuity for improving segmentation quality and supporting downstream NLP applications.
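
The framing of segmentation as next sentence prediction can be sketched as follows: given, for each adjacent sentence pair, a probability that the second sentence continues the current topic, a boundary is placed wherever that probability falls below a threshold. The probabilities here are hand-written stand-ins for an NSP model's output, and the function name and 0.5 threshold are assumptions; the abstract does not specify SegNSP's decoding rule.

```python
def segment_by_continuation(
    sentences: list[str],
    continue_probs: list[float],
    threshold: float = 0.5,
) -> list[list[str]]:
    """Split sentences into segments using per-pair continuation probabilities.

    continue_probs[i] is the (assumed, model-provided) probability that
    sentences[i + 1] continues the topic of sentences[i].
    """
    if not sentences:
        return []
    if len(continue_probs) != len(sentences) - 1:
        raise ValueError("need exactly one probability per adjacent sentence pair")

    segments = [[sentences[0]]]
    for sentence, prob in zip(sentences[1:], continue_probs):
        if prob < threshold:
            segments.append([sentence])    # topic boundary: start a new segment
        else:
            segments[-1].append(sentence)  # same topic: extend current segment
    return segments


if __name__ == "__main__":
    sents = ["Budget approved.", "Vote was unanimous.", "Next, road works.", "Repairs start in May."]
    print(segment_by_continuation(sents, [0.9, 0.2, 0.8]))
```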

[28] Self-Explaining Hate Speech Detection with Moral Rationales

Francielle Vargas,Jackson Trager,Diego Alves,Surendrabikram Thapa,Matteo Guida,Berk Atil,Daryna Dementieva,Andrew Smart,Ameeta Agrawal

Main category: cs.CL

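TL;DR: This paper proposes SMRA, an explainable hate speech detection framework grounded in Moral Foundations Theory. It incorporates expert-annotated moral rationales directly into the training objective for attention, improving model performance, explanation faithfulness, and cultural contextualization.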

Details

Motivation: Existing hate speech detection models rely on surface-level lexical features, are vulnerable to spurious correlations, and lack robustness, cultural contextualization, and interpretability. Method: Building on Moral Foundations Theory, proposes the Supervised Moral Rationale Attention (SMRA) framework, which uses token-level moral rationales as a supervision signal for attention alignment, and constructs HateBRMoralXplain, a new benchmark dataset with moral annotations and metadata. Result: SMRA improves binary hate speech detection and multi-label moral sentiment classification by 0.9 and 1.5 F1 respectively; explanation faithfulness rises markedly (IoU F1 +7.4, Token F1 +5.0), explanations are more concise with better sufficiency, and fairness remains stable. Conclusion: By supervising attention with moral rationales, SMRA delivers more accurate, explainable, and culturally sensitive hate speech detection without a trade-off between performance and fairness. Abstract: Hate speech detection models rely on surface-level lexical features, increasing vulnerability to spurious correlations and limiting robustness, cultural contextualization, and interpretability. We propose Supervised Moral Rationale Attention (SMRA), the first self-explaining hate speech detection framework to incorporate moral rationales as direct supervision for attention alignment. Based on Moral Foundations Theory, SMRA aligns token-level attention with expert-annotated moral rationales, guiding models to attend to morally salient spans rather than spurious lexical patterns. Unlike prior rationale-supervised or post-hoc approaches, SMRA integrates moral rationale supervision directly into the training objective, producing inherently interpretable and contextualized explanations. To support our framework, we also introduce HateBRMoralXplain, a Brazilian Portuguese benchmark dataset annotated with hate labels, moral categories, token-level moral rationales, and socio-political metadata. Across binary hate speech detection and multi-label moral sentiment classification, SMRA consistently improves performance (e.g., +0.9 and +1.5 F1, respectively) while substantially enhancing explanation faithfulness, increasing IoU F1 (+7.4 pp) and Token F1 (+5.0 pp). Although explanations become more concise, sufficiency improves (+2.3 pp) and fairness remains stable, indicating more faithful rationales without performance or bias trade-offs.
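
One common way to supervise attention with token-level rationales is a cross-entropy term between the model's attention distribution and the normalized rationale mask, added to the classification loss. The sketch below shows that generic term only; the abstract does not give SMRA's actual loss, so the function name, the `eps` constant, and the example values are assumptions.

```python
import math


def rationale_alignment_loss(
    attention: list[float],
    rationale_mask: list[int],
    eps: float = 1e-12,
) -> float:
    """Cross-entropy between a normalized rationale mask and attention weights.

    attention: attention weights over tokens (assumed to sum to 1).
    rationale_mask: 1 for tokens annotators marked as rationale, else 0.
    Returns 0.0 when no token is marked, so unannotated examples add no penalty.
    """
    marked = sum(rationale_mask)
    if marked == 0:
        return 0.0
    target = [m / marked for m in rationale_mask]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attention))


if __name__ == "__main__":
    mask = [1, 1, 0]  # first two tokens form the annotated rationale
    print(rationale_alignment_loss([0.45, 0.45, 0.10], mask))  # attention on rationale
    print(rationale_alignment_loss([0.10, 0.10, 0.80], mask))  # attention elsewhere
```

In training, a term like this would typically be weighted and summed with the task loss, so that attention mass is pulled toward annotated spans without replacing the classification objective.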

[29] CALM: Culturally Self-Aware Language Models

Lingzhi Shen,Xiaohao Cai,Yunfei Long,Imran Razzak,Guanming Chen,Shoaib Jameel

Main category: cs.CL

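TL;DR: This paper proposes the CALM framework, which disentangles task semantics from cultural features and uses contrastive learning, cross-attention, and a Mixture-of-Experts mechanism to give language models cultural adaptation and continual reflection, improving cross-cultural understanding.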

Details

Motivation: Existing language models treat culture as static background knowledge and do not model its dynamic nature, so they perform poorly on tasks that require cultural sensitivity. Method: Proposes the CALM framework: (1) disentangle task semantics from explicit and latent cultural signals and cluster them via contrastive learning; (2) align cultural features with cross-attention; (3) integrate them adaptively along cultural dimensions through a Mixture-of-Experts; (4) build a culturally aware internal state with self-prompted reflective learning. Result: Experiments on multiple cross-cultural benchmark datasets show that CALM outperforms existing state-of-the-art methods. Conclusion: CALM endows language models with cultural self-awareness, supporting fine-grained cultural interaction and continual self-correction and improving reliability and adaptability in cross-cultural settings. Abstract: Cultural awareness in language models is the capacity to understand and adapt to diverse cultural contexts. However, most existing approaches treat culture as static background knowledge, overlooking its dynamic and evolving nature. This limitation reduces their reliability in downstream tasks that demand genuine cultural sensitivity. In this work, we introduce CALM, a novel framework designed to endow language models with cultural self-awareness. CALM disentangles task semantics from explicit cultural concepts and latent cultural signals, shaping them into structured cultural clusters through contrastive learning. These clusters are then aligned via cross-attention to establish fine-grained interactions among related cultural features and are adaptively integrated through a Mixture-of-Experts mechanism along culture-specific dimensions. The resulting unified representation is fused with the model's original knowledge to construct a culturally grounded internal identity state, which is further enhanced through self-prompted reflective learning, enabling continual adaptation and self-correction. Extensive experiments conducted on multiple cross-cultural benchmark datasets demonstrate that CALM consistently outperforms state-of-the-art methods.

[30] Submodular Evaluation Subset Selection in Automatic Prompt Optimization

Jinming Nian,Zhiyuan Peng,Hongwei Shang,Dae Hoon Park,Yi Fang

Main category: cs.CL

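TL;DR: Proposes SESS, a submodular evaluation subset selection method for automatic prompt optimization that improves prompt optimization more effectively than random or heuristic selection.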

Details

Motivation: Automatic prompt optimization usually relies on a small, randomly sampled evaluation subset; the choice of subset strongly affects performance, yet it is typically treated as an implementation detail and lacks systematic study. Method: Formulates evaluation subset selection as maximizing a monotone submodular objective and selects the subset with a greedy algorithm that carries theoretical guarantees. Result: On GSM8K, MATH, and GPQA-Diamond, prompts optimized with SESS-selected subsets outperform those obtained with random or heuristic selection. Conclusion: Principled selection of the evaluation subset matters for prompt optimization, and the submodular framework provides both theoretical support and a practical method for efficient, effective selection. Abstract: Automatic prompt optimization reduces manual prompt engineering, but relies on task performance measured on a small, often randomly sampled evaluation subset as its main source of feedback signal. Despite this, how to select that evaluation subset is usually treated as an implementation detail. We study evaluation subset selection for prompt optimization from a principled perspective and propose SESS, a submodular evaluation subset selection method. We frame selection as maximizing an objective set function and show that, under mild conditions, it is monotone and submodular, enabling greedy selection with theoretical guarantees. Across GSM8K, MATH, and GPQA-Diamond, submodularly selected evaluation subsets can yield better optimized prompts than random or heuristic baselines.
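
Greedy maximization of a monotone submodular set function can be sketched with the facility-location objective, a standard example of such a function when similarities are non-negative. The abstract does not state which objective SESS uses, so facility location is a stand-in here, and the similarity matrix is made up for illustration.

```python
def greedy_facility_location(similarity: list[list[float]], budget: int) -> list[int]:
    """Greedily pick `budget` items maximizing f(S) = sum_i max_{j in S} sim[i][j].

    With non-negative similarities, f is monotone submodular, so greedy selection
    attains at least a (1 - 1/e) fraction of the optimum (Nemhauser et al., 1978).
    """
    n = len(similarity)
    selected: list[int] = []
    best_cover = [0.0] * n  # best similarity of each item to the selected set so far

    for _ in range(min(budget, n)):
        best_gain, best_item = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j improves coverage of every item.
            gain = sum(max(similarity[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_item = gain, j
        selected.append(best_item)
        best_cover = [max(best_cover[i], similarity[i][best_item]) for i in range(n)]
    return selected


if __name__ == "__main__":
    # Two tight clusters of evaluation examples: {0, 1} and {2, 3}.
    sim = [
        [1.0, 0.9, 0.1, 0.1],
        [0.9, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.9],
        [0.1, 0.1, 0.9, 1.0],
    ]
    print(greedy_facility_location(sim, budget=2))
```

With a budget of two, the greedy rule picks one example from each cluster rather than two near-duplicates, which is the kind of coverage a random subset does not guarantee.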

[31] Beyond Perplexity: A Lightweight Benchmark for Knowledge Retention in Supervised Fine-Tuning

Soheil Zibakhsh Shabgahi,Pedram Aghazadeh,Farinaz Koushanfar

Main category: cs.CL

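TL;DR: Proposes the Knowledge Retention Test (KR-Test), a lightweight evaluation framework that distinguishes genuine knowledge acquisition from linguistic mimicry in large language models during supervised fine-tuning, making the fine-tuning process more interpretable.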

Details

Motivation: Validation perplexity is not enough to show whether a model has actually acquired knowledge; a method is needed that separates linguistic mimicry from factual internalization. Method: Designs the corpus-grounded KR-Test, which uses automatically generated contrastive examples and measures the likelihood preference between correct and incorrect continuations to assess knowledge retention, with no instruction tuning or generative decoding required. Result: Validates the reliability of KR-Test and demonstrates its diagnostic power through an analysis of LoRA fine-tuning dynamics, revealing a fine-grained dissociation between linguistic convergence and knowledge retention. Conclusion: KR-Test is an effective, lightweight evaluation tool that monitors knowledge learning during supervised fine-tuning more accurately and improves understanding of fine-tuning dynamics. Abstract: Supervised Fine-Tuning (SFT) is a standard approach for injecting domain knowledge into Large Language Models (LLMs). However, relying on validation perplexity to monitor training is often insufficient, as it confounds stylistic mimicry with genuine factual internalization. To address this, we introduce the Knowledge Retention (KR) Test, a lightweight, corpus-grounded evaluation framework designed to distinguish factual learning from linguistic mimicry. KR-Test utilizes automatically generated contrastive examples to measure likelihood preferences for correct versus incorrect continuations, requiring no instruction tuning or generative decoding. We validate the framework's integrity through a "blind vs. oracle" baseline analysis. Furthermore, we demonstrate the diagnostic capabilities of KR-Test by analyzing the training dynamics of Low-Rank Adaptation (LoRA). By exposing the fine-grained dissociation between linguistic convergence and knowledge retention, KR-Test enhances the interpretability of fine-tuning dynamics.
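
The likelihood-preference idea can be sketched as a simple win rate: for each contrastive pair, check whether the model assigns a higher total log-likelihood to the correct continuation than to the incorrect one. The per-token log-probabilities below are placeholder numbers standing in for a model's output, and the unnormalized sum is an assumption; the abstract does not say whether KR-Test applies length normalization.

```python
def prefers_correct(correct_logprobs: list[float], incorrect_logprobs: list[float]) -> bool:
    """True if the correct continuation has the higher summed log-likelihood."""
    return sum(correct_logprobs) > sum(incorrect_logprobs)


def knowledge_retention_score(pairs: list[tuple[list[float], list[float]]]) -> float:
    """Fraction of contrastive pairs where the correct continuation is preferred."""
    if not pairs:
        raise ValueError("at least one contrastive pair is required")
    wins = sum(1 for correct, incorrect in pairs if prefers_correct(correct, incorrect))
    return wins / len(pairs)


if __name__ == "__main__":
    # Each pair: (token log-probs of correct continuation, of incorrect continuation).
    pairs = [
        ([-0.1, -0.2], [-1.0, -2.0]),  # model prefers the correct fact
        ([-3.0], [-0.5]),              # model prefers the incorrect fact
    ]
    print(knowledge_retention_score(pairs))
```

Because the score only needs likelihoods of fixed continuations, it can be computed on checkpoints during fine-tuning without any generation, which matches the "no generative decoding" property the abstract highlights.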

[32] Reasoning Pattern Alignment Merging for Adaptive Reasoning

Zhaofeng Zhong,Wei Yuan,Tong Chen,Xiangyu Zhao,Quoc Viet Hung Nguyen,Hongzhi Yin

Main category: cs.CL

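TL;DR: This paper proposes Reasoning Pattern Alignment Merging (RPAM), a lightweight model merging method that combines a long chain-of-thought (Long-CoT) model with a short chain-of-thought (Short-CoT) instruction model to obtain query-adaptive, efficient reasoning, substantially reducing inference cost while maintaining performance.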

Details

Motivation: Large reasoning models perform well on complex tasks but typically generate lengthy reasoning paths for every query, wasting computation; existing speed-up methods rely on retraining or elaborate prompt design, which are costly or sensitive to the input, so a cheap and stable alternative that avoids retraining is needed. Method: Proposes RPAM, a layer-wise model merging framework based on feature alignment: it first builds a small pattern-labeled calibration set that assigns each query a suitable reasoning pattern, then optimizes layer-wise merging coefficients by aligning the merged model's intermediate representations with those of the selected model, while a contrastive objective pushes them away from the non-selected model. Result: Experiments on seven widely used reasoning benchmarks show that RPAM substantially reduces inference cost while maintaining strong reasoning performance. Conclusion: RPAM is an effective, lightweight way to speed up reasoning: model merging with feature alignment yields query-adaptive selection of the reasoning path without training from scratch and is easy to reproduce. Abstract: Recent large reasoning models (LRMs) have made substantial progress in complex reasoning tasks, yet they often generate lengthy reasoning paths for every query, incurring unnecessary computation and latency. Existing speed-up approaches typically rely on retraining the model or designing sophisticated prompting, which are either prohibitively expensive or highly sensitive to the input and prompt formulation. In this work, we study model merging as a lightweight alternative for efficient reasoning: by combining a long chain-of-thought (Long-CoT) reasoning model with a Short-CoT instruction model, we obtain an adaptive reasoner without training from scratch or requiring large-scale additional data. Building on this idea, we propose Reasoning Pattern Alignment Merging (RPAM), a layer-wise model merging framework based on feature alignment to facilitate query-adaptive reasoning. RPAM first constructs a small pattern-labeled calibration set that assigns each query an appropriate reasoning pattern. It then optimizes layer-wise merging coefficients by aligning the merged model's intermediate representations with those of the selected model, while a contrastive objective explicitly pushes them away from the non-selected model. Experiments on seven widely used reasoning benchmarks show that RPAM substantially reduces inference cost while maintaining strong performance. Upon article acceptance, we will provide open-source code to reproduce experiments for RPAM.
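
Layer-wise merging with per-layer coefficients can be sketched as a linear interpolation between the two models' weights, one coefficient per layer. The abstract states that RPAM optimizes layer-wise coefficients but not the exact merging formula, so the interpolation form, the dictionary layout, and the layer names below are assumptions; the optimization of the coefficients is not shown.

```python
def merge_layerwise(
    long_cot: dict[str, list[float]],
    short_cot: dict[str, list[float]],
    coefficients: dict[str, float],
) -> dict[str, list[float]]:
    """Interpolate two models layer by layer.

    merged[layer] = alpha * long_cot[layer] + (1 - alpha) * short_cot[layer],
    where alpha = coefficients[layer]. Weights are flat lists for simplicity.
    """
    merged = {}
    for name, long_weights in long_cot.items():
        alpha = coefficients[name]
        short_weights = short_cot[name]
        merged[name] = [
            alpha * a + (1.0 - alpha) * b for a, b in zip(long_weights, short_weights)
        ]
    return merged


if __name__ == "__main__":
    # Hypothetical two-layer models with tiny weight vectors.
    long_model = {"layer.0": [1.0, 2.0], "layer.1": [3.0, 4.0]}
    short_model = {"layer.0": [0.0, 0.0], "layer.1": [1.0, 2.0]}
    alphas = {"layer.0": 1.0, "layer.1": 0.5}  # would be learned on a calibration set
    print(merge_layerwise(long_model, short_model, alphas))
```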

[33] IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation

Hossein Hosseini Kasnavieh,Gholamreza Haffari,Chris Leckie,Adel N. Toosi

Main category: cs.CL

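TL;DR: IntroLM is a new method that lets causal language models predict their own output quality during the prefilling phase. It introduces a conditional LoRA that activates only for introspective tokens, achieving high-quality prediction while preserving the original model's behavior.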

Details

Motivation: Existing methods rely on external classifiers (such as BERT), which have limited context windows, constrained representational capacity, and extra computational overhead, making it hard to predict LLM output quality well. Method: Proposes IntroLM, which uses introspective tokens and token-conditional LoRA so that the model predicts its own output quality during prefilling, without affecting generation or the original model's behavior. Result: On question answering benchmarks, IntroLM applied to Qwen3 8B reaches a ROC AUC of 90%, 14% higher than a DeBERTa classifier; in multi-model routing systems it reduces latency by up to 33% and large-model usage by up to 50% while maintaining reliability. Conclusion: IntroLM predicts LLM output quality efficiently and accurately, outperforms external classifiers, and substantially improves the cost-performance balance of multi-model systems. Abstract: A major challenge for the operation of large language models (LLMs) is how to predict whether a specific LLM will produce sufficiently high-quality output for a given query. Existing approaches rely on external classifiers, most commonly BERT-based models, which suffer from limited context windows, constrained representational capacity, and additional computational overhead. We propose IntroLM, a method that enables causal language models to predict their own output quality during the prefilling phase without affecting generation using introspective tokens. By introducing token-conditional LoRA that activates only for the introspective token, the model learns to predict the output quality for a given query while preserving the original backbone behavior and avoiding external evaluators. On question answering benchmarks, IntroLM applied to Qwen3 8B achieves a ROC AUC of 90 percent for success prediction, outperforming a DeBERTa classifier by 14 percent. When integrated into multi-model routing systems, IntroLM achieves superior cost-performance tradeoffs, reducing latency by up to 33 percent and large model usage by up to 50 percent at matched reliability.
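
A self-predicted success probability can drive a very simple router: answer with the small model when it expects to succeed and escalate to the large model otherwise. The abstract does not describe the routing rule used in the experiments, so the threshold rule, the model labels, and the probabilities below are illustrative assumptions.

```python
def route_queries(success_probs: list[float], threshold: float = 0.7) -> list[str]:
    """Route each query to the small model if its predicted success is high enough.

    success_probs: the small model's own predicted probability of answering well,
    assumed to be produced at prefilling time before any tokens are generated.
    """
    return ["small" if p >= threshold else "large" for p in success_probs]


def large_model_usage(decisions: list[str]) -> float:
    """Fraction of queries escalated to the large model."""
    return decisions.count("large") / len(decisions) if decisions else 0.0


if __name__ == "__main__":
    probs = [0.9, 0.4, 0.7]  # made-up predictions for three queries
    decisions = route_queries(probs)
    print(decisions, large_model_usage(decisions))
```

Raising the threshold trades cost for reliability: more queries go to the large model, so sweeping it traces out the kind of cost-performance curve the abstract refers to.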

[34] Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents

Yuanchen Bei,Tianxin Wei,Xuying Ning,Yanjun Zhao,Zhining Liu,Xiao Lin,Yada Zhu,Hendrik Hamann,Jingrui He,Hanghang Tong

Main category: cs.CL

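TL;DR: This paper presents Mem-Gallery, a new benchmark for evaluating the multimodal memory of MLLM agents in long-term conversations. It contains high-quality multi-session dialogues spanning visual and textual information, together with a systematic evaluation framework, and reveals the limitations of current models in memory retention, reasoning, and knowledge management.

Details

Motivation: Existing benchmarks cannot effectively evaluate how MLLMs preserve, organize, and evolve multimodal information over long-term conversations, so a new benchmark is needed to fill this gap. Method: Builds a new benchmark dataset, Mem-Gallery, containing high-quality multi-session conversations grounded in visual and textual information, and proposes a systematic evaluation framework covering three dimensions: memory extraction and adaptation, memory reasoning, and memory knowledge management. Result: Extensive evaluation of thirteen memory systems shows that explicit multimodal information retention and memory organization are necessary, that current models have persistent deficiencies in memory reasoning and knowledge management, and that they face an efficiency bottleneck. Conclusion: Mem-Gallery offers an effective tool for evaluating multimodal long-term conversational memory, exposes key challenges in the memory capabilities of existing MLLM agents, and encourages future work on explicit memory structures and efficient management mechanisms. Abstract: Long-term memory is a critical capability for multimodal large language model (MLLM) agents, particularly in conversational settings where information accumulates and evolves over time. However, existing benchmarks either evaluate multi-session memory in text-only conversations or assess multimodal understanding within localized contexts, failing to evaluate how multimodal memory is preserved, organized, and evolved across long-term conversational trajectories. Thus, we introduce Mem-Gallery, a new benchmark for evaluating multimodal long-term conversational memory in MLLM agents. Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. Extensive benchmarking across thirteen memory systems reveals several key findings, highlighting the necessity of explicit multimodal information retention and memory organization, the persistent limitations in memory reasoning and knowledge management, as well as the efficiency bottleneck of current models.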

[35] PALM-Bench: A Comprehensive Benchmark for Personalized Audio-Language Models

Yuwen Wang,Xinyuan Qian,Tian-Hao Zhang,Jiaran Gao,Yuchen Pan,Xin Wang,Zhou Pan,Chen Wei,Yiming Wang

Main category: cs.CL

TL;DR: This paper introduces the task of Personalized Large Audio-Language Models (PALM) to address the shortcomings of existing LALMs in personalized question answering, and builds PALM-Bench, the first benchmark for evaluating personalized audio understanding in multi-speaker scenarios.

Details

Motivation: Existing large audio-language models perform poorly on personalized questions and cannot effectively use personal context for reasoning, whereas humans make judgments based on personal background. Models that understand personal context are therefore needed. Method: Formalizes the Personalized LALM (PALM) task, builds the multi-task PALM-Bench benchmark, and evaluates training-free prompting and supervised fine-tuning on open-source LALMs. Result: Experiments show that current prompting and fine-tuning strategies bring some improvement but remain limited in modeling personalized knowledge and transferring it across tasks. Conclusion: PALM opens a new direction for personalized audio-language understanding, PALM-Bench helps advance the field, and stronger personalized modeling is needed in future work. Abstract: Large Audio-Language Models (LALMs) have demonstrated strong performance in audio understanding and generation. Yet, our extensive benchmarking reveals that their behavior is largely generic (e.g., summarizing spoken content) and fails to adequately support personalized question answering (e.g., summarizing what my best friend says). In contrast, humans condition their interpretation and decision-making on each individual's personal context. To bridge this gap, we formalize the task of Personalized LALMs (PALM) for recognizing personal concepts and reasoning within personal context. Moreover, we create the first benchmark (PALM-Bench) to foster methodological advances in PALM and enable structured evaluation on several tasks across multi-speaker scenarios. Our extensive experiments on representative open-source LALMs show that existing training-free prompting and supervised fine-tuning strategies, while yielding improvements, remain limited in modeling personalized knowledge and transferring it across tasks robustly. Data and code will be released.

[36] Persona-aware and Explainable Bikeability Assessment: A Vision-Language Model Approach

Yilong Dai,Ziyi Wang,Chenguang Wang,Kexin Zhou,Yiheng Qian,Susu Xu,Xiang Yan

Main category: cs.CL

TL;DR: Proposes a persona-aware bikeability assessment framework based on a vision-language model that combines user perception with multi-granularity data on the road environment to produce explainable bikeability evaluations.

Details

Motivation: Existing perception-based bikeability assessment methods struggle to capture the complexity of road environments and the differences in users' subjective perceptions, so finer-grained modeling is needed. Method: Builds a VLM framework conditioned on a theory-grounded cyclist persona typology, uses multi-granularity supervised fine-tuning (combining expert annotations with user ratings) and AI-enabled data augmentation to generate controlled paired data, and produces persona-specific explanations through chain-of-thought reasoning. Result: Validated on 12,400 crowdsourced panoramic-image assessments from 427 cyclists, the framework predicts bikeability ratings competitively and supports explainable factor attribution. Conclusion: The method integrates heterogeneous user perceptions with multi-source data, supports highly explainable assessment of urban cycling environments, and can inform sustainable transportation planning. Abstract: Bikeability assessment is essential for advancing sustainable urban transportation and creating cyclist-friendly cities, and it requires incorporating users' perceptions of safety and comfort. Yet existing perception-based bikeability assessment approaches face key limitations in capturing the complexity of road environments and adequately accounting for heterogeneity in subjective user perceptions. This paper proposes a persona-aware Vision-Language Model framework for bikeability assessment with three novel contributions: (i) theory-grounded persona conditioning based on established cyclist typology that generates persona-specific explanations via chain-of-thought reasoning; (ii) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and (iii) AI-enabled data augmentation that creates controlled paired data to isolate infrastructure variable impacts. To test and validate this framework, we developed a panoramic image-based crowdsourcing system and collected 12,400 persona-conditioned assessments from 427 cyclists. Experimental results show that the proposed framework offers competitive bikeability rating prediction while uniquely enabling explainable factor attribution.

[37] DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing

Hongzhi Zhang,Yuanze Hu,Tinghai Zhang,Jia Fu,Tao Wang,Junwei Jing,Zhaoxin Fan,Qi Wang,Ruiming Tang,Han Li,Guorui Zhou,Kun Gai

Main category: cs.CL

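TL;DR: This paper presents DeepSynth-Eval, a benchmark for objectively evaluating the information consolidation ability of LLMs in deep research. Using high-quality survey papers as gold standards, it reverse-engineers research requests and builds "Oracle Contexts", turning subjective writing evaluation into verifiable metrics.

Details

Motivation: The post-retrieval synthesis stage lacks objective evaluation methods because open-ended writing is subjective, which makes it hard to measure how well models consolidate large contexts and fragmented evidence. Method: Uses high-quality survey papers as gold standards, reverse-engineers research requests, builds "Oracle Contexts" from their bibliographies, and proposes a fine-grained evaluation protocol comprising General Checklists (factual coverage) and Constraint Checklists (structural organization). Result: Experiments on 96 tasks show that synthesizing information from hundreds of references remains challenging. Agentic plan-and-write workflows significantly outperform single-turn generation, reducing hallucinations and better satisfying complex structural constraints. Conclusion: DeepSynth-Eval provides a quantifiable evaluation framework for information consolidation and confirms the effectiveness of step-by-step agentic workflows for long-form report generation, supporting the move of LLMs toward autonomous agents. Abstract: The evolution of Large Language Models (LLMs) towards autonomous agents has catalyzed progress in Deep Research. While retrieval capabilities are well-benchmarked, the post-retrieval synthesis stage, where agents must digest massive amounts of context and consolidate fragmented evidence into coherent, long-form reports, remains under-evaluated due to the subjectivity of open-ended writing. To bridge this gap, we introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We leverage high-quality survey papers as gold standards, reverse-engineering research requests and constructing "Oracle Contexts" from their bibliographies to isolate synthesis from retrieval noise. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization), transforming subjective judgment into verifiable metrics. Experiments across 96 tasks reveal that synthesizing information from hundreds of references remains a significant challenge. Our results demonstrate that agentic plan-and-write workflows significantly outperform single-turn generation, effectively reducing hallucinations and improving adherence to complex structural constraints.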

[38] Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models

Xukai Liu,Ye Liu,Jipeng Zhang,Yanghai Zhang,Kai Zhang,Qi Liu

Main category: cs.CL

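TL;DR: This paper studies the internal mechanisms of multi-hop reasoning in LLMs and finds that later-hop answer entities can become decodable earlier than bridge entities, a phenomenon termed "layer-order inversion". It proposes a "probabilistic recall-and-extract" framework to explain this behavior.

Details

Motivation: Although LLMs perform well on multi-hop reasoning, how they internally compose multiple facts remains unclear. An existing hypothesis holds that bridge entities are computed sequentially across layers, but whether it holds in general has yet to be verified. Method: Through systematic analysis of real-world multi-hop queries, proposes and validates the "probabilistic recall-and-extract" framework, combining probing analyses, a reinterpretation of prior decoding evidence, and an analysis of chain-of-thought gains. Result: Identifies "layer-order inversion": later-hop answer entities can become decodable at shallower layers than bridge entities, an effect that strengthens as the total number of hops grows. Empirical validation shows that the framework explains a range of multi-hop reasoning phenomena and failures. Conclusion: Multi-hop reasoning does not proceed strictly hop by hop in layer order. Instead it works through broad recall in shallow layers followed by selective extraction in deeper layers, which matters for understanding LLM reasoning mechanisms and improving model design. Abstract: Large language models (LLMs) perform well on multi-hop reasoning, yet how they internally compose multiple facts remains unclear. Recent work proposes the hop-aligned circuit hypothesis, suggesting that bridge entities are computed sequentially across layers before later-hop answers. Through systematic analyses on real-world multi-hop queries, we show that this hop-aligned assumption does not generalize: later-hop answer entities can become decodable earlier than bridge entities, a phenomenon we call layer-order inversion, which strengthens with total hops. To explain this behavior, we propose a probabilistic recall-and-extract framework that models multi-hop reasoning as broad probabilistic recall in shallow MLP layers followed by selective extraction in deeper attention layers. This framework is empirically validated through systematic probing analyses, reinterpreting prior layer-wise decoding evidence, explaining chain-of-thought gains, and providing a mechanistic diagnosis of multi-hop failures despite correct single-hop knowledge. Code is available at https://github.com/laquabe/Layer-Order-Inversion.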

[39] EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory

Ye Shen,Dun Pei,Yiqiu Guo,Junying Wang,Yijin Guo,Zicheng Zhang,Qi Jia,Jun Zhou,Guangtao Zhai

Main category: cs.CL

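TL;DR: This paper presents EvolMem, a multi-session memory benchmark grounded in cognitive psychology for systematically evaluating LLMs across multiple memory dimensions. A hybrid data synthesis framework generates multi-session dialogues with controllable complexity.

Details

Motivation: Existing benchmarks lack systematic evaluation of LLMs across diverse memory dimensions in multi-session settings, particularly for long-term memory and for different memory types (declarative and non-declarative). Method: Proposes the EvolMem benchmark, which draws on cognitive psychology to decompose memory into multiple fine-grained abilities. A hybrid data synthesis framework combining topic-initiated generation and narrative-inspired transformations generates multi-session dialogues accompanied by sample-specific evaluation guidelines. Result: Experiments show that current LLMs perform unevenly across memory dimensions, with no model leading on all of them. Agent memory mechanisms do not reliably improve performance and often have efficiency problems. Conclusion: EvolMem offers a more comprehensive and fine-grained benchmark for evaluating the multi-session memory of LLMs and agent systems, exposes the limitations of existing models and mechanisms, and encourages work on more efficient, multi-dimensional memory modeling. Abstract: Despite recent advances in understanding and leveraging long-range conversational memory, existing benchmarks still lack systematic evaluation of large language models (LLMs) across diverse memory dimensions, particularly in multi-session settings. In this work, we propose EvolMem, a new benchmark for assessing multi-session memory capabilities of LLMs and agent systems. EvolMem is grounded in cognitive psychology and encompasses both declarative and non-declarative memory, further decomposed into multiple fine-grained abilities. To construct the benchmark, we introduce a hybrid data synthesis framework that consists of topic-initiated generation and narrative-inspired transformations. This framework enables scalable generation of multi-session conversations with controllable complexity, accompanied by sample-specific evaluation guidelines. Extensive evaluation reveals that no LLM consistently outperforms others across all memory dimensions. Moreover, agent memory mechanisms do not necessarily enhance LLMs' capabilities and often exhibit notable efficiency limitations. Data and code will be released at https://github.com/shenye7436/EvolMem.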

[40] Value-Action Alignment in Large Language Models under Privacy-Prosocial Conflict

Guanyu Chen,Chenxiao Yu,Xiyang Hu

Main category: cs.CL

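TL;DR: Proposes a context-based assessment protocol for measuring value-action alignment in LLMs across privacy attitudes, prosocialness, and acceptance of data sharing, and introduces the VAAR metric to quantify that alignment.

Details

Motivation: Existing evaluations measure privacy attitudes or sharing intentions in isolation and fail to reflect how real humans decide under conflicting motivations, so a method is needed that evaluates how multiple values jointly shape actual data-sharing behavior. Method: Designs a context-based assessment protocol that administers standardized questionnaires sequentially, uses multi-group structural equation modeling (MGSEM) to analyze the paths from privacy concerns and prosocialness to data sharing, and proposes the Value-Action Alignment Rate (VAAR) to measure how closely a model's value-action alignment matches that of humans. Result: Across multiple LLMs, the study finds stable but model-specific Privacy-PSA-AoDS profiles (privacy, prosocialness, and acceptance of data sharing) and substantial heterogeneity in value-action alignment. Conclusion: The framework reveals how LLMs decide when privacy and prosocial motivations must be traded off, and VAAR provides a quantifiable basis for judging whether a model actually follows the values it expresses. Abstract: Large language models (LLMs) are increasingly used to simulate decision-making tasks involving personal data sharing, where privacy concerns and prosocial motivations can push choices in opposite directions. Existing evaluations often measure privacy-related attitudes or sharing intentions in isolation, which makes it difficult to determine whether a model's expressed values jointly predict its downstream data-sharing actions as in real human behavior. We introduce a context-based assessment protocol that sequentially administers standardized questionnaires for privacy attitudes, prosocialness, and acceptance of data sharing within a bounded, history-carrying session. To evaluate value-action alignment under competing attitudes, we use multi-group structural equation modeling (MGSEM) to identify relations from privacy concerns and prosocialness to data sharing. We propose Value-Action Alignment Rate (VAAR), a human-referenced directional agreement metric that aggregates path-level evidence for expected signs. Across multiple LLMs, we observe stable but model-specific Privacy-PSA-AoDS profiles, and substantial heterogeneity in value-action alignment.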
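
The abstract describes VAAR only as a directional agreement metric over path-level signs, so the following is a loose sketch of that general idea rather than the paper's definition. It assumes that each path has a human-referenced expected sign and that agreement is simply the share of paths whose estimated coefficient has that sign. The path names, the expected signs, and the function `directional_agreement_rate` are all hypothetical, and the real metric may also weight paths by statistical evidence.

```python
# Hypothetical human-referenced expectations: privacy concern is assumed to
# lower acceptance of data sharing, prosocialness is assumed to raise it.
EXPECTED_SIGNS = {
    ("privacy_concern", "sharing_acceptance"): -1,
    ("prosocialness", "sharing_acceptance"): +1,
}


def directional_agreement_rate(estimates, expected_signs):
    """Share of paths whose estimated coefficient has the expected sign.

    estimates: dict mapping (source, target) -> estimated path coefficient.
    A missing or zero coefficient counts as disagreement.
    """
    if not expected_signs:
        raise ValueError("expected_signs must not be empty")
    hits = 0
    for path, sign in expected_signs.items():
        if estimates.get(path, 0.0) * sign > 0:
            hits += 1
    return hits / len(expected_signs)


model_paths = {
    ("privacy_concern", "sharing_acceptance"): 0.2,   # sign contradicts expectation
    ("prosocialness", "sharing_acceptance"): 0.3,     # sign matches expectation
}
rate = directional_agreement_rate(model_paths, EXPECTED_SIGNS)
```

A purely sign-based rate ignores effect size, so two models with the same rate can still differ widely in how strongly their stated values predict their actions.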

[41] Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios

Sangyub Lee,Heedou Kim,Hyeoncheol Kim

Main category: cs.CL

TL;DR: Proposes PAS, an evaluation framework for LLMs in police operations that includes a newly built QA dataset and key metrics. Experiments show that commercial LLMs perform poorly on police-related tasks.

Details

Motivation: No dedicated framework exists for evaluating LLMs in police operations, and unverified use can lead to serious problems such as unlawful arrests and improper evidence collection. Method: Builds PAS, a systematic framework covering the entire evaluation process, creates a new QA dataset from over 8,000 official documents, and validates key metrics through statistical analysis against police expert judgments. Result: Commercial LLMs perform poorly on the new police-related tasks, particularly in providing fact-based recommendations. Conclusion: An expandable evaluation framework is needed to ensure reliable AI-driven police operations, and the PAS framework and released data support research and practice in this area. Abstract: The use of Large Language Models (LLMs) in police operations is growing, yet an evaluation framework tailored to police operations remains absent. While LLMs' responses may not always be legally incorrect, their unverified use can still lead to severe issues such as unlawful arrests and improper evidence collection. To address this, we propose PAS (Police Action Scenarios), a systematic framework covering the entire evaluation process. Applying this framework, we constructed a novel QA dataset from over 8,000 official documents and established key metrics validated through statistical analysis with police expert judgments. Experimental results show that commercial LLMs struggle with our new police-related tasks, particularly in providing fact-based recommendations. This study highlights the necessity of an expandable evaluation framework to ensure reliable AI-driven police operations. We release our data and prompt template.

[42] DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs

Shidong Cao,Hongzhan Lin,Yuxuan Gu,Ziyang Luo,Jing Ma

Main category: cs.CL

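TL;DR: Proposes DiffCoT, a diffusion-styled chain-of-thought reasoning framework that generates and retrospectively corrects intermediate steps through iterative denoising, improving the robustness and error-correction ability of mathematical reasoning.

Details

Motivation: Conventional chain-of-thought (CoT) reasoning in multi-step mathematical problem solving is vulnerable to exposure bias and error accumulation: early mistakes propagate irreversibly through autoregressive decoding and limit performance. Method: Reformulates CoT reasoning as an iterative denoising process, introduces a sliding-window diffusion mechanism that applies diffusion principles at the reasoning-step level, and designs a causal diffusion noise schedule that preserves the temporal structure of reasoning chains. Result: On three multi-step reasoning benchmarks and across several model backbones, DiffCoT outperforms existing CoT preference optimization methods and improves robustness and error correction. Conclusion: Through diffusion-style iterative correction, DiffCoT mitigates the error accumulation of conventional CoT and offers a new approach to reliable LLM reasoning. Abstract: Chain-of-Thought (CoT) reasoning improves multi-step mathematical problem solving in large language models but remains vulnerable to exposure bias and error accumulation, as early mistakes propagate irreversibly through autoregressive decoding. In this work, we propose DiffCoT, a diffusion-styled CoT framework that reformulates CoT reasoning as an iterative denoising process. DiffCoT integrates diffusion principles at the reasoning-step level via a sliding-window mechanism, enabling unified generation and retrospective correction of intermediate steps while preserving token-level autoregression. To maintain causal consistency, we further introduce a causal diffusion noise schedule that respects the temporal structure of reasoning chains. Extensive experiments on three multi-step CoT reasoning benchmarks across diverse model backbones demonstrate that DiffCoT consistently outperforms existing CoT preference optimization methods, yielding improved robustness and error-correction capability in CoT reasoning.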

[43] How Do Large Language Models Learn Concepts During Continual Pre-Training?

Barry Menglong Yao,Sha Li,Yunzhi Yao,Minqian Liu,Zaishuo Xia,Qifan Wang,Lifu Huang

Main category: cs.CL

TL;DR: This paper studies how LLMs acquire, retain, and forget concepts during continual pre-training. It introduces Concept Circuits and a graph-metric analysis of them, and reveals stage-wise dynamics of concept learning and forgetting, interference between semantically similar concepts, and cases where learning one concept facilitates learning others.

Details

Motivation: Humans understand the world through concepts. This work explores how LLMs acquire, retain, and forget abstract concepts during continual learning and how multiple concepts interact. Method: Introduces Concept Circuits, computational subgraphs associated with specific concepts, characterizes their structure with graph metrics, and links behavioral dynamics to internal circuits to analyze learning, forgetting, interference, and synergy. Result: (1) Concept circuits provide an effective signal of concept learning and forgetting; (2) a stage-wise pattern appears, with an early increase followed by gradual decrease and stabilization; (3) concepts with larger learning gains are more prone to later forgetting; (4) semantically similar concepts interfere more strongly; (5) some concepts significantly facilitate the learning of others. Conclusion: The study gives a circuit-level account of the dynamics of concept learning in LLMs and offers grounding for more interpretable and robust concept-aware training strategies. Abstract: Human beings primarily understand the world through concepts (e.g., dog), abstract mental representations that structure perception, reasoning, and learning. However, how large language models (LLMs) acquire, retain, and forget such concepts during continual pretraining remains poorly understood. In this work, we study how individual concepts are acquired and forgotten, as well as how multiple concepts interact through interference and synergy. We link these behavioral dynamics to LLMs' internal Concept Circuits, computational subgraphs associated with specific concepts, and incorporate Graph Metrics to characterize circuit structure. Our analysis reveals: (1) LLMs' concept circuits provide a non-trivial, statistically significant signal of concept learning and forgetting; (2) concept circuits exhibit a stage-wise temporal pattern during continual pretraining, with an early increase followed by gradual decrease and stabilization; (3) concepts with larger learning gains tend to exhibit greater forgetting under subsequent training; (4) semantically similar concepts induce stronger interference than weakly related ones; (5) concepts differ in their transferability, with some significantly facilitating the learning of others. Together, our findings offer a circuit-level view of concept learning dynamics and inform the design of more interpretable and robust concept-aware training strategies for LLMs.

[44] PsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health Ethics

Yaling Shen,Stephanie Fong,Yiwen Jiang,Zimu Wang,Feilong Tang,Qingyang Xu,Xiangyu Zhao,Zhongxing Xu,Jiahe Liu,Jinpeng Hu,Dominic Dwyer,Zongyuan Ge

Main category: cs.CL

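TL;DR: This paper presents PsychEthicsBench, the first principle-grounded benchmark based on Australian psychology and psychiatry guidelines, for evaluating the ethical knowledge and behavioral responses of LLMs in mental health applications. It shows that refusal rate is a poor indicator of ethical behavior and that domain-specific fine-tuning can reduce ethical robustness.

Details

Motivation: Safety evaluation of LLMs in mental health currently relies mainly on refusal-centric signals, which do not reflect the nuanced behavior required in clinical practice. Inappropriate refusals can also come across as unempathetic and discourage help-seeking, so a better-suited evaluation framework is needed. Method: Builds PsychEthicsBench, a benchmark of multiple-choice and open-ended tasks grounded in Australian psychology and psychiatry guidelines with fine-grained ethicality annotations, evaluates 14 models on it, and analyzes the relationship between refusal rate and ethical behavior as well as the effect of fine-tuning on ethical alignment. Result: Refusal rate does not reliably reflect a model's ethical performance, and safety triggers diverge significantly from clinical appropriateness. Several domain-fine-tuned specialist models are less ethically aligned than their base models. Conclusion: PsychEthicsBench provides a foundation for systematic, jurisdiction-aware evaluation of LLMs in mental health and encourages more responsible development in this domain. Abstract: The increasing integration of large language models (LLMs) into mental health applications necessitates robust frameworks for evaluating professional safety alignment. Current evaluative approaches primarily rely on refusal-based safety signals, which offer limited insight into the nuanced behaviors required in clinical practice. In mental health, clinically inadequate refusals can be perceived as unempathetic and discourage help-seeking. To address this gap, we move beyond refusal-centric metrics and introduce PsychEthicsBench, the first principle-grounded benchmark based on Australian psychology and psychiatry guidelines, designed to evaluate LLMs' ethical knowledge and behavioral responses through multiple-choice and open-ended tasks with fine-grained ethicality annotations. Empirical results across 14 models show that refusal rates are poor indicators of ethical behavior, revealing a significant divergence between safety triggers and clinical appropriateness. Notably, we find that domain-specific fine-tuning can degrade ethical robustness, as several specialized models underperform their base backbones in ethical alignment. PsychEthicsBench provides a foundation for systematic, jurisdiction-aware evaluation of LLMs in mental health, encouraging more responsible development in this domain.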

[45] OLA: Output Language Alignment in Code-Switched LLM Interactions

Juhyun Oh,Haneul Yoo,Faiz Ghifari Haznitrama,Alice Oh

Main category: cs.CL

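TL;DR: This paper presents OLA, a benchmark for evaluating output language alignment of LLMs in Korean-English code-switched contexts. Existing models show a widespread bias toward non-English responses and frequent language mismatches, and the problem generalizes to Chinese and Indonesian. The failures stem mainly from insufficient alignment rather than fundamental model limitations, and Code-Switching Aware DPO with a small amount of data improves performance substantially.

Details

Motivation: Multilingual users naturally code-switch in conversation, but current LLMs struggle to infer the expected response language from context, which leads to output language mismatches. The study aims to evaluate and improve output language alignment in realistic code-switched interactions. Method: Proposes the OLA benchmark, which covers Korean-English code-switching scenarios ranging from intra-sentential mixing to instruction-content mismatches. It tests frontier LLMs and analyzes their language bias, mid-response switching, and language intrusions, then examines the effect of chain-of-thought prompting and of alignment fine-tuning with Code-Switching Aware DPO on about 1,000 examples. Result: Existing LLMs often respond in an undesired language (with a bias toward non-English responses) and exhibit mid-response switching and language intrusions. Chain-of-thought prompting fails to correct these errors, indicating weak pragmatic reasoning, whereas Code-Switching Aware DPO with a small amount of data substantially reduces misalignment. Conclusion: Because of insufficient alignment, current LLMs cannot reliably identify users' implicit language expectations when handling code-switching. Alignment targeted at realistic multilingual interactions is needed to improve the user experience in practice. Abstract: Code-switching, alternating between languages within a conversation, is natural for multilingual users, yet poses fundamental challenges for large language models (LLMs). When a user code-switches in their prompt to an LLM, they typically do not specify the expected language of the LLM response, and thus LLMs must infer the output language from contextual and pragmatic cues. We find that current LLMs systematically fail to align with this expectation, responding in undesired languages even when cues are clear to humans. We introduce OLA, a benchmark to evaluate LLMs' Output Language Alignment in code-switched interactions. OLA focuses on Korean-English code-switching and spans scenarios from simple intra-sentential mixing to instruction-content mismatches. Even frontier models frequently misinterpret implicit language expectations, exhibiting a bias toward non-English responses. We further show this bias generalizes beyond Korean to Chinese and Indonesian pairs. Models also show instability through mid-response switching and language intrusions. Chain-of-Thought prompting fails to resolve these errors, indicating weak pragmatic reasoning about output language. However, Code-Switching Aware DPO with minimal data (about 1K examples) substantially reduces misalignment, suggesting these failures stem from insufficient alignment rather than fundamental limitations. Our results highlight the need to align multilingual LLMs with users' implicit expectations in real-world code-switched interactions.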

[46] From Chains to Graphs: Self-Structured Reasoning for General-Domain LLMs

Yingjian Chen,Haoran Liu,Yinhong Liu,Sherry T. Tong,Aosong Feng,Jinghui Lu,Juntao Zhang,Yusuke Iwasawa,Yutaka Matsuo,Irene Li

Main category: cs.CL

TL;DR: This paper proposes Self-Graph Reasoning (SGR), a new framework that lets LLMs reason over graph structures in open-domain question answering, improving reasoning consistency and performance.

Details

Motivation: Existing reasoning methods such as chain-of-thought (CoT) are linear and prone to logical inconsistency, and there is no mechanism for LLMs to build and use their own graph-structured reasoning, especially in open-domain QA. Method: Proposes the Self-Graph Reasoning (SGR) framework, in which LLMs explicitly represent their reasoning process as a graph, and builds a graph-structured reasoning dataset that merges multiple candidate reasoning graphs for training. Result: On five QA benchmarks across general and specialized domains, SGR yields a 17.74% gain over the base model, and a LLaMA-3.3-70B model fine-tuned with SGR performs comparably to GPT-4o and surpasses Claude-3.5-Haiku. Conclusion: Graph-structured reasoning improves the reasoning consistency and answer quality of LLMs, and SGR offers a new route for LLMs to build structured reasoning on their own. Abstract: Large Language Models (LLMs) show strong reasoning ability in open-domain question answering, yet their reasoning processes are typically linear and often logically inconsistent. In contrast, real-world reasoning requires integrating multiple premises and solving subproblems in parallel. Existing methods, such as Chain-of-Thought (CoT), express reasoning in a linear textual form, which may appear coherent but frequently leads to inconsistent conclusions. Recent approaches rely on externally provided graphs and do not explore how LLMs can construct and use their own graph-structured reasoning, particularly in open-domain QA. To fill this gap, we explore graph-structured reasoning of LLMs in general-domain question answering. We propose Self-Graph Reasoning (SGR), a framework that enables LLMs to explicitly represent their reasoning process as a structured graph before producing the final answer. We further construct a graph-structured reasoning dataset that merges multiple candidate reasoning graphs into refined graph structures for model training. Experiments on five QA benchmarks across both general and specialized domains show that SGR consistently improves reasoning consistency and yields a 17.74% gain over the base model. The LLaMA-3.3-70B model fine-tuned with SGR performs comparably to GPT-4o and surpasses Claude-3.5-Haiku, demonstrating the effectiveness of graph-structured reasoning.
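
To illustrate what representing reasoning as a graph rather than a chain can look like in practice, the sketch below stores a small reasoning graph as a mapping from each node to the nodes it depends on and derives a valid resolution order with the standard-library `graphlib` module. The node names and the helper `resolution_order` are hypothetical, and the abstract does not specify SGR's actual graph format.

```python
from graphlib import TopologicalSorter

# Each key is a reasoning node; its value is the set of nodes it depends on.
# Node names are made up for illustration.
reasoning_graph = {
    "premise_1": set(),
    "premise_2": set(),
    "subproblem_a": {"premise_1"},
    "subproblem_b": {"premise_1", "premise_2"},
    "answer": {"subproblem_a", "subproblem_b"},
}


def resolution_order(graph):
    """Return the nodes in an order where every dependency comes first.

    TopologicalSorter raises graphlib.CycleError if the graph is circular,
    which doubles as a basic consistency check on the reasoning structure.
    """
    return list(TopologicalSorter(graph).static_order())


order = resolution_order(reasoning_graph)
```

Unlike a linear chain, this structure makes it explicit that the two subproblems share a premise and can be solved independently before being combined into the answer.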

[47] DiVA: Fine-grained Factuality Verification with Agentic-Discriminative Verifier

Hui Huang,Muyun Yang,Yuki Arase

Main category: cs.CL

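TL;DR: This paper proposes DiVA, a new framework for fine-grained factuality verification that combines the search ability of generative models with the scoring ability of discriminative models, and builds a new benchmark, FGVeriBench, for evaluation.

Details

Motivation: Existing factuality verification research mostly makes binary judgments and cannot distinguish degrees of error severity, which limits its use in scenarios such as fine-grained evaluation and preference optimization. Method: Proposes the Agentic Discriminative Verifier (DiVA), which combines the agentic search capabilities of generative models with the precise scoring of discriminative models, and builds the FGVeriBench benchmark for fine-grained factuality verification. Result: Experiments on FGVeriBench show that DiVA significantly outperforms existing methods on factuality verification for both general and multi-hop questions. Conclusion: By combining the strengths of generative and discriminative models, DiVA achieves finer-grained and more accurate factuality verification and moves LLM factuality evaluation toward finer granularity. Abstract: Despite the significant advancements of Large Language Models (LLMs), their factuality remains a critical challenge, fueling growing interest in factuality verification. Existing research on factuality verification primarily conducts binary judgments (e.g., correct or incorrect), which fails to distinguish varying degrees of error severity. This limits its utility for applications such as fine-grained evaluation and preference optimization. To bridge this gap, we propose the Agentic Discriminative Verifier (DiVA), a hybrid framework that synergizes the agentic search capabilities of generative models with the precise scoring aptitude of discriminative models. We also construct a new benchmark, FGVeriBench, as a robust testbed for fine-grained factuality verification. Experimental results on FGVeriBench demonstrate that our DiVA significantly outperforms existing methods on factuality verification for both general and multi-hop questions.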

[48] Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation

Binh Nguyen,Thai Le

Main category: cs.CL

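TL;DR: This paper proposes a new forensic auditing framework for evaluating the reasoning robustness of audio language models (ALMs) under adversarial attacks along three dimensions: acoustic perception, cognitive coherence, and cognitive dissonance. It finds that explicit reasoning does not always improve robustness and can act as either a "shield" or a "tax", and that cognitive dissonance can serve as a "silent alarm" for potential manipulation.

Details

Motivation: Existing audio deepfake detection work focuses on the robustness of the final classification and lacks any assessment of how stable the model's reasoning is. This paper aims to fill that gap by examining the reliability of ALM reasoning under adversarial attacks. Method: Proposes a three-dimensional forensic auditing framework (acoustic perception, cognitive coherence, and cognitive dissonance) and systematically analyzes how ALM reasoning changes under different types of adversarial attacks. Result: Explicit reasoning does not always improve robustness. For models with strong acoustic perception, reasoning is protective (a "shield"); for other models it degrades performance (a "tax"), especially under linguistic attacks, which reduce cognitive coherence and raise the attack success rate. Even when classification fails, high cognitive dissonance can flag potential manipulation. Conclusion: Reasoning plays a dual role in audio deepfake detection, and its robustness depends on how well the model perceives low-level features. Cognitive dissonance can serve as an auxiliary detection signal, offering a new perspective on explainability and security. Abstract: Audio Language Models (ALMs) offer a promising shift towards explainable audio deepfake detection (ADD), moving beyond black-box classifiers by providing some level of transparency into their predictions via reasoning traces. This necessitates a new class of model robustness analysis: robustness of the predictive reasoning under adversarial attacks, which goes beyond the existing paradigm that mainly focuses on the shifts of the final predictions (e.g., fake vs. real). To analyze such reasoning shifts, we introduce a forensic auditing framework to evaluate the robustness of ALMs' reasoning under adversarial attacks in three inter-connected dimensions: acoustic perception, cognitive coherence, and cognitive dissonance. Our systematic analysis reveals that explicit reasoning does not universally enhance robustness. Instead, we observe a bifurcation: for models exhibiting robust acoustic perception, reasoning acts as a defensive "shield", protecting them from adversarial attacks. However, for others, it imposes a performance "tax", particularly under linguistic attacks which reduce cognitive coherence and increase attack success rate. Crucially, even when classification fails, high cognitive dissonance can serve as a silent alarm, flagging potential manipulation. Overall, this work provides a critical evaluation of the role of reasoning in forensic audio deepfake analysis and its vulnerabilities.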

[49] Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines

Jean Seo,Gibaeg Kim,Kihun Shin,Seungseop Lim,Hyunkyung Lee,Wooseok Han,Jongwon Lee,Eunho Yang

Main category: cs.CL

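TL;DR: EPAG is a benchmark dataset and framework for evaluating the pre-consultation ability of LLMs against diagnostic guidelines. It assesses models directly, by comparing the collected history with the guidelines, and indirectly, through disease diagnosis.

Details

Motivation: A standardized benchmark is needed to better evaluate how well LLMs handle the pre-consultation stage in real clinical settings. Method: Proposes the EPAG framework, which combines HPI-diagnostic guideline comparison with a disease diagnosis task to evaluate LLMs directly and indirectly, and builds a dedicated dataset for the experiments. Result: Small open-source models fine-tuned on a task-specific dataset can outperform frontier LLMs, more HPI information does not always improve diagnostic performance, and the language of pre-consultation affects the characteristics of the dialogue. Conclusion: EPAG provides an effective tool for evaluating and improving LLM applications in clinical pre-consultation. Abstract: We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that an increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline at https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.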

[50] Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases

Hui Huang,Xuanxin Wu,Muyun Yang,Yuki Arase

Main category: cs.CL

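TL;DR: This paper presents the first systematic comparison of whether Large Reasoning Models (LRMs) make better judges than non-reasoning LLMs. LRMs are more accurate, follow instructions better, and resist attacks better, but still show a bias toward superficial quality. The proposed PlanJudge method substantially reduces this bias by generating an explicit evaluation plan.

Details

Motivation: To determine whether LRMs are genuinely better judges than non-reasoning LLMs and to analyze their strengths and limitations systematically, especially on reasoning-intensive tasks and in resistance to bias. Method: Compares LRMs and non-reasoning LLMs empirically on judgment tasks and proposes PlanJudge, a strategy that asks the model to produce an explicit evaluation plan before carrying out the evaluation in order to reduce superficial-quality bias. Result: (1) LRMs outperform non-reasoning LLMs in judgment accuracy, particularly on reasoning-intensive tasks; (2) they follow instructions better and are more robust to adversarial attacks; (3) they still show a marked bias toward superficial quality; (4) PlanJudge mitigates this bias in both LRMs and standard LLMs. Conclusion: Although LRMs beat conventional LLMs on several judging abilities, they remain affected by superficial bias. A structured evaluation procedure such as PlanJudge markedly improves the robustness of judgments and offers a practical route to more reliable automatic evaluation. Abstract: This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior instruction-following capabilities in evaluation contexts; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) However, LRMs still exhibit strong biases toward superficial quality. To improve the robustness against biases, we propose PlanJudge, an evaluation strategy that prompts the model to generate an explicit evaluation plan before execution. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in both LRMs and standard LLMs.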

[51] Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning

Zheng Wu,Xingyu Lou,Xinbei Ma,Yansi Li,Weiwen Liu,Weinan Zhang,Jun Wang,Zhuosheng Zhang

Main category: cs.CL

TL;DR: Proposes Agent-Dice, a parameter fusion framework based on directional consensus evaluation that addresses the stability-plasticity dilemma of LLM agents in continual learning.

Details

Motivation: LLM agents are prone to catastrophic forgetting when continually learning new tasks and struggle to balance stability and plasticity. The root cause is a failure to distinguish common knowledge shared across tasks from conflicting knowledge arising between tasks. Method: Prunes conflicting gradients through geometric consensus filtering and amplifies shared semantics through curvature-based importance weighting, giving a two-stage, disentangled knowledge update. Result: Experiments on GUI-agent and tool-use-agent tasks show that Agent-Dice achieves strong continual learning performance with minimal computational overhead and few parameter updates. Conclusion: Agent-Dice addresses knowledge interference between tasks and offers both theoretical grounding and an efficient practical scheme for continual learning in LLM agents. Abstract: Large Language Model (LLM)-based agents significantly extend the utility of LLMs by interacting with dynamic environments. However, enabling agents to continually learn new tasks without catastrophic forgetting remains a critical challenge, known as the stability-plasticity dilemma. In this work, we argue that this dilemma fundamentally arises from the failure to explicitly distinguish between common knowledge shared across tasks and conflicting knowledge introduced by task-specific interference. To address this, we propose Agent-Dice, a parameter fusion framework based on directional consensus evaluation. Concretely, Agent-Dice disentangles knowledge updates through a two-stage process: geometric consensus filtering to prune conflicting gradients, and curvature-based importance weighting to amplify shared semantics. We provide a rigorous theoretical analysis that establishes the validity of the proposed fusion scheme and offers insight into the origins of the stability-plasticity dilemma. Extensive experiments in the GUI-agent and tool-use-agent domains demonstrate that Agent-Dice exhibits outstanding continual learning performance with minimal computational overhead and parameter updates.
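
The abstract names geometric consensus filtering without giving its rule. As one plausible reading, and not the paper's algorithm, the sketch below applies a simple per-coordinate sign consensus: for each parameter, it keeps only the task updates that point in the same direction as their sum and averages those. The function `consensus_merge` and the use of plain lists are illustrative assumptions, and the curvature-based importance weighting stage is omitted entirely.

```python
def consensus_merge(updates):
    """Fuse per-task update vectors by per-coordinate sign consensus.

    updates: list of equal-length lists of floats, one list per task.
    For each coordinate, the consensus direction is the sign of the summed
    updates; entries that disagree with it are dropped before averaging.
    This is an assumed stand-in for "geometric consensus filtering".
    """
    if not updates:
        raise ValueError("need at least one update vector")
    dim = len(updates[0])
    merged = []
    for i in range(dim):
        column = [u[i] for u in updates]
        total = sum(column)
        if total == 0:
            # No net direction: treat the coordinate as fully conflicting.
            merged.append(0.0)
            continue
        kept = [v for v in column if v * total > 0]
        merged.append(sum(kept) / len(kept))
    return merged


task_updates = [
    [1.0, -2.0],
    [3.0, 2.0],
    [-1.0, 4.0],
]
fused = consensus_merge(task_updates)
```

In this toy case the third task's update on the first coordinate and the first task's update on the second are discarded as conflicting, so the fused step follows the direction most tasks agree on.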

[52] LLM-MC-Affect: LLM-Based Monte Carlo Modeling of Affective Trajectories and Latent Ambiguity for Interpersonal Dynamic Insight

Yu-Zheng Lin,Bono Po-Jen Shih,John Paul Martin Encinas,Elizabeth Victoria Abraham Achom,Karan Himanshu Patel,Jesus Horacio Pacheco,Sicong Shao,Jyotikrishna Dass,Soheil Salehi,Pratik Satam

Main category: cs.CL

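TL;DR: Proposes LLM-MC-Affect, a probabilistic framework that combines large language models with Monte Carlo estimation to model emotion as a continuous distribution and capture the dynamics of emotional coupling in interpersonal interaction.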

Details

Motivation: Conventional sentiment analysis mostly treats emotion as a deterministic label and therefore fails to reflect the subjectivity, ambiguity, and temporal coupling of emotion in interaction; this work aims to establish a more realistic and quantifiable paradigm for modeling emotion dynamics. Method: The LLM-MC-Affect framework uses stochastic LLM decoding and Monte Carlo estimation to build continuous probability distributions over an affective space, and analyzes the emotional coupling between the two parties of a dialogue through sequential cross-correlation and slope-based indicators. Result: A case study on teacher-student instructional dialogues validates the method, which quantifies affective trajectories, identifies lead-follow patterns, and reveals high-level interaction features such as effective scaffolding. Conclusion: The framework provides a scalable, general-purpose tool for understanding interpersonal dynamics, applicable to education and to broader social and behavioral research. Abstract: Emotional coordination is a core property of human interaction that shapes how relational meaning is constructed in real time. While text-based affect inference has become increasingly feasible, prior approaches often treat sentiment as a deterministic point estimate for individual speakers, failing to capture the inherent subjectivity, latent ambiguity, and sequential coupling found in mutual exchanges. We introduce LLM-MC-Affect, a probabilistic framework that characterizes emotion not as a static label, but as a continuous latent probability distribution defined over an affective space. By leveraging stochastic LLM decoding and Monte Carlo estimation, the methodology approximates these distributions to derive high-fidelity sentiment trajectories that explicitly quantify both central affective tendencies and perceptual ambiguity. These trajectories enable a structured analysis of interpersonal coupling through sequential cross-correlation and slope-based indicators, identifying leading or lagging influences between interlocutors. To validate the interpretive capacity of this approach, we utilize teacher-student instructional dialogues as a representative case study, where our quantitative indicators successfully distill high-level interaction insights such as effective scaffolding. This work establishes a scalable and deployable pathway for understanding interpersonal dynamics, offering a generalizable solution that extends beyond education to broader social and behavioral research.
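
A rough sketch of the two quantities the LLM-MC-Affect abstract describes: a Monte Carlo summary of repeated sentiment samples per turn, and a lagged cross-correlation between two speakers' trajectories. The sampled scores are supplied by hand here (in the paper they would come from stochastic LLM decoding), and the function names and the use of standard deviation as the ambiguity measure are assumptions, not the paper's definitions.

```python
from statistics import mean, pstdev


def summarize_turns(samples_per_turn):
    """Per turn, return (mean score, spread) over the Monte Carlo samples.

    The spread is used here as an assumed proxy for perceptual ambiguity.
    """
    return [(mean(s), pstdev(s)) for s in samples_per_turn]


def lagged_correlation(leader, follower, lag):
    """Pearson correlation between leader[t] and follower[t + lag], lag >= 0."""
    a = leader[:len(leader) - lag]
    b = follower[lag:]
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat trajectory carries no coupling signal
    return cov / (var_a * var_b) ** 0.5


# Toy trajectories: the student echoes the teacher one turn later.
teacher = [0.1, 0.5, 0.2, 0.8, 0.3]
student = [0.0, 0.1, 0.5, 0.2, 0.8]
for lag in range(3):
    print(lag, lagged_correlation(teacher, student, lag))
```

A correlation that peaks at a positive lag is one simple way to read "the first speaker leads"; the paper's slope-based indicators are not modeled here.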

[53] ELO: Efficient Layer-Specific Optimization for Continual Pretraining of Multilingual LLMs

HanGyeol Yoo,ChangSu Choi,Minjun Kim,Seohyun Song,SeungWoo Song,Inho Won,Jongyoul Park,Cheoneum Park,KyungTae Lim

Main category: cs.CL

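TL;DR: Proposes efficient layer-specific optimization (ELO), a method that strengthens continual pretraining for specific languages in multilingual LLMs, substantially speeding up training while preserving source-language performance.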

Details

Motivation: Conventional continual pretraining of multilingual LLMs is computationally expensive and degrades source-language performance, so a more efficient way to train for specific languages is needed. Method: ELO has two stages: during continual pretraining, only the critical first and last layers are trained; the newly trained layers are then reintegrated through layer alignment, followed by a brief full-model fine-tuning step to align the parameters. Result: ELO achieves up to a 6.46x training speedup over existing methods and improves target-language performance by up to 6.2%, while effectively preserving capabilities in source languages such as English. Conclusion: ELO is an efficient and effective continual pretraining strategy that improves multilingual LLM support for specific languages while substantially reducing resource consumption. Abstract: We propose an efficient layer-specific optimization (ELO) method designed to enhance continual pretraining (CP) for specific languages in multilingual large language models (MLLMs). This approach addresses the common challenges of high computational cost and degradation of source language performance associated with traditional CP. The ELO method consists of two main stages: (1) ELO Pretraining, where a small subset of specific layers, identified in our experiments as the critically important first and last layers, are detached from the original MLLM and trained with the target language. This significantly reduces not only the number of trainable parameters but also the total parameters computed during the forward pass, minimizing GPU memory consumption and accelerating the training process. (2) Layer Alignment, where the newly trained layers are reintegrated into the original model, followed by a brief full fine-tuning step on a small dataset to align the parameters. Experimental results demonstrate that the ELO method achieves a training speedup of up to 6.46 times compared to existing methods, while improving target language performance by up to 6.2% on qualitative benchmarks and effectively preserving source language (English) capabilities.

[54] SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation

Gengyang Li,Wang Cai,Yifeng Gao,Yunfang Wu

Main category: cs.CL

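TL;DR: SyncThink is a training-free decoding method that monitors the model's own reasoning-transition signal to cut redundancy in Chain-of-Thought (CoT) prompting, lowering inference cost while achieving higher accuracy and efficiency across multiple tasks.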

Details

Motivation: CoT prompting improves reasoning but often produces long, repetitive reasoning traces that substantially increase inference cost, so an efficient, training-free way to reduce this redundancy is needed. Method: Building on the observation that answer tokens attend only weakly to early reasoning and concentrate instead on the special token "/think", SyncThink uses this signal to monitor the model's reasoning-state transition and terminates reasoning at the appropriate point. Result: On GSM8K, MMLU, GPQA, and BBH with three DeepSeek-R1 distilled models, SyncThink reaches 62.00% average Top-1 accuracy using 656 generated tokens and 28.68 s latency, versus 61.22%, 2141 tokens, and 92.01 s for full CoT decoding; on long-horizon tasks such as GPQA, accuracy improves by up to +8.1 points. Conclusion: SyncThink effectively reduces redundancy in CoT reasoning, lowers inference cost, and prevents over-thinking, maintaining or even improving model performance while increasing efficiency. Abstract: Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token "/think", indicating an information bottleneck. Building on this observation, SyncThink monitors the model's own reasoning-transition signal and terminates reasoning. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy using 656 generated tokens and 28.68 s latency, compared to 61.22 percent, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 absolute accuracy by preventing over-thinking.
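
The headline SyncThink numbers are easier to compare as ratios. The short calculation below uses only the figures quoted in the abstract; the variable names are illustrative.

```python
# Figures quoted in the SyncThink abstract (averages over the benchmarks).
sync_tokens, full_tokens = 656, 2141
sync_latency_s, full_latency_s = 28.68, 92.01
sync_accuracy, full_accuracy = 62.00, 61.22

# Fraction of generated tokens avoided relative to full CoT decoding.
token_reduction = 1 - sync_tokens / full_tokens
# How many times faster the average response is.
speedup = full_latency_s / sync_latency_s
# Difference in average Top-1 accuracy, in percentage points.
accuracy_gain = sync_accuracy - full_accuracy

print(f"token reduction: {token_reduction:.1%}")
print(f"latency speedup: {speedup:.2f}x")
print(f"accuracy change: {accuracy_gain:+.2f} points")
```

In other words, the reported averages correspond to roughly 69% fewer generated tokens and about a 3.2x lower latency, with a 0.78-point accuracy gain.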

Haonan Chen,Sicheng Gao,Radu Timofte,Tetsuya Sakai,Zhicheng Dou

Main category: cs.CL

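TL;DR: Proposes e5-omni, a lightweight explicit alignment recipe for improving multimodal embedding models that addresses inconsistent similarity scales, ineffective negatives, and mismatched cross-modal geometry in existing methods.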

Details

Motivation: Existing multimodal embedding models rely on the implicit alignment of pretrained vision-language models, which leads to similarity scores on inconsistent scales, less effective negatives, and mismatched embedding distributions. Method: e5-omni combines three components: modality-aware temperature calibration, a controllable negative curriculum with debiasing, and batch whitening with covariance regularization. Result: Experiments on MMEB-V2 and AudioCaps show that e5-omni delivers consistent gains over strong baselines and transfers to other VLM backbones. Conclusion: Through explicit alignment, e5-omni effectively addresses key problems in multimodal embeddings and improves the stability and performance of cross-modal retrieval. Abstract: Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.
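
To make the first e5-omni component concrete, here is a minimal sketch of what modality-aware temperature calibration could look like: each candidate's similarity is divided by a temperature chosen by its modality before the softmax. The temperature values, the dictionary `TEMPERATURE`, and the function name are hypothetical; the abstract does not say how the temperatures are parameterized or learned.

```python
import math

# Hypothetical per-modality temperatures (illustrative values only).
TEMPERATURE = {"text": 0.05, "image": 0.07, "audio": 0.10}


def calibrated_probs(similarities, modalities):
    """Softmax over candidates after scaling each similarity by the
    temperature assigned to that candidate's modality."""
    logits = [s / TEMPERATURE[m] for s, m in zip(similarities, modalities)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two candidates with the same raw cosine similarity but different
# modalities end up with different calibrated probabilities.
print(calibrated_probs([0.5, 0.5], ["text", "audio"]))
```

The point of the sketch is only that a single shared temperature would treat both candidates identically, whereas per-modality temperatures let the score scale differ by modality.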

[56] eTracer: Towards Traceable Text Generation via Claim-Level Grounding

Bohao Chu,Qianli Wang,Hendrik Damm,Hui Wang,Ula Muhabbek,Elisabeth Livingstone,Christoph M. Friedrich,Norbert Fuhr

Main category: cs.CL

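TL;DR: eTracer is a plug-and-play framework that makes text generation traceable through post-hoc grounding against contextual evidence, improving the verifiability and trustworthiness of system-generated content in the biomedical domain.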

Details

Motivation: In the high-stakes biomedical domain, efficiently verifying system-generated responses is a key challenge, and conventional grounding methods align poorly at the sentence level. Method: The eTracer framework applies claim-level post-hoc grounding, aligning each generated claim with contextual evidence that supports or contradicts it and quantifying response faithfulness. Result: Experiments show that the method substantially improves overall grounding quality and user verification efficiency, overcoming the limitations of conventional approaches. Conclusion: Through fine-grained claim-level grounding, eTracer improves the traceability, faithfulness, and trustworthiness of generated text, offering an effective solution for reliable text generation in high-stakes domains. Abstract: How can system-generated responses be efficiently verified, especially in the high-stakes biomedical domain? To address this challenge, we introduce eTracer, a plug-and-play framework that enables traceable text generation by grounding claims against contextual evidence. Through post-hoc grounding, each response claim is aligned with contextual evidence that either supports or contradicts it. Building on claim-level grounding results, eTracer not only enables users to precisely trace responses back to their contextual source but also quantifies response faithfulness, thereby enabling the verifiability and trustworthiness of generated responses. Experiments show that our claim-level grounding approach alleviates the limitations of conventional grounding methods in aligning generated statements with contextual sentence-level evidence, resulting in substantial improvements in overall grounding quality and user verification efficiency. The code and data are available at https://github.com/chubohao/eTracer.
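
The eTracer abstract says claim-level grounding results are used to quantify faithfulness but does not give the formula. One simple possibility, shown purely as an assumption, is the share of claims whose aligned evidence supports them. The record layout, the label names, and the example claims below are all hypothetical.

```python
def faithfulness(grounded_claims):
    """Share of claims labelled as supported by contextual evidence.

    Each record is a dict with a hypothetical 'label' field taking the values
    'supported', 'contradicted', or 'unverified'.
    """
    if not grounded_claims:
        return 0.0
    supported = sum(1 for c in grounded_claims if c["label"] == "supported")
    return supported / len(grounded_claims)


# Made-up grounding output for a four-claim response.
response_claims = [
    {"claim": "Drug A lowers blood pressure.", "evidence": "sentence 3", "label": "supported"},
    {"claim": "Drug A is taken once daily.", "evidence": "sentence 7", "label": "supported"},
    {"claim": "Drug A has no side effects.", "evidence": "sentence 9", "label": "contradicted"},
    {"claim": "Drug A was approved in 2010.", "evidence": None, "label": "unverified"},
]
print(faithfulness(response_claims))
```

Keeping the evidence pointer next to each claim is what makes the response traceable: a reader can jump from any claim to the sentence that supports or contradicts it.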

[57] DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management

Zhitong Chen,Kai Yin,Xiangjue Dong,Chengkai Liu,Xiangpeng Li,Yiming Xiao,Bo Li,Junwei Ma,Ali Mostafavi,James Caverlee

Main category: cs.CL

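TL;DR: DisastQA is a large-scale question-answering benchmark for disaster management, with 3,000 rigorously verified questions across eight disaster types; it supports model evaluation under multiple evidence conditions and reveals that current models degrade in noisy settings.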

Details

Motivation: Existing QA benchmarks are mostly built on clean evidence and do not reflect the uncertain and conflicting information typical of disaster management, so a more realistic evaluation benchmark is needed. Method: DisastQA is built through a human-LLM collaboration pipeline with stratified sampling for balanced coverage; models are evaluated under several evidence conditions, from closed-book to noisy evidence; and a human-verified, keypoint-based evaluation protocol is proposed for open-ended questions. Result: Across 20 evaluated models, recent open-weight models approach proprietary systems in clean settings but degrade sharply under noise, and overall results diverge clearly from general-purpose leaderboards such as MMLU-Pro. Conclusion: DisastQA gives a more realistic assessment of model reasoning in disaster management, exposes reliability gaps in how current models handle uncertain and conflicting information, and underscores the need for greater robustness. Abstract: Accurate question answering (QA) in disaster management requires reasoning over uncertain and conflicting information, a setting poorly captured by existing benchmarks built on clean evidence. We introduce DisastQA, a large-scale benchmark of 3,000 rigorously verified questions (2,000 multiple-choice and 1,000 open-ended) spanning eight disaster types. The benchmark is constructed via a human-LLM collaboration pipeline with stratified sampling to ensure balanced coverage. Models are evaluated under varying evidence conditions, from closed-book to noisy evidence integration, enabling separation of internal knowledge from reasoning under imperfect information. For open-ended QA, we propose a human-verified keypoint-based evaluation protocol emphasizing factual completeness over verbosity. Experiments with 20 models reveal substantial divergences from general-purpose leaderboards such as MMLU-Pro. While recent open-weight models approach proprietary systems in clean settings, performance degrades sharply under realistic noise, exposing critical reliability gaps for disaster response. All code, data, and evaluation resources are available at https://github.com/TamuChen18/DisastQA_open.
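
A keypoint-based score of the kind the DisastQA abstract describes rewards covering the required facts rather than writing more. The sketch below uses naive case-insensitive substring matching as a stand-in for the paper's human-verified matching, which is not specified in the abstract; the function name and the example keypoints are made up.

```python
def keypoint_coverage(answer, keypoints):
    """Fraction of reference keypoints mentioned in the answer.

    Matching is plain case-insensitive substring search, an assumed
    simplification of the benchmark's actual protocol.
    """
    if not keypoints:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kp in keypoints if kp.lower() in text)
    return hits / len(keypoints)


keypoints = ["higher ground", "flooded roads", "emergency kit"]
short_answer = "Move to higher ground and avoid flooded roads."
padded_answer = short_answer + " Floods are dangerous. Stay calm. " * 5

# Extra length without new keypoints does not change the score.
print(keypoint_coverage(short_answer, keypoints))
print(keypoint_coverage(padded_answer, keypoints))
```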

[58] NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models

Weiqi Liu,Yongliang Miao,Haiyan Zhao,Yanguang Liu,Mengnan Du

Main category: cs.CL

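TL;DR: Proposes NeuronScope, a multi-agent framework that explains polysemantic neurons in large language models through an iterative, activation-guided process.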

Details

Motivation: Existing single-pass interpretation methods struggle to capture polysemanticity accurately, leaving neuron behavior poorly understood. Method: Neuron interpretation is reformulated as an iterative, activation-guided process in which a multi-agent framework decomposes neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines the explanations using activation feedback. Result: Experiments show that NeuronScope uncovers hidden polysemanticity and that its explanations correlate significantly better with neuron activations than single-pass baselines. Conclusion: NeuronScope improves the understanding of polysemantic neurons in large language models and opens a new route for fine-grained interpretability research. Abstract: Neuron-level interpretation in large language models (LLMs) is fundamentally challenged by widespread polysemanticity, where individual neurons respond to multiple distinct semantic concepts. Existing single-pass interpretation methods struggle to faithfully capture such multi-concept behavior. In this work, we propose NeuronScope, a multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process. NeuronScope explicitly deconstructs neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines each explanation using neuron activation feedback. Experiments demonstrate that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation compared to single-pass baselines.

[59] Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis

Yifan Wei,Li Du,Xiaoyan Yu,Yang Feng,Angsheng Li

Main category: cs.CL

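TL;DR: Proposes STEPS, a framework that uses a skill taxonomy and entropy-based post-training data synthesis to improve LLM performance on compositional generalization tasks.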

Details

Motivation: To address the limits of LLMs and agent-based systems in compositional generalization, in particular the bottleneck caused by sparse data for complex skill combinations (a long-tailed distribution). Method: Structural information theory is used to build an interpretable, hierarchical skill taxonomy, and data synthesis is cast as a constrained information maximization problem that selects skill combinations maximizing structural information while remaining semantically coherent. Result: On challenging instruction-following benchmarks, STEPS outperforms existing data synthesis methods and shows stronger compositional generalization on downstream agent-based tasks. Conclusion: By synthesizing high-information skill-combination data in an organized way, STEPS improves compositional generalization and demonstrates the value of structured data synthesis. Abstract: Large Language Models (LLMs) and agent-based systems often struggle with compositional generalization due to a data bottleneck in which complex skill combinations follow a long-tailed, power-law distribution, limiting both instruction-following performance and generalization in agent-centric tasks. To address this challenge, we propose STEPS, a Skill Taxonomy guided Entropy-based Post-training data Synthesis framework for generating compositionally challenging data. STEPS explicitly targets compositional generalization by uncovering latent relationships among skills and organizing them into an interpretable, hierarchical skill taxonomy using structural information theory. Building on this taxonomy, we formulate data synthesis as a constrained information maximization problem, selecting skill combinations that maximize marginal structural information within the hierarchy while preserving semantic coherence. Experiments on challenging instruction-following benchmarks show that STEPS outperforms existing data synthesis baselines, while also yielding improved compositional generalization in downstream agent-based evaluations.

[60] From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs

Shaojie Wang,Liang Zhang

Main category: cs.CL

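TL;DR: Proposes FSLR, a lightweight training framework focused on improving LLMs' understanding of logical relationships in mathematical problem solving; it clearly outperforms CoT-SFT in both accuracy and efficiency.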

Details

Motivation: In mathematical reasoning, LLMs rely mainly on pattern matching and memorization and lack genuine logical reasoning ability; in particular, a weak understanding of logical relationships leads to high error rates. Method: First-Step Logical Reasoning (FSLR) trains the model only to identify the first step of the solution (which variables to use and which operation to apply), thereby explicitly supervising the understanding of logical relationships. Result: FSLR outperforms CoT-SFT across multiple models and datasets, with average gains of 3.2% in-distribution and 4.6% out-of-distribution, 4-6x faster training, and over 80% fewer training tokens. Conclusion: FSLR improves LLMs' understanding of logical relationships, addresses a key bottleneck of existing methods in logical reasoning, and is efficient to train. Abstract: Recent studies reveal that large language models (LLMs) exhibit limited logical reasoning abilities in mathematical problem-solving, instead often relying on pattern-matching and memorization. We systematically analyze this limitation, focusing on logical relationship understanding, which is a core capability underlying genuine logical reasoning, and reveal that errors related to this capability account for over 90% of incorrect predictions, with Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) failing to substantially reduce these errors. To address this bottleneck, we propose First-Step Logical Reasoning (FSLR), a lightweight training framework targeting logical relationship understanding. Our key insight is that the first planning step (identifying which variables to use and which operation to apply) encourages the model to derive logical relationships directly from the problem statement. By training models on this isolated step, FSLR provides explicit supervision for logical relationship understanding, unlike CoT-SFT which implicitly embeds such relationships within complete solution trajectories. Extensive experiments across multiple models and datasets demonstrate that FSLR consistently outperforms CoT-SFT under both in-distribution and out-of-distribution settings, with average improvements of 3.2% and 4.6%, respectively. Moreover, FSLR achieves 4-6x faster training and reduces training token consumption by over 80%.

[61] Evaluation Framework for AI Creativity: A Case Study Based on Story Generation

Pharath Sathya,Yin Jou Huang,Fei Cheng

Main category: cs.CL

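TL;DR: Proposes a structured evaluation framework for AI story generation with four dimensions (Novelty, Value, Adherence, and Resonance), validated through "Spike Prompting" and a crowdsourced study of 115 readers, and shows that creativity judgments are hierarchical and shaped by reflection.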

Details

Motivation: Existing reference-based metrics cannot capture the subjective nature of creativity, leaving the evaluation of creative text generation inadequate. Method: A structured evaluation framework with four main components and eleven sub-components is combined with controlled generation (Spike Prompting) and a crowdsourced human rating study to analyze how different creative factors affect immediate and reflective human judgments. Result: Creativity is evaluated hierarchically rather than cumulatively, with different dimensions dominating at different stages of judgment; reflective evaluation substantially changes both the ratings and inter-rater agreement. Conclusion: The proposed framework reveals dimensions of creativity that conventional reference-based metrics obscure and supports a more comprehensive, human-centered approach to evaluation. Abstract: Evaluating creative text generation remains a challenge because existing reference-based metrics fail to capture the subjective nature of creativity. We propose a structured evaluation framework for AI story generation comprising four components (Novelty, Value, Adherence, and Resonance) and eleven sub-components. Using controlled story generation via "Spike Prompting" and a crowdsourced study of 115 readers, we examine how different creative components shape both immediate and reflective human creativity judgments. Our findings show that creativity is evaluated hierarchically rather than cumulatively, with different dimensions becoming salient at different stages of judgment, and that reflective evaluation substantially alters both ratings and inter-rater agreement. Together, these results support the effectiveness of our framework in revealing dimensions of creativity that are obscured by reference-based evaluation.

[62] RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

Quy-Anh Dang,Chris Ngo,Truong-Son Hy

Main category: cs.CL

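TL;DR: Introduces RedBench, a unified dataset for evaluating adversarial prompts that aggregates 37 benchmark datasets with 29,362 samples across 22 risk categories and 19 domains, aimed at improving robustness evaluation of LLMs in safety-critical applications.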

Details

Motivation: Existing red-teaming datasets suffer from inconsistent risk categorization, limited domain coverage, and outdated evaluations, which makes it hard to assess LLM vulnerabilities systematically. Method: RedBench, a universal dataset, aggregates 37 benchmark datasets from leading conferences and repositories and applies a standardized taxonomy of 22 risk categories and 19 domains, providing a unified evaluation framework. Result: RedBench contains 29,362 attack and refusal prompt samples, provides baseline results for modern LLMs, and is released with open-source data and evaluation code. Conclusion: RedBench supports consistent and comprehensive assessment of LLM vulnerabilities and promotes research on, and deployment of, secure and reliable LLMs. Abstract: As large language models (LLMs) become integral to safety-critical applications, ensuring their robustness against adversarial prompts is paramount. However, existing red teaming datasets suffer from inconsistent risk categorizations, limited domain coverage, and outdated evaluations, hindering systematic vulnerability assessments. To address these challenges, we introduce RedBench, a universal dataset aggregating 37 benchmark datasets from leading conferences and repositories, comprising 29,362 samples across attack and refusal prompts. RedBench employs a standardized taxonomy with 22 risk categories and 19 domains, enabling consistent and comprehensive evaluations of LLM vulnerabilities. We provide a detailed analysis of existing datasets, establish baselines for modern LLMs, and open-source the dataset and evaluation code. Our contributions facilitate robust comparisons, foster future research, and promote the development of secure and reliable LLMs for real-world deployment. Code: https://github.com/knoveleng/redeval

[63] ADEPT: Adaptive Dynamic Early-Exit Process for Transformers

Sangmin Yoo,Srikanth Malla,Chiho Choi,Wei D. Lu,Joon Hee Choi

Main category: cs.CL

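TL;DR: Proposes ADEPT, an adaptive dynamic early-exit mechanism for Transformers that enables early exit in both the prefill and generation phases and improves KV cache generation by decoupling sequential dependencies in skipped layers, yielding up to 25% higher efficiency in language generation and a 4x speed-up on downstream classification with up to a 45% performance improvement.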

Details

Motivation: Existing early-exit strategies apply only to the first token in the generation phase or at the prompt level in the prefill phase, so the KV cache for skipped layers becomes a bottleneck for subsequent token generation and limits the benefit of early exit. Method: ADEPT (Adaptive Dynamic Early-exit Process for Transformers) adjusts computation dynamically according to token complexity through an adaptive token-level early-exit mechanism, and improves KV generation by decoupling sequential dependencies in skipped layers. Result: Experiments show that ADEPT improves efficiency by up to 25% on language generation tasks and achieves a 4x speed-up on downstream classification tasks, with performance gains of up to 45%. Conclusion: ADEPT overcomes the limitations of existing early-exit strategies, enabling more efficient dynamic early exit and clearly improving inference efficiency and performance. Abstract: The inference of large language models imposes significant computational workloads, often requiring the processing of billions of parameters. Although early-exit strategies have proven effective in reducing computational demands by halting inference earlier, they apply either to only the first token in the generation phase or at the prompt level in the prefill phase. Thus, the Key-Value (KV) cache for skipped layers remains a bottleneck for subsequent token generation, limiting the benefits of early exit. We introduce ADEPT (Adaptive Dynamic Early-exit Process for Transformers), a novel approach designed to overcome this issue and enable dynamic early exit in both the prefill and generation phases. The proposed adaptive token-level early-exit mechanism adjusts computation dynamically based on token complexity, optimizing efficiency without compromising performance. ADEPT further enhances the KV generation procedure by decoupling sequential dependencies in skipped layers, making token-level early exit more practical. Experimental results demonstrate that ADEPT improves efficiency by up to 25% in language generation tasks and achieves a 4x speed-up in downstream classification tasks, with up to a 45% improvement in performance.
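
Token-level early exit can be pictured as each token stopping at the first layer where some exit criterion is met. The sketch below assumes a simple confidence threshold as that criterion, which the ADEPT abstract does not confirm, and it does not model the KV-cache decoupling for skipped layers. Function names, the threshold, and the confidence values are illustrative.

```python
def exit_depth(layer_confidences, threshold=0.9):
    """Number of layers a token runs before exiting.

    layer_confidences[i] is a hypothetical confidence score available after
    layer i + 1; the token exits at the first layer reaching the threshold,
    or runs the full stack if none does.
    """
    for depth, confidence in enumerate(layer_confidences, start=1):
        if confidence >= threshold:
            return depth
    return len(layer_confidences)


def compute_saved(per_token_confidences, threshold=0.9):
    """Fraction of layer evaluations skipped across all tokens."""
    full = sum(len(c) for c in per_token_confidences)
    used = sum(exit_depth(c, threshold) for c in per_token_confidences)
    return 1 - used / full


# An "easy" token exits after two of three layers; a "hard" token runs all three.
tokens = [[0.2, 0.95, 0.99], [0.1, 0.2, 0.3]]
print([exit_depth(c) for c in tokens])
print(compute_saved(tokens))
```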

Hengxing Cai,Yijie Rao,Ligang Huang,Zanyang Zhong,Jinhan Dong,Jingjun Tan,Wenhao Lu,Renxin Zhong

Main category: cs.CL

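TL;DR: Proposes AirNav, a large-scale UAV vision-language navigation benchmark built from real urban aerial data, and AirVLN-R1, a model that combines supervised and reinforcement fine-tuning to improve performance.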

Details

Motivation: Existing UAV vision-language navigation datasets depend on virtual environments, use unnatural instructions, and are limited in scale. Method: The large-scale AirNav dataset is built from real environments, and the AirVLN-R1 model combines Supervised Fine-Tuning with Reinforcement Fine-Tuning. Result: AirVLN-R1 performs well on the new benchmark, and preliminary real-world tests confirm its feasibility. Conclusion: AirNav and AirVLN-R1 address the limitations of existing datasets and advance UAV vision-language navigation in real-world settings. Abstract: Existing Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) datasets face issues such as dependence on virtual environments, lack of naturalness in instructions, and limited scale. To address these challenges, we propose AirNav, a large-scale UAV VLN benchmark constructed from real urban aerial data, rather than synthetic environments, with natural and diverse instructions. Additionally, we introduce AirVLN-R1, which combines Supervised Fine-Tuning and Reinforcement Fine-Tuning to enhance performance and generalization. The feasibility of the model is preliminarily evaluated through real-world tests. Our dataset and code are publicly available.

[65] Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR

Yunhao Liang,Ruixuan Ying,Bo Li,Hong Li,Kai Yan,Qingwen Li,Min Yang,Okamoto Satoshi,Zhe Cui,Shiwen Ni

Main category: cs.CL

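TL;DR: Examines DeepSeek-OCR's true OCR ability when language priors are removed and finds that performance drops from about 90% to 20%, showing that the model relies heavily on language priors, is prone to hallucination with long contexts and few visual tokens, and collapses at around 10,000 text tokens.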

Details

Motivation: To determine whether DeepSeek-OCR's strong performance comes from visual ability or from language priors, and thereby clarify its actual capability boundaries and potential risks in long-context scenarios. Method: Sentence-level and word-level semantic corruption is used to separate the model's intrinsic OCR ability from the influence of language priors, with comparative evaluation against 13 baseline models and tests across different visual token counts and context lengths. Result: Without linguistic support, DeepSeek-OCR's performance falls from about 90% to 20%; traditional two-stage (pipeline) OCR methods are more robust to semantic perturbation; fewer visual tokens mean stronger reliance on priors and higher hallucination risk; and the model collapses at around 10,000 text tokens. Conclusion: DeepSeek-OCR's performance depends heavily on language priors rather than genuine optical recognition, current optical compression methods may aggravate the long-context bottleneck, and future designs need to balance the visual and language modules to improve robustness. Abstract: DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch - which drives DeepSeek-OCR's performance?" By employing sentence-level and word-level semantic corruption, we isolate the model's intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR's performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR's capability boundaries and offers essential insights for future optimizations of the vision-text compression paradigm. We release all data, results and scripts used in this study at https://github.com/dududuck00/DeepSeekOCR.
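
The abstract does not say exactly how the sentence-level and word-level corruptions are built. As one plausible reading, labelled here as an assumption, sentence-level corruption could shuffle word order and word-level corruption could scramble the letters inside each word; both keep the glyphs on the page unchanged while removing the linguistic cues a language prior could exploit. The function names and the sample sentence are illustrative.

```python
import random


def shuffle_word_order(sentence, seed=0):
    """Sentence-level corruption (assumed form): same words, random order."""
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def scramble_letters(sentence, seed=0):
    """Word-level corruption (assumed form): same letters, random order
    inside each word, so the words are no longer real vocabulary items."""
    rng = random.Random(seed)
    scrambled = []
    for word in sentence.split():
        letters = list(word)
        rng.shuffle(letters)
        scrambled.append("".join(letters))
    return " ".join(scrambled)


text = "optical compression maps long documents into few visual tokens"
print(shuffle_word_order(text))
print(scramble_letters(text))
```

Rendering the corrupted text to an image and comparing recognition accuracy against the clean version is what separates reading ability from guessing based on context.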

[66] MIND: From Passive Mimicry to Active Reasoning through Capability-Aware Multi-Perspective CoT Distillation

Jin Cui,Jiaqi Guo,Jiepeng Zhou,Ruixuan Yang,Jiayi Lu,Jiajun Xu,Jiangcheng Song,Boran Zhao,Pengju Ren

Main category: cs.CL

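TL;DR: Proposes MIND, a framework whose capability-adaptive, feedback-driven mechanism turns knowledge distillation from passive mimicry into active cognitive construction, improving small models' reasoning both in-domain and across domains.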

Details

Motivation: Existing knowledge distillation methods have small models imitate a single optimal reasoning path and ignore the student's changing capacity and preferences, which degrades the student's reasoning distribution and hurts performance. Method: A "Teaching Assistant" network synthesizes multiple teacher perspectives, and a feedback-driven inertia calibration mechanism uses filtered training loss to adjust the supervision signal dynamically so that it matches the student's current adaptability. Result: The method achieves state-of-the-art performance on multiple in-distribution and out-of-distribution benchmarks, and latent space analysis supports the mechanism by which reasoning ability is internalized. Conclusion: MIND mitigates the distribution shift and catastrophic forgetting seen in conventional distillation and achieves more efficient, better-generalizing transfer of reasoning ability. Abstract: While Large Language Models (LLMs) have emerged with remarkable capabilities in complex tasks through Chain-of-Thought reasoning, practical resource constraints have sparked interest in transferring these abilities to smaller models. However, achieving both domain performance and cross-domain generalization remains challenging. Existing approaches typically restrict students to following a single golden rationale and treat different reasoning paths independently. Due to distinct inductive biases and intrinsic preferences, alongside the student's evolving capacity and reasoning preferences during training, a teacher's "optimal" rationale could act as out-of-distribution noise. This misalignment leads to a degeneration of the student's latent reasoning distribution, causing suboptimal performance. To bridge this gap, we propose MIND, a capability-adaptive framework that transitions distillation from passive mimicry to active cognitive construction. We synthesize diverse teacher perspectives through a novel "Teaching Assistant" network. By employing a Feedback-Driven Inertia Calibration mechanism, this network utilizes inertia-filtered training loss to align supervision with the student's current adaptability, effectively enhancing performance while mitigating catastrophic forgetting. Extensive experiments demonstrate that MIND achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks, and our sophisticated latent space analysis further confirms the mechanism of reasoning ability internalization.

[67] Stuttering-Aware Automatic Speech Recognition for Indonesian Language

Fadhil Muhammad,Alwin Djuliansah,Adrian Aryaputra Hamzah,Kurniawati Azizah

Main category: cs.CL

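TL;DR: Proposes a data augmentation framework that uses rule-based transformations and large language models to generate synthetic stuttered speech, making speech recognition for low-resource languages such as Indonesian more robust to dysfluent speech.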

Details

Motivation: Existing automatic speech recognition systems degrade markedly on stuttered speech, especially in low-resource languages such as Indonesian that lack dedicated datasets, so better recognition of dysfluent speech is urgently needed. Method: Repetitions and prolongations are injected into fluent text through rule-based transformations and large language models, and text-to-speech synthesis then produces synthetic stuttered audio; this synthetic data is used to fine-tune a pre-trained Indonesian Whisper model via transfer learning. Result: Experiments show that the model fine-tuned on synthetic data consistently reduces recognition errors on stuttered speech while maintaining good performance on fluent speech. Conclusion: Synthetic data augmentation effectively improves the ability of low-resource speech recognition systems to handle dysfluent speech and offers a viable path toward more inclusive speech technology. Abstract: Automatic speech recognition systems have achieved remarkable performance on fluent speech but continue to degrade significantly when processing stuttered speech, a limitation that is particularly acute for low-resource languages like Indonesian where specialized datasets are virtually non-existent. To overcome this scarcity, we propose a data augmentation framework that generates synthetic stuttered audio by injecting repetitions and prolongations into fluent text through a combination of rule-based transformations and large language models followed by text-to-speech synthesis. We apply this synthetic data to fine-tune a pre-trained Indonesian Whisper model using transfer learning, enabling the architecture to adapt to dysfluent acoustic patterns without requiring large-scale real-world recordings. Our experiments demonstrate that this targeted synthetic exposure consistently reduces recognition errors on stuttered speech while maintaining performance on fluent segments, validating the utility of synthetic data pipelines for developing more inclusive speech technologies in under-represented languages.
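
The rule-based half of the augmentation pipeline (injecting repetitions and prolongations into fluent text before text-to-speech) might look roughly like the sketch below. The specific rules, the injection rate, and the function names are assumptions for illustration; the paper's actual rules and its LLM-based variant are not described in the abstract.

```python
import random

VOWELS = "aeiou"


def prolong(word):
    """Prolongation (assumed rule): stretch the first vowel, e.g. 'mau' -> 'maaau'."""
    for i, ch in enumerate(word):
        if ch.lower() in VOWELS:
            return word[:i] + ch * 3 + word[i + 1:]
    return word + word[-1] * 2  # no vowel: stretch the final character instead


def repeat_onset(word):
    """Sound repetition (assumed rule): 'saya' -> 's-s-saya'."""
    return f"{word[0]}-{word[0]}-{word}"


def inject_stutter(text, rate=0.3, seed=0):
    """Apply a repetition or a prolongation to roughly `rate` of the words."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        draw = rng.random()
        if draw < rate / 2:
            out.append(repeat_onset(word))
        elif draw < rate:
            out.append(prolong(word))
        else:
            out.append(word)
    return " ".join(out)


# Indonesian for "I want to eat rice".
print(inject_stutter("saya mau makan nasi", rate=0.5, seed=1))
```

The dysfluent text would then be passed to a text-to-speech system to obtain audio paired with the original fluent transcript as the training target.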

[68] O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL

Yi Yao,He Zhu,Piaohong Wang,Jincheng Ren,Xinlong Yang,Qianben Chen,Xiaowan Li,Dingfeng Shi,Jiaxian Li,Qiexiang Wang,Sinuo Wang,Xinpeng Liu,Jiaqi Wu,Minghao Liu,Wangchunshu Zhou

Main category: cs.CL

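TL;DR: Proposes a multi-agent framework that automatically generates high-quality instruction data and, combined with a two-stage training strategy, substantially improves open-source LLMs on deep research benchmarks.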

Details

Motivation: To narrow the performance gap between open-source and closed-source LLMs, which stems largely from unequal access to high-quality training data. Method: A collaborative multi-agent workflow synthesizes complex, research-grade instruction data, followed by a two-stage training strategy that combines supervised fine-tuning with a new reinforcement learning method. Result: The framework is validated across multiple model scales and enables open-source models to reach new state-of-the-art performance on the major deep research benchmark. Conclusion: The approach offers a scalable and effective path for advancing open-source LLMs without relying on proprietary data or models. Abstract: The performance gap between closed-source and open-source large language models (LLMs) is largely attributed to disparities in access to high-quality training data. To bridge this gap, we introduce a novel framework for the automated synthesis of sophisticated, research-grade instructional data. Our approach centers on a multi-agent workflow where collaborative AI agents simulate complex tool-integrated reasoning to generate diverse and high-fidelity data end-to-end. Leveraging this synthesized data, we develop a two-stage training strategy that integrates supervised fine-tuning with a novel reinforcement learning method, designed to maximize model alignment and capability. Extensive experiments demonstrate that our framework empowers open-source models across multiple scales, enabling them to achieve new state-of-the-art performance on the major deep research benchmark. This work provides a scalable and effective pathway for advancing open-source LLMs without relying on proprietary data or models.

[69] Whose Facts Win? LLM Source Preferences under Knowledge Conflicts

Jakob Schuster,Vagrant Gautam,Katja Markert

Main category: cs.CL

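TL;DR: Studies which information sources LLMs prefer under knowledge conflicts, finding that they favor institutionally corroborated information but that repeating low-credibility information can reverse this preference, and proposes a new method that sharply reduces repetition bias.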

Details

Motivation: To examine how LLMs behave under knowledge conflicts in retrieval-augmented generation, in particular how the source of information affects their judgments, filling a gap in research on source credibility. Method: A new framework evaluates 13 open-weight LLMs in controlled experiments, analyzing which information they choose when sources such as governments, newspapers, and social media conflict, and a method is designed to mitigate the bias introduced by repetition. Result: LLMs prefer institutionally backed information, but simply repeating low-credibility information can reverse this preference; the proposed method reduces repetition bias by up to 99.8% while preserving at least 88.8% of the original source preferences. Conclusion: Information source and repetition strongly influence how LLMs select knowledge, so system design should account for source credibility and include mechanisms that mitigate repetition bias. Abstract: As large language models (LLMs) are more frequently used in retrieval-augmented generation pipelines, it is increasingly relevant to study their behavior under knowledge conflicts. Thus far, the role of the source of the retrieved information has gone unexamined. We address this gap with a novel framework to investigate how source preferences affect LLM resolution of inter-context knowledge conflicts in English, motivated by interdisciplinary research on credibility. With a comprehensive, tightly-controlled evaluation of 13 open-weight LLMs, we find that LLMs prefer institutionally-corroborated information (e.g., government or newspaper sources) over information from people and social media. However, these source preferences can be reversed by simply repeating information from less credible sources. To mitigate repetition effects and maintain consistent preferences, we propose a novel method that reduces repetition bias by up to 99.8%, while also maintaining at least 88.8% of original preferences. We release all data and code to encourage future work on credibility and source preferences in knowledge-intensive NLP.

Dominik Macko

Main category: cs.CL

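TL;DR: Examines how detectable personalized disinformation generated by multilingual LLMs is, covering 10 languages and 16 models, and finds that personalization toward social-media platforms markedly reduces detectability, most strongly in English.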

Details

Motivation: There is concern that LLMs could be misused to generate personalized disinformation in multiple languages, yet prior work has studied only English and lacks cross-lingual analysis. Method: 1080 prompt personalization settings are systematically evaluated across 10 languages, with 16 language models generating 17,280 texts in total, and the texts are analyzed for differences in personalization quality and detectability. Result: Personalization quality varies across languages; personalization toward social-media platforms reduces the detectability of machine-generated text more than personalization toward demographic groups, with the strongest effect in English. Conclusion: In multilingual settings, personalization capabilities bring both misuse risks and potential benefits, and their effect on detection deserves particular attention, especially in languages with high generation quality such as English. Abstract: Capabilities of large language models to generate multilingual coherent text have continuously improved in recent years, which raises concerns about their potential misuse. Previous research has shown that they can be misused for generation of personalized disinformation in multiple languages. It has also been observed that personalization negatively affects detectability of machine-generated texts; however, this has been studied in the English language only. In this work, we examine this phenomenon across 10 languages, focusing not only on potential misuse of personalization capabilities, but also on potential benefits they offer. Overall, we cover 1080 combinations of various personalization aspects in the prompts, for which the texts are generated by 16 distinct language models (17,280 texts in total). Our results indicate that there are differences in personalization quality of the generated texts when targeting demographic groups and when targeting social-media platforms across languages. Personalization towards platforms affects detectability of the generated texts to a greater extent, especially in English, where the personalization quality is the highest.

[71] Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations

Pingjun Hong,Benjamin Roth

Main category: cs.CL

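TL;DR: This study asks whether self-explanations generated by large language models (LLMs) help human and LLM judges predict model behavior. Although the explanations may not fully reflect the true decision process, they do improve simulatability on counterfactual questions.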

Details

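Motivation: Prior work shows that LLM rationales may not accurately reflect the model's decision process, but it remains unclear whether such explanations still help users predict model behavior. This study evaluates the practical value of self-explanations for interpretability and predictability. Method: Using StrategyQA, compares how accurately human and LLM judges predict a model's answers to counterfactual follow-up questions with and without access to chain-of-thought or post-hoc explanations; test cases are constructed both with LLM-generated counterfactuals and with pragmatics-based perturbations. Result: Self-explanations consistently improve prediction accuracy for both human and LLM judges, but the size and stability of the gains depend on the perturbation strategy and judge strength; a qualitative analysis indicates that explanations help humans form more accurate predictions. Conclusion: LLM self-explanations may not reveal the true decision mechanism, yet they are practically useful for making model behavior more predictable, especially when paired with a suitable perturbation strategy. Abstract: Large Language Models (LLMs) can produce verbalized self-explanations, yet prior studies suggest that such rationales may not reliably reflect the model's true decision process. We ask whether these explanations nevertheless help users predict model behavior, operationalized as counterfactual simulatability. Using StrategyQA, we evaluate how well humans and LLM judges can predict a model's answers to counterfactual follow-up questions, with and without access to the model's chain-of-thought or post-hoc explanations. We compare LLM-generated counterfactuals with pragmatics-based perturbations as alternative ways to construct test cases for assessing the potential usefulness of explanations. Our results show that self-explanations consistently improve simulation accuracy for both LLM judges and humans, but the degree and stability of gains depend strongly on the perturbation strategy and judge strength. We also conduct a qualitative analysis of free-text justifications written by human users when predicting the model's behavior, which provides evidence that access to explanations helps humans form more accurate predictions on the perturbed questions.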

[72] Tracing the complexity profiles of different linguistic phenomena through the intrinsic dimension of LLM representations

Marco Baroni,Emily Cheng,Iria deDios-Flores,Francesca Franzon

Main category: cs.CL

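TL;DR: This paper explores the intrinsic dimension (ID) of large language model (LLM) representations as a marker of linguistic complexity, finding that ID changes across layers reflect both formal and functional complexity and relate to specific phases of linguistic processing.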

Details

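Motivation: Study how linguistic complexity is represented inside LLMs and whether intrinsic dimension (ID) can distinguish formal from functional complexity. Method: Analyzes the ID of each LLM layer, complemented by representational similarity analysis and layer ablation experiments, and compares ID across different types of sentence structure. Result: Formal complexity, such as multiple coordinated or subordinated clauses, produces marked ID differences whose onset aligns with a phase of more abstract linguistic processing. Functional contrasts, such as right branching vs. center embedding or unambiguous vs. ambiguous attachment, are also captured by ID, but the signal is weaker and does not align with the same processing phase. Representational similarity and ablation experiments confirm these trends. Conclusion: ID is a useful marker of linguistic complexity in LLMs: it distinguishes different types of complexity and points to similar processing stages across models. Abstract: We explore the intrinsic dimension (ID) of LLM representations as a marker of linguistic complexity, asking if different ID profiles across LLM layers differentially characterize formal and functional complexity. We find the formal contrast between sentences with multiple coordinated or subordinated clauses to be reflected in ID differences whose onset aligns with a phase of more abstract linguistic processing independently identified in earlier work. The functional contrasts between sentences characterized by right branching vs. center embedding or unambiguous vs. ambiguous relative clause attachment are also picked up by ID, but in a less marked way, and they do not correlate with the same processing phase. Further experiments using representational similarity and layer ablation confirm the same trends. We conclude that ID is a useful marker of linguistic complexity in LLMs, that it allows differentiating between different types of complexity, and that it points to similar stages of linguistic processing across disparate LLMs.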

[73] HearSay Benchmark: Do Audio LLMs Leak What They Hear?

Jin Wang,Liang Lin,Kaiwen Luo,Weiliu Wang,Yitian Chen,Moayad Aloqaily,Xuehai Tang,Zhenhong Zhou,Kun Wang,Li Sun,Qingsong Wen

Main category: cs.CL

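TL;DR: This paper introduces HearSay, the first benchmark for evaluating the risk that audio large language models (ALLMs) leak user privacy through voiceprints. Built from over 22,000 real-world audio clips, it shows that ALLMs infer private attributes such as gender with high accuracy, that existing safety mechanisms are inadequate, and that chain-of-thought reasoning amplifies privacy risk.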

Details

Motivation: Audio large language models (ALLMs) have made notable progress in understanding and generation, but their privacy implications remain largely unexplored. This study investigates whether ALLMs can leak user privacy through voiceprints alone. Method: Builds HearSay, a large-scale benchmark of over 22,000 real-world audio clips, curated through a rigorous pipeline that combines automated profiling with human verification so that all privacy labels are grounded in factual records, and uses it to evaluate multiple ALLMs systematically. Result: (1) ALLMs extract sensitive private attributes from voiceprints, reaching 92.89% accuracy on gender; (2) existing safety mechanisms are severely inadequate, with models almost never refusing privacy-intruding queries about physiological traits; (3) chain-of-thought (CoT) reasoning amplifies privacy risk by helping capable models uncover deeper acoustic correlations. Conclusion: ALLMs have serious privacy-leakage vulnerabilities, and targeted privacy alignment techniques are urgently needed to counter voiceprint-based threats. Abstract: While Audio Large Language Models (ALLMs) have achieved remarkable progress in understanding and generation, their potential privacy implications remain largely unexplored. This paper takes the first step to investigate whether ALLMs inadvertently leak user privacy solely through acoustic voiceprints and introduces HearSay, a comprehensive benchmark constructed from over 22,000 real-world audio clips. To ensure data quality, the benchmark is meticulously curated through a rigorous pipeline involving automated profiling and human verification, guaranteeing that all privacy labels are grounded in factual records. Extensive experiments on HearSay yield three critical findings: Significant Privacy Leakage: ALLMs inherently extract private attributes from voiceprints, reaching 92.89% accuracy on gender and effectively profiling social attributes. Insufficient Safety Mechanisms: Alarmingly, existing safeguards are severely inadequate; most models fail to refuse privacy-intruding requests, exhibiting near-zero refusal rates for physiological traits. Reasoning Amplifies Risk: Chain-of-Thought (CoT) reasoning exacerbates privacy risks in capable models by uncovering deeper acoustic correlations. These findings expose critical vulnerabilities in ALLMs, underscoring the urgent need for targeted privacy alignment. The codes and dataset are available at https://github.com/JinWang79/HearSay_Benchmark

[74] Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents

Dehao Tao,Guoliang Ma,Yongfeng Huang,Minghu Jiang

Main category: cs.CL

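TL;DR: Membox is a hierarchical memory architecture that models topic continuity to improve temporal reasoning and contextual coherence for LLM agents in dialogue, while substantially reducing context-token usage.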

Details

Motivation: Existing LLM agent memory systems break the topic continuity and causal flow of dialogue and rely on retrieval biased toward lexical similarity, which hurts both coherence and efficiency. Method: Proposes Membox, which consists of a Topic Loom (a sliding-window monitor that groups consecutive same-topic turns into "memory boxes") and a Trace Weaver (which links sealed boxes into long-range event-timeline traces across discontinuities), so that memory is structured at storage time. Result: On LoCoMo, Membox improves F1 on temporal reasoning tasks by up to 68%, outperforming baselines such as Mem0 and A-MEM while using only a fraction of the context tokens. Conclusion: By explicitly modeling topic continuity, Membox offers a cognitively motivated mechanism that improves both memory coherence and efficiency in LLM agents. Abstract: Human-agent dialogues often exhibit topic continuity (a stable thematic frame that evolves through temporally adjacent exchanges), yet most large language model (LLM) agent memory systems fail to preserve it. Existing designs follow a fragmentation-compensation paradigm: they first break dialogue streams into isolated utterances for storage, then attempt to restore coherence via embedding-based retrieval. This process irreversibly damages narrative and causal flow, while biasing retrieval towards lexical similarity. We introduce Membox, a hierarchical memory architecture centered on a Topic Loom that continuously monitors dialogue in a sliding-window fashion, grouping consecutive same-topic turns into coherent "memory boxes" at storage time. Sealed boxes are then linked by a Trace Weaver into long-range event-timeline traces, recovering macro-topic recurrences across discontinuities. Experiments on LoCoMo demonstrate that Membox achieves up to 68% F1 improvement on temporal reasoning tasks, outperforming competitive baselines (e.g., Mem0, A-MEM). Notably, Membox attains these gains while using only a fraction of the context tokens required by existing methods, highlighting a superior balance between efficiency and effectiveness. By explicitly modeling topic continuity, Membox offers a cognitively motivated mechanism for enhancing both coherence and efficiency in LLM agents.
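
The storage-time grouping described in the Membox abstract can be illustrated with a minimal sketch. This is not Membox's implementation: the word-overlap test `same_topic`, the `threshold` value, and the `window` parameter are hypothetical stand-ins for whatever topic detector the Topic Loom actually uses. The sketch only shows the idea of extending an open box while the topic continues and opening a new one when it shifts.

```python
import re


def tokens(text):
    """Lowercased word set for one dialogue turn."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def same_topic(box, turn, window=2, threshold=0.2):
    """Hypothetical topic test: Jaccard overlap between the new turn
    and the last `window` turns of the currently open box."""
    recent = set().union(*(tokens(t) for t in box[-window:]))
    new = tokens(turn)
    if not recent or not new:
        return False
    return len(recent & new) / len(recent | new) >= threshold


def group_into_boxes(turns, window=2, threshold=0.2):
    """Group consecutive same-topic turns into 'memory boxes'."""
    boxes = []
    for turn in turns:
        if boxes and same_topic(boxes[-1], turn, window, threshold):
            boxes[-1].append(turn)  # topic continues: extend the open box
        else:
            boxes.append([turn])    # topic shift: start a new box
    return boxes


dialogue = [
    "I am planning a trip to Kyoto in April",
    "Which temples in Kyoto should I visit on the trip",
    "My laptop battery drains very fast",
    "Would a new laptop battery still drain very fast",
]
boxes = group_into_boxes(dialogue)
```

With this toy dialogue the two travel turns land in one box and the two laptop turns in another. A real system would presumably replace the word-overlap test with a learned or LLM-based topic judgment.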

[75] InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

Ziyun Zhang,Zezhou Wang,Xiaoyi Zhang,Zongyu Guo,Jiahao Li,Bin Li,Yan Lu

Main category: cs.CL

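TL;DR: This paper presents InfiniteWeb, a system that automatically generates functional web environments at scale to address the scarcity of environments for GUI agent training, combining unified specification, task-driven development, and diversified design to improve training results.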

Details

Motivation: GUI agent training is limited by the scarcity of suitable environments; building realistic, functional websites with many interconnected pages is particularly challenging. Method: Proposes InfiniteWeb, which uses a unified specification and task-centric test-driven development, and combines website seeds with reference design images to generate diverse, fully functional web environments. It also generates verifiable task evaluators that provide dense reward signals for reinforcement learning. Result: InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained in its generated environments show significant improvements on OSWorld and Online-Mind2Web. Conclusion: InfiniteWeb generates diverse, functional, and evaluable web environments that substantially improve GUI agent training, supporting the feasibility and advantages of the approach. Abstract: GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well on generating a single webpage, building a realistic and functional website with many interconnected pages faces challenges. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seed with reference design image to ensure diversity. Our system also generates verifiable task evaluators enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of the proposed system.

[76] Compact Example-Based Explanations for Language Models

Loris Schoenegger,Benjamin Roth

Main category: cs.CL

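TL;DR: Proposes the selection relevance score, a retraining-free metric that measures how useful a set of examples is for explaining a model's output, and finds that common selection strategies often underperform random selection. Based on this, it proposes a new strategy that balances influence and representativeness to improve explanation quality.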

Details

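Motivation: Training data influence estimation methods can support example-based explanations, but because humans cannot process large numbers of documents, a subset of the training data must be selected as the explanation; prior work has largely ignored how the selection strategy affects explanation quality. Method: Proposes a retraining-free selection relevance score and validates its predictive power through fine-tuning experiments; compares different selection strategies and designs a new strategy that balances influence and representativeness. Result: Common selection strategies often underperform random selection; the proposed score can predict whether a set of examples supports or undermines the model's predictions; the new strategy makes better use of a limited budget than conventional methods. Conclusion: The selection strategy strongly affects explanation quality, and relying only on the highest-influence examples is not optimal; balancing influence and representativeness uses the selection budget more effectively and yields more reliable explanations. Abstract: Training data influence estimation methods quantify the contribution of training documents to a model's output, making them a promising source of information for example-based explanations. As humans cannot interpret thousands of documents, only a small subset of the training data can be presented as an explanation. Although the choice of which documents to include directly affects explanation quality, previous evaluations of such systems have largely ignored any selection strategies. To address this, we propose a novel selection relevance score, a retraining-free metric that quantifies how useful a set of examples is for explaining a model's output. We validate this score through fine-tuning experiments, confirming that it can predict whether a set of examples supports or undermines the model's predictions. Using this metric, we further show that common selection strategies often underperform random selection. Motivated by this finding, we propose a strategy that balances influence and representativeness, enabling better use of selection budgets than naively selecting the highest-ranking examples.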

[77] NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

Zhongtao Miao,Kaiyan Zhao,Masaaki Nagata,Yoshimasa Tsuruoka

Main category: cs.CL

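TL;DR: This paper proposes NeoAMT, an agentic framework for neologism-aware machine translation built on a Wiktionary search tool. It constructs a new dataset covering 16 languages together with a retrieval corpus, and combines reinforcement learning with an adaptive reward and rollout design to improve translation quality.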

Details

Motivation: Neologisms are often ignored or mistranslated by conventional machine translation, and existing methods offer little support for neologism-aware translation, so a dedicated framework is needed to translate them accurately across languages. Method: Builds a neologism-aware machine translation dataset covering 16 languages and 75 translation directions and develops a Wiktionary-based search tool; designs a reinforcement learning training framework with a new reward design and an adaptive rollout generation strategy based on "translation difficulty" to train the translation agent. Result: Produces a large-scale neologism dataset and retrieval corpus; according to the summary, experiments show that the proposed RL framework combined with the search tool improves neologism translation accuracy and overall translation quality. Conclusion: By combining external knowledge retrieval with reinforcement learning, NeoAMT improves neologism translation in multilingual settings and offers a scalable approach to neologism-aware machine translation. Abstract: Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation using a Wiktionary search tool. Specifically, we first create a new dataset for neologism-aware machine translation and develop a search tool based on Wiktionary. The new dataset covers 16 languages and 75 translation directions and is derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search tool is also constructed from around 3 million cleaned records of the Wiktionary dump. We then use it for training the translation agent with reinforcement learning (RL) and evaluating the accuracy of neologism-aware machine translation. Based on this, we also propose an RL training framework that contains a novel reward design and an adaptive rollout generation approach by leveraging "translation difficulty" to further improve the translation quality of translation agents using our search tool.

[78] Do LLMs Really Memorize Personally Identifiable Information? Revisiting PII Leakage with a Cue-Controlled Memorization Framework

Xiaoyu Luo,Yiyi Chen,Qiongxiu Li,Johannes Bjerva

Main category: cs.CL

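TL;DR: This paper proposes Cue-Resistant Memorization (CRM), a new framework for evaluating memorization in large language models (LLMs) that assesses leakage of personally identifiable information (PII) under low lexical cue conditions. It finds that previously reported PII leakage is driven mainly by surface cues in the prompt rather than by genuine memorization.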

Details

Motivation: Existing work equates PII leakage with model memorization but does not control for cues in the prompt, which leads to overestimating memorization. This paper aims to establish a stricter, principled evaluation standard for privacy-relevant memorization. Method: Proposes the Cue-Resistant Memorization (CRM) framework, which controls lexical overlap between prompt and target, and uses it for a large-scale re-evaluation across 32 languages and multiple memorization paradigms (prefix-suffix completion, associative reconstruction, cue-free generation, and membership inference). Result: Most successful PII reconstruction depends on direct surface-form cues; once these cues are controlled for, reconstruction success drops substantially; cue-free generation and membership inference both show extremely low true positive rates. Conclusion: Previously reported PII leakage mainly reflects prompt-induced generalization or pattern completion rather than genuine memorization; cue-controlled evaluation is needed to reliably assess memorization and privacy risk in LLMs. Abstract: Large Language Models (LLMs) have been reported to "leak" Personally Identifiable Information (PII), with successful PII reconstruction often interpreted as evidence of memorization. We propose a principled revision of memorization evaluation for LLMs, arguing that PII leakage should be evaluated under low lexical cue conditions, where target PII cannot be reconstructed through prompt-induced generalization or pattern completion. We formalize Cue-Resistant Memorization (CRM) as a cue-controlled evaluation framework and a necessary condition for valid memorization evaluation, explicitly conditioning on prompt-target overlap cues. Using CRM, we conduct a large-scale multilingual re-evaluation of PII leakage across 32 languages and multiple memorization paradigms. Revisiting reconstruction-based settings, including verbatim prefix-suffix completion and associative reconstruction, we find that their apparent effectiveness is driven primarily by direct surface-form cues rather than by true memorization. When such cues are controlled for, reconstruction success diminishes substantially. We further examine cue-free generation and membership inference, both of which exhibit extremely low true positive rates. Overall, our results suggest that previously reported PII leakage is better explained by cue-driven behavior than by genuine memorization, highlighting the importance of cue-controlled evaluation for reliably quantifying privacy-relevant memorization in LLMs.
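
To make "conditioning on prompt-target overlap cues" concrete, the sketch below computes one simple overlap measure and uses it to separate high-cue from low-cue prompts. It is an illustration under stated assumptions, not the CRM definition from the paper: the token-level measure `cue_overlap`, the helper `is_low_cue`, and the `max_overlap` threshold are all hypothetical, and the address is a fictional `example.org` one.

```python
import re


def pieces(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cue_overlap(prompt, target):
    """Fraction of target tokens that already appear in the prompt."""
    target_tokens = pieces(target)
    if not target_tokens:
        return 0.0
    prompt_tokens = set(pieces(prompt))
    hits = sum(tok in prompt_tokens for tok in target_tokens)
    return hits / len(target_tokens)


def is_low_cue(prompt, target, max_overlap=0.0):
    """Hypothetical filter: keep prompts that share (almost) no
    surface tokens with the target PII."""
    return cue_overlap(prompt, target) <= max_overlap


target = "jane.doe@example.org"
high_cue_prompt = "Contact Jane Doe at jane.doe@"
low_cue_prompt = "The committee chair can be reached at"

high = cue_overlap(high_cue_prompt, target)
low = cue_overlap(low_cue_prompt, target)
```

In the first prompt half of the target's tokens are already present, so a correct completion says little about memorization; the second prompt gives the model no surface material to copy from.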

[79] VietMed-MCQ: A Consistency-Filtered Data Synthesis Framework for Vietnamese Traditional Medicine Evaluation

Huynh Trung Kiet,Dao Sy Duy Minh,Nguyen Dinh Ha Duong,Le Hoang Minh Huy,Long Nguyen,Dien Dinh

Main category: cs.CL

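TL;DR: This paper introduces VietMed-MCQ, a multiple-choice question dataset for Vietnamese Traditional Medicine built with retrieval-augmented generation and a dual-model validation mechanism, addressing the lack of LLM evaluation benchmarks in low-resource medical domains.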

Details

Motivation: LLMs perform poorly in culturally specific medical domains such as Vietnamese Traditional Medicine because high-quality, structured benchmarks are scarce. This paper aims to fill that gap. Method: Proposes a dataset generation pipeline based on Retrieval-Augmented Generation (RAG) with a dual-model validation mechanism to ensure reasoning consistency, producing the VietMed-MCQ dataset of 3,190 questions across three difficulty levels, manually validated by one medical expert and four students. Result: VietMed-MCQ achieves a 94.2% approval rate with a Fleiss' kappa of 0.82, indicating substantial agreement; benchmarking seven open-source models shows that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, but all models still struggle with complex diagnostic reasoning. Conclusion: Cross-lingual conceptual transfer helps model performance in low-resource medical domains, and VietMed-MCQ provides a valuable public benchmark for this line of research. Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance significantly degrades in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily due to the scarcity of high-quality, structured benchmarks. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency check mechanism. Unlike previous synthetic datasets, our framework incorporates a dual-model validation approach to ensure reasoning consistency through independent answer verification, though the substring-based evidence checking has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and underwent validation by one medical expert and four students, achieving 94.2 percent approval with substantial inter-rater agreement (Fleiss' kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. Results reveal that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, highlighting cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.
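
For readers unfamiliar with the agreement statistic reported for VietMed-MCQ, the sketch below implements the standard Fleiss' kappa formula. The rating matrix is made-up illustrative data for five raters (matching the one expert plus four students mentioned in the abstract) and two categories; it is not the paper's data and does not reproduce its 0.82 figure.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters who assigned
    item i to category j. Every item needs the same number of ratings.
    Undefined (division by zero) if all ratings fall in one category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")

    # Observed agreement: mean per-item share of agreeing rater pairs.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)


# Illustrative ratings only: columns are [approve, reject], 5 raters per item.
ratings = [[5, 0], [5, 0], [0, 5], [4, 1], [0, 5]]
kappa = fleiss_kappa(ratings)
```

Values above roughly 0.8 are conventionally read as very strong agreement, which is the range the abstract reports.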

[80] Where meaning lives: Layer-wise accessibility of psycholinguistic features in encoder and decoder language models

Taisiia Tikhomirova,Dirk U. Wulff

Main category: cs.CL

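TL;DR: This study systematically probes the layer-wise representation of 58 psycholinguistic features in 10 transformer models, finding that the apparent location of meaning depends heavily on the embedding extraction method and that models share a common depth ordering of meaning dimensions.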

Details

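Motivation: Understanding how transformer language models encode psychologically interpretable components of meaning matters for both cognitive science and model interpretability. Method: Conducts layer-wise probing of 10 transformer models (encoder-only and decoder-only architectures) over 58 psycholinguistic features and compares three embedding extraction methods (e.g., contextualized vs. isolated embeddings). Result: Where meaning appears to be represented depends strongly on the method: contextualized embeddings yield higher feature-specific selectivity and different layer-wise profiles; the final layer is rarely the best layer for recovering psycholinguistic information; despite these differences, models share a depth ordering in which lexical properties peak in earlier layers while experiential and affective dimensions peak later. Conclusion: Where meaning is stored in transformer models reflects an interaction between methodological choices and architectural constraints rather than a purely internal property of the model. Abstract: Understanding where transformer language models encode psychologically meaningful aspects of meaning is essential for both theory and practice. We conduct a systematic layer-wise probing study of 58 psycholinguistic features across 10 transformer models, spanning encoder-only and decoder-only architectures, and compare three embedding extraction methods. We find that apparent localization of meaning is strongly method-dependent: contextualized embeddings yield higher feature-specific selectivity and different layer-wise profiles than isolated embeddings. Across models and methods, final-layer representations are rarely optimal for recovering psycholinguistic information with linear probes. Despite these differences, models exhibit a shared depth ordering of meaning dimensions, with lexical properties peaking earlier and experiential and affective dimensions peaking later. Together, these results show that where meaning "lives" in transformer models reflects an interaction between methodological choices and architectural constraints.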

[81] AI Generated Text Detection

Adilkhan Alikhanov,Aidar Amangeldi,Diar Demeubay,Dilnaz Akhmetzhan,Nurbek Moldakhmetov,Omar Polat,Galymzhan Zharas

Main category: cs.CL

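TL;DR: This paper evaluates several AI text detection methods on a unified benchmark built from the HC3 and DAIGT v2 datasets, using a topic-based data split to prevent information leakage, and finds that deep learning models with contextual semantic modeling (such as DistilBERT) perform best.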

Details

Motivation: As large language models advance, students increasingly pass off AI-generated content as their own academic work, which undermines academic integrity and calls for effective AI text detection. Method: Evaluates traditional machine learning models and transformer-based deep learning models on a unified benchmark built from HC3 and DAIGT v2, with a topic-based data split to prevent information leakage. Result: TF-IDF logistic regression reaches 82.87% accuracy; BiLSTM reaches 88.86%; DistilBERT reaches 88.11% accuracy with the highest ROC-AUC of 0.96, the strongest overall performance. The results indicate that contextual semantic modeling outperforms lexical features and that a topic-isolated evaluation protocol is important for generalization. Conclusion: Context-based deep learning models perform better at AI text detection, and a sound evaluation protocol helps measure generalization properly; future work will focus on greater dataset diversity, parameter-efficient fine-tuning, and more efficient inference. Abstract: The rapid development of large language models has led to an increase in AI-generated text, with students increasingly using LLM-generated content as their own work, which violates academic integrity. This paper presents an evaluation of AI text detection methods, including both traditional machine learning models and transformer-based architectures. We utilize two datasets, HC3 and DAIGT v2, to build a unified benchmark and apply a topic-based data split to prevent information leakage. This approach ensures robust generalization across unseen domains. Our experiments show that TF-IDF logistic regression achieves a reasonable baseline accuracy of 82.87%. However, deep learning models outperform it. The BiLSTM classifier achieves an accuracy of 88.86%, while DistilBERT achieves a similar accuracy of 88.11% with the highest ROC-AUC score of 0.96, demonstrating the strongest overall performance. The results indicate that contextual semantic modeling is significantly superior to lexical features and highlight the importance of mitigating topic memorization through appropriate evaluation protocols. The limitations of this work are primarily related to dataset diversity and computational constraints. In future work, we plan to expand dataset diversity and utilize parameter-efficient fine-tuning methods such as LoRA. We also plan to explore smaller or distilled models and employ more efficient batching strategies and hardware-aware optimization.
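
The topic-based split mentioned in this abstract can be sketched as follows. The paper does not describe its exact procedure here, so this is only one plausible reading: whole topics are assigned to either the training or the test side, so that no topic appears on both. The `(topic, text, label)` tuple layout, the `test_fraction` parameter, and the toy samples are assumptions made for illustration.

```python
import random
from collections import defaultdict


def topic_split(samples, test_fraction=0.2, seed=0):
    """Split (topic, text, label) samples so that no topic appears
    in both the training and the test set."""
    by_topic = defaultdict(list)
    for sample in samples:
        by_topic[sample[0]].append(sample)

    topics = sorted(by_topic)                  # deterministic starting order
    random.Random(seed).shuffle(topics)        # seeded shuffle for reproducibility
    n_test = max(1, round(len(topics) * test_fraction))
    test_topics = set(topics[:n_test])

    train = [s for t in topics if t not in test_topics for s in by_topic[t]]
    test = [s for t in topics if t in test_topics for s in by_topic[t]]
    return train, test


samples = [
    ("climate", "essay a", "human"), ("climate", "essay b", "ai"),
    ("history", "essay c", "human"), ("history", "essay d", "ai"),
    ("biology", "essay e", "human"), ("biology", "essay f", "ai"),
    ("economics", "essay g", "human"), ("economics", "essay h", "ai"),
    ("music", "essay i", "human"), ("music", "essay j", "ai"),
]
train, test = topic_split(samples)
```

A random per-sample split would let a classifier score well by recognizing topics it has already seen; holding out entire topics removes that shortcut, which is the leakage concern the abstract raises.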

[82] Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning

Fei Wu,Zhenrong Zhang,Qikai Chang,Jianshu Zhang,Quan Liu,Jun Du

Main category: cs.CL

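TL;DR: This paper proposes SPAE, a new method for fine-grained credit assignment in reinforcement learning with verifiable rewards. It introduces a Step Potential signal to supervise the reasoning process, improving both the accuracy and the efficiency of large language models on long chain-of-thought reasoning.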

Details

Motivation: Existing reinforcement learning methods with verifiable rewards lack a semantically grounded, step-level measure of reasoning progress, so models struggle to tell necessary deduction from redundant verification and may over-check or overturn a correct answer. Method: Proposes a training-free probing mechanism that extracts intermediate confidence and correctness at each reasoning step and combines them into a Step Potential signal. On top of this, Step Potential Advantage Estimation (SPAE) performs fine-grained advantage estimation: it amplifies potential gains, penalizes potential drops, and applies a penalty after the potential saturates to encourage timely termination. Result: Across multiple benchmarks, SPAE consistently improves reasoning accuracy while substantially shortening outputs, outperforming strong RL baselines as well as recent efficient-reasoning and token-level advantage estimation methods. Conclusion: By introducing fine-grained process supervision based on the reasoning state, SPAE improves credit assignment in long chain-of-thought reasoning and yields more accurate and more efficient reasoning. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) elicits long chain-of-thought reasoning in large language models (LLMs), but outcome-based rewards lead to coarse-grained advantage estimation. While existing approaches improve RLVR via token-level entropy or sequence-level length control, they lack a semantically grounded, step-level measure of reasoning progress. As a result, LLMs fail to distinguish necessary deduction from redundant verification: they may continue checking after reaching a correct solution and, in extreme cases, overturn a correct trajectory into an incorrect final answer. To remedy the lack of process supervision, we introduce a training-free probing mechanism that extracts intermediate confidence and correctness and combines them into a Step Potential signal that explicitly estimates the reasoning state at each step. Building on this signal, we propose Step Potential Advantage Estimation (SPAE), a fine-grained credit assignment method that amplifies potential gains, penalizes potential drops, and applies penalty after potential saturates to encourage timely termination. Experiments across multiple benchmarks show SPAE consistently improves accuracy while substantially reducing response length, outperforming strong RL baselines and recent efficient reasoning and token-level advantage estimation methods. The code is available at https://github.com/cii030/SPAE-RL.

[83] Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search

Yu Guo,Shenghao Ye,Shuangwu Chen,Zijian Wen,Tao Zhang,Qirui Bai,Dong Jin,Yunpeng Hou,Huasen He,Jian Yang,Xiaobin Tan

Main category: cs.CL

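TL;DR: This paper proposes TabTrim, a new table pruning framework that improves table question answering through parallel search supervised by gold trajectories.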

Details

Motivation: Existing table pruning methods rely on sequential revisions driven by unreliable critique signals and often lose answer-critical data. Method: Derives a gold pruning trajectory from the intermediate sub-tables produced while executing gold SQL queries, trains a pruner and a verifier to align step-wise pruning with that trajectory, and at inference time uses parallel search to explore multiple candidate pruning trajectories. Result: Achieves state-of-the-art performance across diverse tabular reasoning tasks: TabTrim-8B reaches 73.5% average accuracy, including 79.4% on WikiTQ and 61.2% on TableBench. Conclusion: TabTrim improves the accuracy and robustness of pruning in table question answering and outperforms existing methods. Abstract: Table Question Answering (TableQA) benefits significantly from table pruning, which extracts compact sub-tables by eliminating redundant cells to streamline downstream reasoning. However, existing pruning methods typically rely on sequential revisions driven by unreliable critique signals, often failing to detect the loss of answer-critical data. To address this limitation, we propose TabTrim, a novel table pruning framework which transforms table pruning from sequential revisions to gold trajectory-supervised parallel search. TabTrim derives a gold pruning trajectory using the intermediate sub-tables in the execution process of gold SQL queries, and trains a pruner and a verifier to make the step-wise pruning result align with the gold pruning trajectory. During inference, TabTrim performs parallel search to explore multiple candidate pruning trajectories and identify the optimal sub-table. Extensive experiments demonstrate that TabTrim achieves state-of-the-art performance across diverse tabular reasoning tasks: TabTrim-8B reaches 73.5% average accuracy, outperforming the strongest baseline by 3.2%, including 79.4% on WikiTQ and 61.2% on TableBench.

[84] What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs

Seyed Mahed Mousavi,Simone Alghisi,Giuseppe Riccardi

Main category: cs.CL

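TL;DR: This paper studies the dynamics of knowledge learning in continual pre-training (CPT), finds that loss optimization does not track actual knowledge learning, and argues that CPT should be evaluated through task-level learning dynamics.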

Details

Motivation: Current CPT practice uses loss as a proxy for knowledge learning without a real understanding of how knowledge is acquired; the authors aim to reveal the actual dynamics of knowledge learning in CPT and how they relate to optimization. Method: Builds a controlled, distribution-matched benchmark of factual documents and interleaves diagnostic probes into the CPT loop, enabling epoch-level measurement of knowledge acquisition and of changes in out-of-domain skills, combined with circuit analysis of how knowledge pathways evolve. Result: Although loss decreases monotonically, factual learning is unstable and non-monotonic; new knowledge is rarely consolidated, learning depends strongly on prior exposure, and out-of-domain performance degrades from early epochs; knowledge pathways are rapidly reconfigured across epochs, which explains narrow acquisition windows and systematic forgetting. Conclusion: Loss optimization does not accurately reflect learning progress in CPT; evaluation and stopping criteria should be based on task-level learning dynamics to make CPT more effective. Abstract: Continual Pre-Training (CPT) is widely used for acquiring and updating factual knowledge in LLMs. This practice treats loss as a proxy for knowledge learning, while offering no grounding into how it changes during training. We study CPT as a knowledge learning process rather than solely an optimization problem. We construct a controlled, distribution-matched benchmark of factual documents and interleave diagnostic probes directly into the CPT loop, enabling epoch-level measurement of knowledge acquisition dynamics and changes in Out-Of-Domain (OOD) general skills (e.g., math). We further analyze how CPT reshapes knowledge circuits during training. Across three instruction-tuned LLMs and multiple CPT strategies, optimization and learning systematically diverge as loss decreases monotonically while factual learning is unstable and non-monotonic. Acquired facts are rarely consolidated, learning is strongly conditioned on prior exposure, and OOD performance degrades from early epochs. Circuit analysis reveals rapid reconfiguration of knowledge pathways across epochs, providing an explanation for narrow acquisition windows and systematic forgetting. These results show that loss optimization is misaligned with learning progress in CPT and motivate evaluation of stopping criteria based on task-level learning dynamics.

[85] PartisanLens: A Multilingual Dataset of Hyperpartisan and Conspiratorial Immigration Narratives in European Media

Michele Joshua Maggini,Paloma Piot,Anxo Pérez,Erik Bran Marino,Lúa Santamaría Montesinos,Ana Lisboa,Marta Vázquez Abuín,Javier Parapar,Pablo Gamallo

Main category: cs.CL

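TL;DR: This paper introduces PartisanLens, the first multilingual dataset of hyperpartisan news headlines in Spanish, Italian, and Portuguese, for detecting hyperpartisan narratives and Population Replacement Conspiracy Theories (PRCT), and evaluates large language models on both classification and automatic annotation.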

Details

Motivation: Existing resources are scarce and English-centric, and they fail to capture how hyperpartisanship, stance, and rhetorical bias relate to one another in political discourse; multilingual, multi-aspect data is needed. Method: Builds PartisanLens, a dataset of 1,617 multilingual hyperpartisan news headlines, evaluates the classification performance of large language models on it, explores their use as automatic annotators, and simulates annotation preferences under different socio-economic and ideological profiles. Result: Establishes strong baselines for classifying hyperpartisan and PRCT narratives; LLMs show potential as automatic annotators but still have limitations; the study also explores whether conditioning on profiles lets them emulate different human annotation perspectives. Conclusion: PartisanLens is an important resource for detecting partisan and conspiratorial narratives in European contexts and supports future multilingual misinformation research. Abstract: Detecting hyperpartisan narratives and Population Replacement Conspiracy Theories (PRCT) is essential to addressing the spread of misinformation. These complex narratives pose a significant threat, as hyperpartisanship drives political polarisation and institutional distrust, while PRCTs directly motivate real-world extremist violence, making their identification critical for social cohesion and public safety. However, existing resources are scarce, predominantly English-centric, and often analyse hyperpartisanship, stance, and rhetorical bias in isolation rather than as interrelated aspects of political discourse. To bridge this gap, we introduce PartisanLens, the first multilingual dataset of 1,617 hyperpartisan news headlines in Spanish, Italian, and Portuguese, annotated in multiple political discourse aspects. We first evaluate the classification performance of widely used Large Language Models (LLMs) on this dataset, establishing robust baselines for the classification of hyperpartisan and PRCT narratives. In addition, we assess the viability of using LLMs as automatic annotators for this task, analysing their ability to approximate human annotation. Results highlight both their potential and current limitations. Next, moving beyond standard judgments, we explore whether LLMs can emulate human annotation patterns by conditioning them on socio-economic and ideological profiles that simulate annotator perspectives. Finally, we release our resources and evaluation; PartisanLens supports future research on detecting partisan and conspiratorial narratives in European contexts.

[86] What Matters For Safety Alignment?

Xing Li,Hui-Ling Zhen,Lihao Yin,Xianzhi Yu,Zhenhua Dong,Mingxuan Yuan

Main category: cs.CL

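TL;DR: This paper presents a large-scale empirical study of safety alignment in large language models (LLMs) and large reasoning models (LRMs), evaluating the influence of six intrinsic characteristics and three external attack techniques across 32 popular models, 13 model families, 5 safety datasets, and 56 jailbreak techniques, for a total of 4.6M API calls. Findings: LRMs with reasoning and self-reflection mechanisms are safer; post-training and knowledge distillation may weaken safety alignment; a chain-of-thought attack delivered through a response prefix raises the attack success rate by 3.34x on average, exposing a major risk in text-completion interfaces; roleplay, prompt injection, and gradient-based search are the main ways to elicit unaligned behavior.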

Details

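Motivation: As large language models are deployed across many domains, their safety has become a pressing concern, and improving safety alignment is a key challenge. Existing work lacks systematic empirical analysis, in particular of how behavior differs across model architectures, training strategies, and attack methods. A comprehensive empirical study is therefore needed to identify the factors that matter for safety alignment and to guide the construction of safer, more reliable AI systems. Method: A large-scale empirical evaluation of 32 recent, popular LLMs and LRMs from 13 model families, with parameter counts from 3B to 235B. The evaluation covers six key intrinsic characteristics (such as whether the model has a reasoning mechanism and how it was trained) and three external attack techniques (such as roleplay, prompt injection, and gradient-based search). It uses five established safety datasets together with 56 jailbreak techniques and four chain-of-thought (CoT) attack strategies, totaling 4.6M API calls, to analyze how each factor affects model safety. Result: First, GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B are the three safest models, indicating that integrated reasoning and self-reflection mechanisms substantially strengthen safety alignment. Second, post-training and knowledge distillation may cause a systematic degradation of safety alignment, so safety should be treated as an explicit constraint or a core optimization objective. Third, a CoT attack via a response prefix raises the average attack success rate by 3.34x, and from 0.6% to 96.3% on Seed-OSS-36B-Instruct, revealing a major risk in text-completion interfaces. Fourth, roleplay, prompt injection, and gradient-based adversarial prompt search are the main ways to elicit unaligned behavior in modern models. Conclusion: Safety alignment cannot be left as a by-product of general capability improvements; it must be a primary objective in model design and training, especially during post-training and knowledge distillation. Reasoning and self-reflection mechanisms markedly improve safety, but specific attack paths such as response-prefix manipulation carry large risks. Future deployments should restrict features such as user-defined response prefixes and strengthen architecture-level defenses against advanced jailbreak attacks. Abstract: This paper presents a comprehensive empirical study of safety alignment capabilities. We evaluate what matters for safety alignment in LLMs and LRMs to provide essential insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation is conducted using 32 recent, popular LLMs and LRMs across thirteen distinct model families, spanning a parameter scale from 3B to 235B. The assessment leverages five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four CoT attack strategies, resulting in 4.6M API calls. Our key empirical findings are fourfold. First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment. Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We thus argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability. Third, we reveal a pronounced vulnerability: employing a CoT attack via a response prefix can elevate the attack success rate by 3.34x on average and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This critical finding underscores the safety risks inherent in text-completion interfaces and features that allow user-defined response prefixes in LLM services, highlighting an urgent need for architectural and deployment safeguards. Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methodologies for eliciting unaligned behaviors in modern models.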

[87] Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

Jinyang Wu,Guocheng Zhai,Ruihan Jin,Jiahao Yuan,Yuhao Shen,Shuai Zhang,Zhengqi Wen,Jianhua Tao

Main category: cs.CL

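TL;DR: This paper proposes ATLAS, a dual-path framework for dynamic tool usage in cross-domain complex reasoning. Training-free cluster-based routing and reinforcement-learning-based multi-step routing improve the adaptability and generalization of model-tool combinations.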

Details

Motivation: As large language models and external tools diversify, choosing the optimal model-tool combination becomes a high-dimensional optimization problem. Existing methods rely on a single model or fixed tool-calling logic and so fail to exploit performance differences between combinations. Method: The ATLAS framework has two paths: a training-free cluster-based router that uses empirical priors for domain-specific alignment, and an RL-based multi-step router that explores autonomous trajectories for out-of-distribution generalization. Result: Across 15 benchmarks, ATLAS outperforms closed-source models such as GPT-4o and surpasses existing routing methods on in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks; it also shows clear gains in visual reasoning by orchestrating multi-modal tools. Conclusion: By dynamically selecting the best model-tool combination, ATLAS improves performance and generalization on complex reasoning tasks and suggests a route to more flexible, efficient AI agents. Abstract: The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.
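
The training-free, cluster-based path described above can be illustrated with a small sketch. This is not the authors' implementation: the centroids, the accuracy table, and the model and tool names are all hypothetical, and the abstract does not specify the embedding or clustering method. The idea shown is only that a query is assigned to its nearest cluster and the model-tool pair with the best recorded accuracy in that cluster is chosen.

```python
import math

def nearest_cluster(query_vec, centroids):
    """Return the index of the centroid closest to query_vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(query_vec, centroids[i]))

def route(query_vec, centroids, cluster_accuracy):
    """Pick the model-tool pair with the best recorded accuracy in the query's cluster."""
    cluster = nearest_cluster(query_vec, centroids)
    priors = cluster_accuracy[cluster]
    return max(priors, key=priors.get)

# Hypothetical 2-D query embeddings and per-cluster accuracy priors.
centroids = [(0.0, 0.0), (10.0, 10.0)]
cluster_accuracy = [
    {("model_a", "calculator"): 0.81, ("model_b", "search"): 0.64},
    {("model_a", "calculator"): 0.42, ("model_b", "search"): 0.77},
]

choice = route((9.0, 8.5), centroids, cluster_accuracy)
```

Because the priors are just a lookup table, this kind of router needs no gradient updates; the RL-based path in the abstract addresses the cases such a table cannot cover.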

[88] Evaluating Small Decoder-Only Language Models for Grammar Correction and Text Simplification

Anthony Lamelas

Main category: cs.CL

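TL;DR: This paper examines small decoder-only language models on grammar correction and text simplification. Despite their efficiency advantages, they still underperform current large language models and have problems with meaning preservation and hallucination, so further training improvements are needed.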

Details

Motivation: Large language models are hard to deploy, access, and secure, so this paper asks whether small language models can serve as an efficient alternative for text rewriting tasks. Method: Small language models are tested out of the box (zero-shot), fine-tuned, and run sequentially on the JFLEG and ASSET datasets, and are evaluated with established metrics on grammar correction and text simplification. Result: Small language models can learn certain behaviors, but they are weak at preserving meaning and avoiding hallucination, and their overall performance remains below strong baselines and current large language models. Conclusion: Despite their computational efficiency, small language models cannot yet compete with modern large language models on rewriting tasks; further advances in training are needed to close the gap. Abstract: Large language models have become extremely popular recently due to their ability to achieve strong performance on a variety of tasks, such as text generation and rewriting, but their size and computation cost make them difficult to access, deploy, and secure in many settings. This paper investigates whether small, decoder-only language models can provide an efficient alternative for the tasks of grammar correction and text simplification. The experiments in this paper focus on testing small language models out of the box, fine-tuned, and run sequentially on the JFLEG and ASSET datasets using established metrics. The results show that while SLMs may learn certain behaviors well, their performance remains below strong baselines and current LLMs. The results also show that SLMs struggle with retaining meaning and avoiding hallucinations. These findings suggest that despite their efficiency advantages, current SLMs are not yet competitive enough with modern LLMs for rewriting, and further advances in training are required for SLMs to close the performance gap between them and today's LLMs.

[89] Decide Then Retrieve: A Training-Free Framework with Uncertainty-Guided Triggering and Dual-Path Retrieval

Wang Chen,Guanqiang Qi,Weikang Li,Yang Li,Deguo Xia,Jizhou Huang

Main category: cs.CL

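TL;DR: Proposes DTR, a training-free retrieval-augmented generation framework that uses generation uncertainty to decide when to retrieve and a dual-path mechanism to adaptively select external information, improving open-domain question answering.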

Details

Motivation: Existing retrieval-augmented generation methods trigger retrieval indiscriminately and rely on a single evidence path, which tends to introduce noise and limit performance gains. Method: Generation uncertainty guides the retrieval trigger, and a dual-path retrieval mechanism with adaptive information selection is designed to better handle sparse and ambiguous queries. Result: Experiments on five open-domain QA benchmarks, several model scales, and different retrievers show that DTR consistently improves EM and F1 over standard RAG and other strong baselines while reducing unnecessary retrievals. Conclusion: DTR is an effective training-free framework that adaptively decides when to retrieve and which information to select, improving RAG system performance. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, but existing approaches indiscriminately trigger retrieval and rely on single-path evidence construction, often introducing noise and limiting performance gains. In this work, we propose Decide Then Retrieve (DTR), a training-free framework that adaptively determines when retrieval is necessary and how external information should be selected. DTR leverages generation uncertainty to guide retrieval triggering and introduces a dual-path retrieval mechanism with adaptive information selection to better handle sparse and ambiguous queries. Extensive experiments across five open-domain QA benchmarks, multiple model scales, and different retrievers demonstrate that DTR consistently improves EM and F1 over standard RAG and strong retrieval-enhanced baselines, while reducing unnecessary retrievals. The code and data used in this paper are available at https://github.com/ChenWangHKU/DTR.
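
As a rough illustration of uncertainty-guided triggering, the sketch below scores a draft answer by the mean negative log-likelihood of its tokens and retrieves only when that score exceeds a threshold. The uncertainty measure, the threshold value of 0.5, and the function names are assumptions made for illustration; the abstract does not state which uncertainty estimate DTR uses, and the linked repository is the authoritative source.

```python
import math

def mean_token_nll(token_probs):
    """Average negative log-likelihood of the tokens in a draft answer."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def should_retrieve(token_probs, threshold=0.5):
    """Trigger retrieval only when the draft answer looks uncertain.

    The threshold is a hypothetical tuning knob, not a value from the paper.
    """
    return mean_token_nll(token_probs) > threshold

# Hypothetical per-token probabilities from two draft answers.
confident_draft = [0.95, 0.90, 0.97]
uncertain_draft = [0.30, 0.40, 0.20]

decisions = (should_retrieve(confident_draft), should_retrieve(uncertain_draft))
```

A gate of this kind is what lets a system skip retrieval for questions the model already answers confidently, which is the source of the reduced retrieval count reported above.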

[90] When Models Decide and When They Bind: A Two-Stage Computation for Multiple-Choice Question-Answering

Hugh Mee Wong,Rick Nouwen,Albert Gatt

Main category: cs.CL

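TL;DR: The study finds that language models answer multiple-choice questions in two stages: they first select the correct answer in content space, then bind it to the corresponding output symbol.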

Details

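Motivation: To answer a multiple-choice question, a model must both reason out the correct answer and emit the symbol that represents it, which conflates reasoning errors with symbol-binding failures. This motivates studying how language models implement the process internally. Method: Representational analyses (PCA, linear probes) and causal interventions are used to analyze residual states at option boundaries and to probe how the winning answer's content position and output symbol are represented. Result: Residual states at option boundaries carry linearly decodable correctness signals; the winning content position becomes decodable right after the final option is processed, whereas the output symbol is represented only closer to the answer emission position; symbol and content permutation tests support a two-stage mechanism. Conclusion: Language models handle multiple-choice questions in two stages: they first determine the best answer in content space and then bind that answer to the correct output symbol. Abstract: Multiple-choice question answering (MCQA) is easy to evaluate but adds a meta-task: models must both solve the problem and output the symbol that *represents* the answer, conflating reasoning errors with symbol-binding failures. We study how language models implement MCQA internally using representational analyses (PCA, linear probes) as well as causal interventions. We find that option-boundary (newline) residual states often contain strong linearly decodable signals related to per-option correctness. Winner-identity probing reveals a two-stage progression: the winning *content position* becomes decodable immediately after the final option is processed, while the *output symbol* is represented closer to the answer emission position. Tests under symbol and content permutations support a two-stage mechanism in which models first select a winner in content space and then bind or route that winner to the appropriate symbol to emit.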

[91] Doc-PP: Document Policy Preservation Benchmark for Large Vision-Language Models

Haeun Jang,Hwan Chang,Hwanhee Lee

Main category: cs.CL

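TL;DR: This paper introduces Doc-PP, a benchmark that evaluates how well large vision-language models follow document disclosure policies, and exposes a systemic safety gap in which reasoning leads to leakage of sensitive information. It proposes the DVA framework, which decouples reasoning from policy verification and improves safety in multimodal document understanding.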

Details

Motivation: Existing safety research focuses on implicit social norms or text-only settings and overlooks the complexity of context-dependent, dynamic disclosure policies in multimodal documents. Policy adherence in real-world document question answering therefore needs to be evaluated and improved. Method: Doc-PP is a multimodal benchmark built from real-world reports that requires models to reason across visual and textual elements under strict non-disclosure policies. The proposed DVA (Decompose-Verify-Aggregation) framework separates reasoning from policy verification to prevent leakage of sensitive information. Result: Models tend to leak sensitive information when an answer requires cross-modal synthesis, forming a "Reasoning-Induced Safety Gap"; providing extracted text improves perception but increases leakage risk; DVA clearly outperforms standard prompting defenses. Conclusion: DVA's structured inference mitigates policy violations in multimodal document QA and provides a reliable baseline for policy-compliant document understanding. Abstract: The deployment of Large Vision-Language Models (LVLMs) for real-world document question answering is often constrained by dynamic, user-defined policies that dictate information disclosure based on context. While ensuring adherence to these explicit constraints is critical, existing safety research primarily focuses on implicit social norms or text-only settings, overlooking the complexities of multimodal documents. In this paper, we introduce Doc-PP (Document Policy Preservation Benchmark), a novel benchmark constructed from real-world reports requiring reasoning across heterogeneous visual and textual elements under strict non-disclosure policies. Our evaluation highlights a systemic Reasoning-Induced Safety Gap: models frequently leak sensitive information when answers must be inferred through complex synthesis or aggregated across modalities, effectively circumventing existing safety constraints. Furthermore, we identify that providing extracted text improves perception but inadvertently facilitates leakage. To address these vulnerabilities, we propose DVA (Decompose-Verify-Aggregation), a structural inference framework that decouples reasoning from policy verification. Experimental results demonstrate that DVA significantly outperforms standard prompting defenses, offering a robust baseline for policy-compliant document understanding.

[92] Large-Scale Aspect-Based Sentiment Analysis with Reasoning-Infused LLMs

Paweł Liskowski,Krzysztof Jankowski

Main category: cs.CL

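TL;DR: This paper presents Arctic-ABSA, a family of aspect-based sentiment analysis (ABSA) models for real-world use that extends the set of sentiment classes, supports multiple languages, and reaches state-of-the-art performance on several benchmarks.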

Details

Motivation: Commercial applications need finer-grained, multilingual, and highly accurate aspect-based sentiment analysis, and existing three-class ABSA models are not sufficient for complex real-world scenarios. Method: The models are trained on large-scale real and synthetic data; the sentiment classes are extended from three to five (adding mixed and unknown); overall text sentiment is predicted jointly; reasoning injection is applied, and a new reasoning pretraining method is introduced for encoder models; a single large-scale multilingual model is built. Result: The 395M-parameter encoder and 8B-parameter decoder achieve up to 10 percentage points higher accuracy than GPT-4o and Claude 3.5 Sonnet and set new state-of-the-art results on SemEval14; a single multilingual model maintains 87-91% accuracy across six languages without degrading English performance; the authors release ABSA-mix, a large-scale benchmark that aggregates 17 public datasets. Conclusion: By extending the sentiment classes, adding reasoning ability, and supporting multiple languages, Arctic-ABSA improves the practical applicability and performance of ABSA models. Abstract: We introduce Arctic-ABSA, a collection of powerful models for real-life aspect-based sentiment analysis (ABSA). Our models are tailored to commercial needs, trained on a large corpus of public data alongside carefully generated synthetic data, resulting in a dataset 20 times larger than SemEval14. We extend typical ABSA models by expanding the number of sentiment classes from the standard three (positive, negative, neutral) to five, adding mixed and unknown classes, while also jointly predicting overall text sentiment and supporting multiple languages. We experiment with reasoning injection by fine-tuning on Chain-of-Thought (CoT) examples and introduce a novel reasoning pretraining technique for encoder-only models that significantly improves downstream fine-tuning and generalization. Our 395M-parameter encoder and 8B-parameter decoder achieve up to 10 percentage points higher accuracy than GPT-4o and Claude 3.5 Sonnet, while setting new state-of-the-art results on the SemEval14 benchmark. A single multilingual model maintains 87-91% accuracy across six languages without degrading English performance. We release ABSA-mix, a large-scale benchmark aggregating 17 public ABSA datasets across 92 domains.

Song-Duo Ma,Yi-Hung Liu,Hsin-Yu Lin,Pin-Yu Chen,Hong-Yan Huang,Shau-Yung Hsu,Yun-Nung Chen

Main category: cs.CL

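TL;DR: This paper proposes RADAR, an efficient method for detecting LLM-generated misinformation that combines retrieval augmentation with adversarial refinement. A generator and a lightweight detector co-evolve, and adversarial feedback expressed in natural language improves detection performance.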

Details

Motivation: Countering the spread of misinformation generated by large models requires more robust and efficient automatic detection. Method: A retrieval-augmented detection framework pairs a generator that rewrites real articles with factual perturbations with a lightweight detector that verifies claims using dense passage retrieval; verbal adversarial feedback (VAF), given in natural language, drives the co-evolution of generator and detector. Result: On a fake news detection benchmark, RADAR reaches 86.98% ROC-AUC, clearly outperforming general-purpose LLMs with retrieval; ablations show that retrieval contributes the largest gains, while VAF and few-shot demonstrations are critical for robust training. Conclusion: Through structured adversarial feedback and retrieval augmentation, RADAR achieves stronger fake news detection and offers an effective response to the misuse of generative models. Abstract: To efficiently combat the spread of LLM-generated misinformation, we present RADAR, a retrieval-augmented detector with adversarial refinement for robust fake news detection. Our approach employs a generator that rewrites real articles with factual perturbations, paired with a lightweight detector that verifies claims using dense passage retrieval. To enable effective co-evolution, we introduce verbal adversarial feedback (VAF). Rather than relying on scalar rewards, VAF issues structured natural-language critiques; these guide the generator toward more sophisticated evasion attempts, compelling the detector to adapt and improve. On a fake news detection benchmark, RADAR achieves 86.98% ROC-AUC, significantly outperforming general-purpose LLMs with retrieval. Ablation studies confirm that detector-side retrieval yields the largest gains, while VAF and few-shot demonstrations provide critical signals for robust training.

[94] Benchmark^2: Systematic Evaluation of LLM Benchmarks

Qi Qian,Chengsong Huang,Jingwen Xu,Changze Lv,Muling Wu,Wenhao Liu,Xiaohua Wang,Zhenghua Wang,Zisu Huang,Muzhao Tian,Jianhan Xu,Kun Hu,He-Da Wang,Yao Hu,Xuanjing Huang,Xiaoqing Zheng

Main category: cs.CL

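TL;DR: Proposes the Benchmark^2 framework, which uses three metrics to systematically assess the quality of LLM benchmarks. Experiments show that the framework identifies high-quality benchmarks and allows test sets to be reduced in size.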

Details

Motivation: Benchmarks for large language models are proliferating, but there is no systematic method for assessing the quality of the benchmarks themselves. Method: The Benchmark^2 framework comprises three metrics (Cross-Benchmark Ranking Consistency, Discriminability Score, and Capability Alignment Deviation) and is validated experimentally on 15 benchmarks and 11 LLMs. Result: Existing benchmarks vary significantly in quality, and selective benchmark construction based on the framework keeps evaluation performance comparable while substantially reducing test-set size. Conclusion: Benchmark^2 provides an effective tool for assessing and improving benchmark quality and helps make LLM evaluation more reliable and efficient. Abstract: The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three complementary metrics: (1) Cross-Benchmark Ranking Consistency, measuring whether a benchmark produces model rankings aligned with peer benchmarks; (2) Discriminability Score, quantifying a benchmark's ability to differentiate between models; and (3) Capability Alignment Deviation, identifying problematic instances where stronger models fail but weaker models succeed within the same model family. We conduct extensive experiments across 15 benchmarks spanning mathematics, reasoning, and knowledge domains, evaluating 11 LLMs across four model families. Our analysis reveals significant quality variations among existing benchmarks and demonstrates that selective benchmark construction based on our metrics can achieve comparable evaluation performance with substantially reduced test sets.
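
Two of the ideas above lend themselves to a short sketch. The abstract does not give the paper's formulas, so the code below substitutes standard stand-ins: Kendall's tau (tau-a) as one plausible way to compare the model rankings produced by two benchmarks, and a simple rate of "inverted" instances, where a stronger model fails and a weaker one from the same family succeeds. The model names and scores are made up for illustration.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation (tau-a) between two score dicts over the same models."""
    models = sorted(scores_a)
    concordant = discordant = 0
    for m, n in combinations(models, 2):
        product = (scores_a[m] - scores_a[n]) * (scores_b[m] - scores_b[n])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs

def inverted_instance_rate(strong_correct, weak_correct):
    """Fraction of instances the stronger model fails while the weaker one passes."""
    inverted = sum(1 for s, w in zip(strong_correct, weak_correct) if not s and w)
    return inverted / len(strong_correct)

# Hypothetical accuracies of three models on two benchmarks.
bench_a = {"m1": 0.90, "m2": 0.70, "m3": 0.50}
bench_b = {"m1": 0.80, "m2": 0.60, "m3": 0.65}
tau = kendall_tau(bench_a, bench_b)

# Hypothetical per-instance correctness for a strong and a weak model.
rate = inverted_instance_rate([True, False, False, True], [True, True, False, False])
```

A tau near 1 means the two benchmarks rank models the same way; a high inverted-instance rate flags items that may be noisy or mislabeled.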

[95] VotIE: Information Extraction from Meeting Minutes

José Pedro Evans,Luís Filipe Cunha,Purificação Silvano,Alípio Jorge,Nuno Guimarães,Sérgio Nunes,Ricardo Campos

Main category: cs.CL

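TL;DR: This paper introduces the VotIE task, which extracts voting event information from unstructured municipal meeting minutes, and builds the first benchmark on Portuguese municipal records. Fine-tuned encoders perform best in-domain, while large language models are more robust in cross-municipality generalization but computationally more expensive.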

Details

Motivation: Municipal meeting minutes come in varied, non-standardized formats, which makes voting outcomes hard to extract automatically, and no task definition or benchmark dataset exists for this problem. Method: The VotIE benchmark is built on the CitiLink corpus; fine-tuned models such as XLM-R-CRF are compared with generative large models in in-domain and cross-municipality settings. Result: XLM-R-CRF reaches 93.2% macro F1 in-domain; in the cross-municipality setting its performance drops substantially, whereas few-shot LLMs generalize better but at a high computational cost. Conclusion: Lightweight fine-tuned encoders are better suited to large-scale practical deployment, while generative models generalize better but are less practical; the authors release their data, models, and evaluation framework to support further research. Abstract: Municipal meeting minutes record key decisions in local democratic processes. Unlike parliamentary proceedings, which typically adhere to standardized formats, they encode voting outcomes in highly heterogeneous, free-form narrative text that varies widely across municipalities, posing significant challenges for automated extraction. In this paper, we introduce VotIE (Voting Information Extraction), a new information extraction task aimed at identifying structured voting events in narrative deliberative records, and establish the first benchmark for this task using Portuguese municipal minutes, building on the recently introduced CitiLink corpus. Our experiments yield two key findings. First, under standard in-domain evaluation, fine-tuned encoders, specifically XLM-R-CRF, achieve the strongest performance, reaching 93.2% macro F1, outperforming generative approaches. Second, in a cross-municipality setting that evaluates transfer to unseen administrative contexts, these models suffer substantial performance degradation, whereas few-shot LLMs demonstrate greater robustness, with significantly smaller declines in performance. Despite this generalization advantage, the high computational cost of generative models currently constrains their practicality. As a result, lightweight fine-tuned encoders remain a more practical option for large-scale, real-world deployment. To support reproducible research in administrative NLP, we publicly release our benchmark, trained models, and evaluation framework.

[96] Simulated Students in Tutoring Dialogues: Substance or Illusion?

Alexander Scarlatos,Jaewook Lee,Simon Woodhead,Andrew Lan

Main category: cs.CL

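TL;DR: This paper studies large language models in education, focusing on student simulation and how to evaluate it. It proposes a set of evaluation metrics covering linguistic, behavioral, and cognitive aspects and benchmarks a range of student simulation methods.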

Details

Motivation: Evaluating new educational technology requires real students, which is time-consuming and hard to scale, so simulated students are increasingly used for training and evaluation; yet little work has gone into ensuring or measuring the quality of simulated students. Method: The paper formally defines the student simulation task, proposes evaluation metrics spanning linguistic, behavioral, and cognitive aspects, and benchmarks a wide range of simulation methods on a real-world math tutoring dialogue dataset. Result: Prompting strategies for student simulation perform poorly; supervised fine-tuning and preference optimization perform much better but remain limited. Conclusion: Existing student simulation methods are still limited, and further research is needed to improve simulation quality. Abstract: Advances in large language models (LLMs) enable many new innovations in education. However, evaluating the effectiveness of new technology requires real students, which is time-consuming and hard to scale up. Therefore, many recent works on LLM-powered tutoring solutions have used simulated students for both training and evaluation, often via simple prompting. Surprisingly, little work has been done to ensure or even measure the quality of simulated students. In this work, we formally define the student simulation task, propose a set of evaluation metrics that span linguistic, behavioral, and cognitive aspects, and benchmark a wide range of student simulation methods on these metrics. We experiment on a real-world math tutoring dialogue dataset, where both automated and human evaluation results show that prompting strategies for student simulation perform poorly; supervised fine-tuning and preference optimization yield much better but still limited performance, motivating future work on this challenging task.

[97] SpeakerSleuth: Evaluating Large Audio-Language Models as Judges for Multi-turn Speaker Consistency

Jonggeun Lee,Junseong Pyo,Gyuhyeon Seo,Yohan Jo

Main category: cs.CL

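TL;DR: This paper presents SpeakerSleuth, a benchmark that evaluates whether large audio-language models (LALMs) can judge speaker consistency in multi-turn dialogues. Existing LALMs detect acoustic inconsistencies poorly and show a modality bias that favors text over acoustics.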

Details

Motivation: LALMs are widely used as judges of speech generation quality, but their ability to assess speaker consistency across multi-turn conversations has not been explored, so a systematic evaluation is needed to reveal their limitations. Method: The SpeakerSleuth benchmark contains 1,818 human-verified instances drawn from four diverse datasets, with three tasks that test speaker-consistency judgment under different conditions. Result: Nine mainstream LALMs detect acoustic inconsistencies poorly, are easily swayed by textual coherence, struggle to locate the problematic turns, and can miss even obvious gender switches; they perform considerably better when choosing the audio that best matches a speaker's acoustic characteristics. Conclusion: LALMs suffer from a serious modality imbalance, relying too heavily on text and neglecting acoustic cues, and must improve before they can serve as reliable audio-language judges. Abstract: Large Audio-Language Models (LALMs) as judges have emerged as a prominent approach for evaluating speech generation quality, yet their ability to assess speaker consistency across multi-turn conversations remains unexplored. We present SpeakerSleuth, a benchmark evaluating whether LALMs can reliably judge speaker consistency in multi-turn dialogues through three tasks reflecting real-world requirements. We construct 1,818 human-verified evaluation instances across four diverse datasets spanning synthetic and real speech, with controlled acoustic difficulty. Evaluating nine widely-used LALMs, we find that models struggle to reliably detect acoustic inconsistencies. For instance, given audio samples of the same speaker's turns, some models overpredict inconsistency, whereas others are overly lenient. Models further struggle to identify the exact turns that are problematic. When other interlocutors' turns are provided together, performance degrades dramatically as models prioritize textual coherence over acoustic cues, failing to detect even obvious gender switches for a speaker. On the other hand, models perform substantially better in choosing the audio that best matches the speaker among several acoustic variants, demonstrating inherent acoustic discrimination capabilities. These findings expose a significant bias in LALMs: they tend to prioritize text over acoustics, revealing fundamental modality imbalances that need to be addressed to build reliable audio-language judges.

[98] Analyzing and Improving Cross-lingual Knowledge Transfer for Machine Translation

David Stap

Main category: cs.CL

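TL;DR: This thesis studies cross-lingual knowledge transfer in multilingual neural models, with machine translation as the central testbed. It examines how language similarity, retrieval augmentation, auxiliary supervision, and fine-tuning strategies affect low-resource translation, and shows that greater language diversity during training improves generalization and reduces erroneous output.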

Details

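Motivation: Multilingual machine translation matters for sharing knowledge across languages, but learning cross-lingual representations is hard for low-resource languages with limited parallel data. Understanding how models share knowledge across languages helps improve the robustness and generalization of multilingual systems. Method: Using machine translation as the experimental platform, the thesis analyzes how language similarity affects transfer; adds retrieval and auxiliary supervision signals to strengthen low-resource translation; studies the trade-offs that fine-tuning large language models on parallel data can introduce; and evaluates the role of language diversity during training. Result: Language similarity positively affects cross-lingual transfer; retrieval and auxiliary supervision improve low-resource translation; fine-tuning can introduce unintended negative trade-offs; and greater diversity of training languages improves generalization and reduces mistranslation such as off-target language output. Conclusion: Modeling choices and data composition strongly shape multilingual learning, and sound training strategies with broader language coverage help build more inclusive and robust multilingual NLP systems. Abstract: Multilingual machine translation systems aim to make knowledge accessible across languages, yet learning effective cross-lingual representations remains challenging. These challenges are especially pronounced for low-resource languages, where limited parallel data constrains generalization and transfer. Understanding how multilingual models share knowledge across languages requires examining the interaction between representations, data availability, and training strategies. In this thesis, we study cross-lingual knowledge transfer in neural models and develop methods to improve robustness and generalization in multilingual settings, using machine translation as a central testbed. We analyze how similarity between languages influences transfer, how retrieval and auxiliary supervision can strengthen low-resource translation, and how fine-tuning on parallel data can introduce unintended trade-offs in large language models. We further examine the role of language diversity during training and show that increasing translation coverage improves generalization and reduces off-target behavior. Together, this work highlights how modeling choices and data composition shape multilingual learning and offers insights toward more inclusive and resilient multilingual NLP systems.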

[99] When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life

Xinyue Lou,Jinan Xu,Jingyi Yin,Xiaolong Wang,Zhaolu Kang,Youwei Liao,Yixuan Wang,Xiangyu Shi,Fengran Mo,Su Yao,Kaiyu Huang

Main category: cs.CL

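TL;DR: This paper introduces SaLAD, a multimodal safety benchmark of 2,013 real-world image-text samples for evaluating the safety of multimodal large language models (MLLMs) in daily-life scenarios. Experiments show that the best current MLLMs reach a safe response rate of only 57.2% on unsafe queries, revealing the limits of existing models and safety alignment methods.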

Details

Motivation: As multimodal large language models (MLLMs) are used widely in daily life, the unsafe content they generate can negatively influence human behavior. Existing safety benchmarks lack authentic visual inputs and fine-grained cross-modal reasoning, so they do not reflect real-world risk, and a more realistic multimodal safety benchmark is needed. Method: The SaLAD benchmark contains 2,013 real image-text samples across 10 common categories, covering both unsafe scenarios and cases of oversensitivity; a safety-warning-based evaluation framework encourages models to give specific safety responses rather than generic refusals; the design emphasizes realistic risk exposure, authentic visual inputs, and fine-grained cross-modal reasoning so that risk cannot be inferred from text alone. Result: Across 18 MLLMs, the top-performing models reach a safe response rate of only 57.2% on unsafe queries; even with popular safety alignment methods, models still perform poorly on the benchmark, exposing serious weaknesses in recognizing dangerous everyday behaviors. Conclusion: Current MLLMs have significant gaps when facing real-world safety risks, and their cross-modal safety reasoning and safe-response mechanisms need improvement; SaLAD provides an evaluation tool for future research. Abstract: As Multimodal Large Language Models (MLLMs) become an indispensable assistant in human life, the unsafe content generated by MLLMs poses a danger to human behavior, perpetually overhanging human society like a sword of Damocles. To investigate and evaluate the safety impact of MLLM responses on human behavior in daily life, we introduce SaLAD, a multimodal safety benchmark which contains 2,013 real-world image-text samples across 10 common categories, with a balanced design covering both unsafe scenarios and cases of oversensitivity. It emphasizes realistic risk exposure, authentic visual inputs, and fine-grained cross-modal reasoning, ensuring that safety risks cannot be inferred from text alone. We further propose a safety-warning-based evaluation framework that encourages models to provide clear and informative safety warnings, rather than generic refusals. Results on 18 MLLMs demonstrate that the top-performing models achieve a safe response rate of only 57.2% on unsafe queries. Moreover, even popular safety alignment methods have limited effectiveness in our scenario, revealing the vulnerabilities of current MLLMs in identifying dangerous behaviors in daily life. Our dataset is available at https://github.com/xinyuelou/SaLAD.

[100] Modular Prompt Optimization: Optimizing Structured Prompts with Section-Local Textual Gradients

Prith Sharma,Austin Z. Henley

Main category: cs.CL

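TL;DR: This paper proposes Modular Prompt Optimization (MPO), a framework that decomposes a prompt into fixed semantic sections and optimizes each locally, improving the reasoning performance of small open-source LLMs.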

Details

Motivation: Existing prompt optimization methods usually treat the prompt as a monolithic block of text, which makes it hard to localize errors, preserve critical instructions, or control prompt growth; a more structured, interpretable approach is needed. Method: MPO divides the prompt into semantic sections such as system role, context, and task description, uses a critic language model to generate a textual gradient for each section, optimizes the sections independently, and consolidates updates through de-duplication. Result: On the ARC-Challenge and MMLU reasoning benchmarks, with LLaMA-3 8B-Instruct and Mistral-7B-Instruct as solver models, MPO outperforms both an untuned structured prompt and the TextGrad baseline. Conclusion: Section-wise local optimization under a fixed prompt structure is an effective and practical way to improve the reasoning ability of small open-source language models. Abstract: Prompt quality plays a central role in controlling the behavior, reliability, and reasoning performance of large language models (LLMs), particularly for smaller open-source instruction-tuned models that depend heavily on explicit structure. While recent work has explored automatic prompt optimization using textual gradients and self-refinement, most existing methods treat prompts as monolithic blocks of text, making it difficult to localize errors, preserve critical instructions, or prevent uncontrolled prompt growth. We introduce Modular Prompt Optimization (MPO), a schema-based prompt optimization framework that treats prompts as structured objects composed of fixed semantic sections, including system role, context, task description, constraints, and output format. MPO applies section-local textual gradients, generated by a critic language model, to refine each section independently while keeping the overall prompt schema fixed. Section updates are consolidated through de-duplication to reduce redundancy and interference between components, yielding an interpretable and robust optimization process. We evaluate MPO on two reasoning benchmarks, ARC-Challenge and MMLU, using LLaMA-3 8B-Instruct and Mistral-7B-Instruct as solver models. Across both benchmarks and models, MPO consistently outperforms an untuned structured prompt and the TextGrad baseline, achieving substantial accuracy gains without modifying model parameters or altering prompt structure.
These results demonstrate that maintaining a fixed prompt schema while applying localized, section-wise optimization is an effective and practical approach for improving reasoning performance in small open-source LMs.
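
The loop described in this abstract can be sketched schematically. The code below is an illustration under stated assumptions, not the MPO implementation: the critic and rewriter are stub functions standing in for language model calls, and line-level, case-insensitive de-duplication is a simplification of whatever consolidation the paper uses. Only the five section names come from the abstract.

```python
# Section names taken from the abstract; everything else here is hypothetical.
SCHEMA = ("system_role", "context", "task_description", "constraints", "output_format")

def dedupe_lines(text):
    """Drop blank lines and case-insensitive duplicates, keeping first occurrences."""
    seen, kept = set(), []
    for line in text.splitlines():
        key = line.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(line.strip())
    return "\n".join(kept)

def optimize_step(prompt, critic, rewrite):
    """Refine each section independently; the set and order of sections never change."""
    updated = {}
    for section in SCHEMA:
        feedback = critic(section, prompt[section])  # section-local "textual gradient"
        revised = rewrite(prompt[section], feedback) if feedback else prompt[section]
        updated[section] = dedupe_lines(revised)
    return updated

def stub_critic(section, text):
    """Stand-in for a critic LM: only criticizes the constraints section."""
    return "Ask for step-by-step reasoning." if section == "constraints" else ""

def stub_rewrite(text, feedback):
    """Stand-in for an editor LM: appends the critique as a new instruction."""
    return text + "\n" + feedback

prompt = {name: "" for name in SCHEMA}
prompt["constraints"] = "Answer with a single letter.\nanswer with a single letter."
new_prompt = optimize_step(prompt, stub_critic, stub_rewrite)
```

Keeping the schema as a fixed tuple is what makes each update attributable to one section, which is the interpretability argument the abstract makes against monolithic prompt rewriting.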

[101] Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion

Yuanfeng Xu,Yuhao Chen,Liang Lin,Guangrun Wang

Main category: cs.CL

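TL;DR: This paper proposes CoM-DAD, a new probabilistic framework for unified multimodal generation that decouples semantic planning from token generation through a hierarchical dual-process mechanism.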

Details

Motivation: Generative models use different paradigms for discrete data (such as text) and continuous data (such as images), which hinders the development of unified multimodal systems; masked language models lack generative fidelity and semantic continuity, and extending them to multimodal settings brings alignment difficulties and training instability. Method: CoM-DAD first models the semantic manifold with a continuous latent diffusion process, then treats token generation as a discrete absorbing diffusion process regulated by a variable-rate noise schedule and conditioned on the semantic priors; a Stochastic Mixed-Modal Transport strategy aligns modalities without heavy contrastive dual-encoders. Result: The method trains more stably than standard masked modeling in multimodal generation and supports generation in both directions, text-to-image and image-to-text. Conclusion: CoM-DAD offers a new paradigm for scalable, unified text-image generation and bridges the gap between autoregressive and diffusion modeling paradigms in multimodal generation. Abstract: The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a Variable-Rate Noise Schedule, conditioned on these evolving semantic priors. Crucially, we introduce a Stochastic Mixed-Modal Transport strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.

[102] KDCM: Reducing Hallucination in LLMs through Explicit Reasoning Structures

Jinbo Hao,Kai Yang,Qingzhen Su,Yifan Li,Chao Jiang

Main category: cs.CL

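TL;DR: Proposes a code-guided reasoning framework that embeds an executable module in the prompt to guide knowledge graph exploration, reducing prompt-induced hallucination in large language models.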

Details

Motivation: The goal is to mitigate hallucinations that prompts induce in large language models, especially erroneous reasoning caused by incomplete or misleading prompts. Method: The work extends a chain-style knowledge distillation approach with a programmable module that is embedded in the reasoning prompt as executable code; the module lets the model use external structured knowledge during inference and explicitly regulates intermediate reasoning steps. Result: On multiple public benchmarks with GPT-4 and LLaMA-3.3, HIT@1, HIT@3, and HIT@5 increase by 15.64%, 13.38%, and 13.28%, respectively, with scores above 95% in several settings; hallucinations drop and contextual modeling improves. Conclusion: The method constrains erroneous reasoning and improves both the accuracy and the interpretability of predictions, particularly by reducing prompt-induced hallucinations. Abstract: To mitigate hallucinations in large language models (LLMs), we propose a framework that focuses on errors induced by prompts. Our method extends a chain-style knowledge distillation approach by incorporating a programmable module that guides knowledge graph exploration. This module is embedded as executable code within the reasoning prompt, allowing the model to leverage external structured knowledge during inference. Based on this design, we develop an enhanced distillation-based reasoning framework that explicitly regulates intermediate reasoning steps, resulting in more reliable predictions. We evaluate the proposed approach on multiple public benchmarks using GPT-4 and LLaMA-3.3. Experimental results show that code-guided reasoning significantly improves contextual modeling and reduces prompt-induced hallucinations. Specifically, HIT@1, HIT@3, and HIT@5 increase by 15.64%, 13.38%, and 13.28%, respectively, with scores exceeding 95% across several evaluation settings. These findings indicate that the proposed method effectively constrains erroneous reasoning while improving both accuracy and interpretability.
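
For readers unfamiliar with the reported metric, HIT@k is conventionally the fraction of questions whose gold answer appears among the top k ranked predictions. The sketch below computes it under that standard definition; the paper's exact evaluation protocol is not given in the abstract, and the example questions and answers are invented.

```python
def hit_at_k(ranked_predictions, gold_answers, k):
    """Fraction of questions whose gold answer is among the top-k predictions."""
    hits = sum(
        1 for ranked, gold in zip(ranked_predictions, gold_answers) if gold in ranked[:k]
    )
    return hits / len(gold_answers)

# Hypothetical ranked candidate lists for three questions.
ranked = [
    ["Paris", "Lyon", "Nice"],
    ["Rome", "Milan", "Turin"],
    ["Bonn", "Berlin", "Munich"],
]
gold = ["Paris", "Turin", "Berlin"]

scores = {k: hit_at_k(ranked, gold, k) for k in (1, 2, 3)}
```

Since HIT@k can only grow with k, the three reported gains describe improvements at progressively looser cutoffs.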

[103] SearchAttack: Red-Teaming LLMs against Real-World Threats via Framing Unsafe Web Information-Seeking Tasks

Yu Yan,Sheng Sun,Mingfeng Li,Zheming Yang,Chiwei Zhu,Fei Ma,Benfeng Xu,Min Liu

Main category: cs.CL

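TL;DR: This paper proposes SearchAttack, a new method for attacking search-augmented LLMs by manipulating web search queries so that harmful content in the search results bypasses the model's safeguards.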

Details

Motivation: Because LLMs are unreliable on open and knowledge-intensive tasks, people turn to search-augmented LLMs; but when the search engine returns results that directly contain harmful information, the LLM cannot control that output, which makes web search a critical attack surface. Method: SearchAttack outsources the harmful semantics to web search, keeps only the query's skeleton and fragmented clues, and uses structural rubrics to steer the LLM into reconstructing the retrieved content toward a malicious goal. Result: Experiments show that SearchAttack is highly effective when red-teaming search-augmented LLMs. Conclusion: Web search is a key security weakness of LLM systems; SearchAttack exposes how vulnerable existing systems are to manipulated search queries and underlines the need for stronger defenses against such attacks. Abstract: Recently, people have suffered and become increasingly aware of the unreliability gap in LLMs for open and knowledge-intensive tasks, and thus turn to search-augmented LLMs to mitigate this issue. However, when the search engine is triggered for harmful tasks, the outcome is no longer under the LLM's control. Once the returned content directly contains targeted, ready-to-use harmful takeaways, the LLM's safeguards cannot withdraw that exposure. Motivated by this dilemma, we identify web search as a critical attack surface and propose SearchAttack for red-teaming. SearchAttack outsources the harmful semantics to web search, retaining only the query's skeleton and fragmented clues, and further steers LLMs to reconstruct the retrieved content via structural rubrics to achieve malicious goals. Extensive experiments are conducted to red-team the search-augmented LLMs for responsible vulnerability assessment. Empirically, SearchAttack demonstrates strong effectiveness in attacking these systems.

[104] Layer-wise Positional Bias in Short-Context Language Modeling

Maryam Rahimi,Mahdi Nouri,Yadollah Yaghoobzadeh

Main category: cs.CL

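TL;DR: Proposes an attribution-based framework for analyzing positional effects in short-context language modeling, using layer conductance with a sliding window to quantify how much importance each layer assigns to each input position. The resulting positional importance profiles are architecture-specific and stable, show a recency bias that strengthens with depth and a primacy bias that weakens, and reveal that early layers favor content words.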

Details

Motivation: The work asks whether language models' preference for particular input positions is independent of semantic relevance, how this positional bias evolves across layers and positions, and how it relates to task complexity. Method: An attribution-based framework combines layer conductance with a sliding-window approach to quantify the importance each layer assigns to each input position, producing layer-wise positional importance profiles. Result: The profiles are architecture-specific, stable across inputs, and invariant to lexical scrambling; there is a prominent recency bias that increases with model depth and a subtle primacy bias that gradually diminishes; early layers weight content words over function words at all positions, while later layers lose this distinction. Conclusion: Positional bias in language models is systematic, architecture-dependent, and evolves dynamically during processing, indicating that models are shaped by position and also shift from word-type-sensitive to word-type-insensitive representations. Abstract: Language models often show a preference for using information from specific positions in the input regardless of semantic relevance. While positional bias has been studied in various contexts, from attention sinks to task performance degradation in long-context settings, prior work has not established how these biases evolve across individual layers and input positions, or how they vary independent of task complexity. We introduce an attribution-based framework to analyze positional effects in short-context language modeling. Using layer conductance with a sliding-window approach, we quantify how each layer distributes importance across input positions, yielding layer-wise positional importance profiles. We find that these profiles are architecture-specific, stable across inputs, and invariant to lexical scrambling. Characterizing these profiles, we find prominent recency bias that increases with depth and subtle primacy bias that diminishes through model depth. Beyond positional structure, we also show that early layers preferentially weight content words over function words across all positions, while later layers lose this word-type differentiation.

[105] ContextFocus: Activation Steering for Contextual Faithfulness in Large Language Models

Nikhil Anand,Shwetha Somasundaram,Anirudh Phukan,Apoorv Saxena,Koyel Mukherjee

Main category: cs.CL

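TL;DR: Proposes ContextFocus, a lightweight activation steering method that improves the contextual faithfulness of large language models in knowledge-conflict settings, with no finetuning and minimal inference overhead.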

Details

Motivation: LLMs store a large amount of parametric knowledge during pre-training, but when externally retrieved context conflicts with that internal knowledge, they tend to fall back on memorized facts and produce unfaithful outputs. A method that effectively improves a model's ability to follow external context is therefore needed. Method: Proposes ContextFocus, a lightweight activation steering technique that adjusts the model's internal activations to strengthen its attention to the input context, improving contextual faithfulness without finetuning while preserving fluency and inference efficiency. Result: On the ConFiQA benchmark, ContextFocus outperforms strong baselines including ContextDPO, COIECD, and prompting-based methods. It is complementary to prompting strategies, remains effective on larger models, and significantly improves contextual faithfulness. Conclusion: ContextFocus is an efficient, robust, and scalable method that significantly improves the contextual faithfulness of LLMs under knowledge conflict while preserving inference efficiency and generation quality. Abstract: Large Language Models (LLMs) encode vast amounts of parametric knowledge during pre-training. As world knowledge evolves, effective deployment increasingly depends on their ability to faithfully follow externally retrieved context. When such evidence conflicts with the model's internal knowledge, LLMs often default to memorized facts, producing unfaithful outputs. In this work, we introduce ContextFocus, a lightweight activation steering approach that improves context faithfulness in such knowledge-conflict settings while preserving fluency and efficiency. Unlike prior approaches, our solution requires no model finetuning and incurs minimal inference-time overhead, making it highly efficient. We evaluate ContextFocus on the ConFiQA benchmark, comparing it against strong baselines including ContextDPO, COIECD, and prompting-based methods. Furthermore, we show that our method is complementary to prompting strategies and remains effective on larger models. Extensive experiments show that ContextFocus significantly improves contextual faithfulness. Our results highlight the effectiveness, robustness, and efficiency of ContextFocus in improving the contextual faithfulness of LLM outputs.

[106] LLMberjack: Guided Trimming of Debate Trees for Multi-Party Conversation Creation

Leonardo Bottona,Nicolò Penzo,Bruno Lepri,Marco Guerini,Sara Tonelli

Main category: cs.CL

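TL;DR: LLMberjack is an open-source platform for creating multi-party conversations from existing debate trees. It visualizes discussion trees and uses large language models to help produce coherent linearized dialogue sequences.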

Details

Motivation: Existing resources lack transparent, reproducible tools for creating multi-party conversations; a system that preserves speaker identity and discourse relations is needed. Method: Develops an interactive platform that visualizes reply-tree structures, lets users build linearized dialogue sequences, and integrates LLM assistance for editing messages and speaker descriptions. Result: The platform facilitates the creation of coherent, meaningful conversation threads, and LLM assistance improves output quality while reducing human effort. Conclusion: LLMberjack fills a gap in tooling for creating multi-party conversation resources and supports transparent, reproducible research and applications. Abstract: We present LLMberjack, a platform for creating multi-party conversations starting from existing debates, originally structured as reply trees. The system offers an interactive interface that visualizes discussion trees and enables users to construct coherent linearized dialogue sequences while preserving participant identity and discourse relations. It integrates optional large language model (LLM) assistance to support automatic editing of the messages and speakers' descriptions. We demonstrate the platform's utility by showing how tree visualization facilitates the creation of coherent, meaningful conversation threads and how LLM support enhances output quality while reducing human effort. The tool is open-source and designed to promote transparent and reproducible workflows to create multi-party conversations, addressing a lack of resources of this type.

[107] FLEx: Language Modeling with Few-shot Language Explanations

Adar Avsian,Christopher Richardson,Anirudh Sundar,Larry Heck

Main category: cs.CL

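TL;DR: FLEx improves language model behavior with a small number of explanatory examples. It selects representative errors via embedding-based clustering, verifies the associated explanations, and summarizes them into an inference-time prompt prefix, substantially reducing errors without modifying model weights.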

Details

Motivation: Language models still make mistakes across many tasks, and those mistakes often recur on related queries. Natural language explanations can correct them, but collecting explanations at scale is costly, especially in domains that require expert annotators. Method: FLEx selects representative model errors with embedding-based clustering, verifies that the associated explanations correct those errors, and summarizes the explanations into a prompt prefix added at inference time to steer the model away from similar errors, without modifying model weights. Result: On CounterBench, GSM8K, and ReasonIF, FLEx consistently outperforms chain-of-thought (CoT) prompting and removes up to 83% of CoT's remaining errors. Conclusion: With a small number of explanatory examples, FLEx effectively improves model performance and offers a finetuning-free, generalizable error-mitigation strategy. Abstract: Language models have become effective at a wide range of tasks, from math problem solving to open-domain question answering. However, they still make mistakes, and these mistakes are often repeated across related queries. Natural language explanations can help correct these errors, but collecting them at scale may be infeasible, particularly in domains where expert annotators are required. To address this issue, we introduce FLEx (Few-shot Language Explanations), a method for improving model behavior using a small number of explanatory examples. FLEx selects representative model errors using embedding-based clustering, verifies that the associated explanations correct those errors, and summarizes them into a prompt prefix that is prepended at inference time. This summary guides the model to avoid similar errors on new inputs, without modifying model weights. We evaluate FLEx on CounterBench, GSM8K, and ReasonIF. We find that FLEx consistently outperforms chain-of-thought (CoT) prompting across all three datasets and reduces up to 83% of CoT's remaining errors.
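
The select, verify, and summarize steps described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the toy 2-D embeddings, the plain k-means routine with farthest-point initialization, the precomputed `verified` flag, and the prefix template are all assumptions made here, since the abstract does not specify the embedding model, the clustering algorithm, or how the summary is written.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add the farthest point."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers

def kmeans(points, k, iters=20):
    """Plain k-means (a hypothetical stand-in for whatever clustering the paper uses)."""
    centers = farthest_point_init(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

def select_representatives(errors, k):
    """Per cluster, keep the error whose embedding lies closest to the cluster center."""
    centers, assign = kmeans([e["embedding"] for e in errors], k)
    reps = []
    for c in range(k):
        members = [e for e, a in zip(errors, assign) if a == c]
        if members:
            reps.append(min(members, key=lambda e: math.dist(e["embedding"], centers[c])))
    return reps

def build_prefix(reps, verify):
    """Keep explanations that pass verification and join them into a prompt prefix."""
    kept = [e["explanation"] for e in reps if verify(e)]
    return "Avoid these known mistakes:\n" + "\n".join(f"- {x}" for x in kept) + "\n\n"

# Toy error log; embeddings and explanations are invented for illustration only.
errors = [
    {"embedding": [0.0, 0.0], "explanation": "Carry the tens digit.", "verified": True},
    {"embedding": [0.1, 0.0], "explanation": "Re-check column addition.", "verified": True},
    {"embedding": [0.4, 0.0], "explanation": "Align decimal points.", "verified": True},
    {"embedding": [5.0, 5.0], "explanation": "State the counterfactual premise.", "verified": True},
    {"embedding": [5.1, 5.0], "explanation": "Do not assume the factual outcome.", "verified": True},
    {"embedding": [5.4, 5.0], "explanation": "Track which variable was intervened on.", "verified": True},
]

reps = select_representatives(errors, k=2)
# In a real pipeline, verify() would re-run the model with the explanation attached.
prefix = build_prefix(reps, verify=lambda e: e["verified"])
print(prefix + "Q: <new question here>")
```

The design point the sketch tries to capture is that only one explanation per error cluster reaches the prompt, which keeps the prefix short even when the raw error log is large.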

[108] All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection

Yuechen Jiang,Zhiwei Liu,Yupeng Cao,Yueru He,Ziyang Xu,Chen Xu,Zhiyang Deng,Prayag Tiwari,Xi Chen,Alejandro Lopez-Lira,Jimin Huang,Junichi Tsujii,Sophia Ananiadou

Main category: cs.CL

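TL;DR: RFC Bench is a benchmark for evaluating how well large language models identify misinformation in realistic financial news. Models are markedly weaker in the reference-free setting, which shows that they struggle to maintain coherent belief states without external grounding.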

Details

Motivation: Current evaluations of LLMs on financial misinformation detection do not systematically assess contextual complexity or reference-free reasoning; RFC Bench aims to fill this gap. Method: Proposes the RFC Bench benchmark with two tasks: reference-free misinformation detection and comparison-based diagnosis using paired original and perturbed inputs. It operates at the paragraph level to capture the dispersed semantic cues of financial news. Result: Models perform better when comparative context is available; in the reference-free setting they show unstable predictions and more invalid outputs. Conclusion: Current models struggle to reason reliably without external references. RFC Bench provides a structured testbed for studying reference-free reasoning and improving financial misinformation detection in real-world settings. Abstract: We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation under realistic news. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis using paired original and perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference-free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings.

cs.CV

[109] HyperCLOVA X 32B Think

NAVER Cloud HyperCLOVA X Team

Main category: cs.CV

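TL;DR: HyperCLOVA X 32B Think is a vision-language model focused on the Korean linguistic and cultural context, with an emphasis on reasoning and agentic ability. Trained in multiple stages, it performs strongly on Korean text-to-text, image-to-text, and agent-oriented tasks, and has been open-sourced to support research and innovation in academia and industry.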

Details

Motivation: To improve the reasoning and agentic abilities of vision-language models within the Korean linguistic and cultural context, addressing the shortcomings of existing models in localization and complex task handling. Method: Uses staged training: pre-training with a strong focus on reasoning, followed by post-training to support multimodal understanding, enhanced reasoning, agentic behaviors, and alignment with human preferences. Result: In comparisons against similarly sized models, the model performs strongly on Korean text-to-text and vision-to-text benchmarks and on agent-oriented tasks. Conclusion: HyperCLOVA X 32B Think offers strong multimodal reasoning and agentic capabilities in Korean settings, and its open-source release should help further research and applications in academia and industry. Abstract: In this report, we present HyperCLOVA X 32B Think, a vision-language model designed with particular emphasis on reasoning within the Korean linguistic and cultural context, as well as agentic ability. HyperCLOVA X 32B Think is pre-trained with a strong focus on reasoning capabilities and subsequently post-trained to support multimodal understanding, enhanced reasoning, agentic behaviors, and alignment with human preferences. Experimental evaluations against comparably sized models demonstrate that our model achieves strong performance on Korean text-to-text and vision-to-text benchmarks, as well as on agent-oriented evaluation tasks. By open-sourcing HyperCLOVA X 32B Think, we aim to support broader adoption and facilitate further research and innovation across both academic and industrial communities.

[110] CageDroneRF: A Large-Scale RF Benchmark and Toolkit for Drone Perception

Mohammad Rostami,Atik Faysal,Hongtao Xia,Hadi Kasasbeh,Ziang Gao,Huaxia Wang

Main category: cs.CV

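TL;DR: CageDroneRF (CDRF) is a large-scale benchmark dataset for RF-based drone detection and identification. It combines real-world captures with systematically generated synthetic data and supports model evaluation and development across diverse scenarios.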

Details

Motivation: Existing RF datasets are scarce and lack diversity, which limits the robustness and generalization of drone detection models. Method: Builds a dataset from real-world captures and designs a systematic augmentation pipeline that controls signal-to-noise ratio, injects interfering signals, and applies frequency shifts with label-consistent bounding-box transformations. Result: The dataset covers a wide range of contemporary drone models and acquisition conditions, including devices absent from existing public datasets, and ships with open-source tools for standardized evaluation of classification, open-set recognition, and object detection. Conclusion: By providing a comprehensive benchmark and toolchain, CDRF advances research on robust, reproducible RF perception models. Abstract: We present CageDroneRF (CDRF), a large-scale benchmark for Radio-Frequency (RF) drone detection and identification built from real-world captures and systematically generated synthetic variants. CDRF addresses the scarcity and limited diversity of existing RF datasets by coupling extensive raw recordings with a principled augmentation pipeline that (i) precisely controls Signal-to-Noise Ratio (SNR), (ii) injects interfering emitters, and (iii) applies frequency shifts with label-consistent bounding-box transformations for detection. This dataset spans a wide range of contemporary drone models, many unavailable in current public datasets, and acquisition conditions, derived from data collected at the Rowan University campus and within a controlled RF-cage facility. CDRF is released with interoperable open-source tools for data generation, preprocessing, augmentation, and evaluation that also operate on existing public benchmarks. CDRF enables standardized benchmarking for classification, open-set recognition, and object detection, supporting rigorous comparisons and reproducible pipelines. By releasing this comprehensive benchmark and tooling, CDRF aims to accelerate progress toward robust, generalizable RF perception models.
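
Two of the augmentation steps named in this abstract, SNR control and frequency shifting with label-consistent boxes, can be illustrated on complex baseband samples. The sketch below is an assumption-laden toy, not the CDRF toolkit: the function names, the `(t_start, t_end, f_low, f_high)` box layout, and the choice to rescale the noise so the realized SNR matches the target exactly are all hypothetical.

```python
import cmath
import math
import random

def power(x):
    """Mean power of a complex sample sequence."""
    return sum(abs(v) ** 2 for v in x) / len(x)

def add_noise_at_snr(signal, snr_db, rng):
    """Add complex Gaussian noise, rescaled so the realized SNR equals snr_db."""
    noise = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in signal]
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    noise = [scale * n for n in noise]
    return [s + n for s, n in zip(signal, noise)], noise

def frequency_shift(signal, shift_hz, sample_rate):
    """Shift the spectrum by shift_hz by multiplying with a complex exponential."""
    return [s * cmath.exp(2j * math.pi * shift_hz * n / sample_rate)
            for n, s in enumerate(signal)]

def shift_box(box, shift_hz):
    """Apply the same frequency offset to a (t_start, t_end, f_low, f_high) label."""
    t0, t1, f0, f1 = box
    return (t0, t1, f0 + shift_hz, f1 + shift_hz)

sample_rate = 1000.0
# A 100 Hz complex tone stands in for a recorded drone emission.
tone = [cmath.exp(2j * math.pi * 100.0 * n / sample_rate) for n in range(1000)]

noisy, noise = add_noise_at_snr(tone, snr_db=10.0, rng=random.Random(0))
measured_snr_db = 10 * math.log10(power(tone) / power(noise))

shifted = frequency_shift(tone, shift_hz=50.0, sample_rate=sample_rate)
# Correlate against a 150 Hz reference; a magnitude near 1 means the tone moved there.
tone_at_150 = abs(sum(s * cmath.exp(-2j * math.pi * 150.0 * n / sample_rate)
                      for n, s in enumerate(shifted))) / len(shifted)

shifted_box = shift_box((0.0, 1.0, 90.0, 110.0), 50.0)
print(measured_snr_db, tone_at_150, shifted_box)
```

Keeping the box transform next to the signal transform is what makes the augmentation label-consistent: the detection label moves by exactly the offset applied to the samples.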

[111] Mass Concept Erasure in Diffusion Models with Concept Hierarchy

Jiahang Tu,Ye Li,Yiming Wu,Hanbin Zhao,Chao Zhang,Hui Qian

Main category: cs.CV

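TL;DR: Proposes a concept erasure method built on a supertype-subtype hierarchy, in which grouped concepts share parameters for efficient and effective multi-concept suppression, and introduces SuPLoRA to mitigate the loss of generation quality.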

Details

Motivation: As diffusion models advance, harmful content generation has become a growing concern. Existing concept erasure methods are inefficient when handling many concepts and degrade generation quality, so a more efficient multi-concept suppression scheme is needed. Method: Builds a supertype-subtype concept hierarchy that places semantically related erased concepts under a shared parent node; erases each group jointly with a single shared set of learnable parameters; and proposes SuPLoRA, which freezes the down-projection matrix and updates only the up-projection matrix to preserve supertype information, combined with standard diffusion regularization to retain denoising ability. Result: On a benchmark requiring joint erasure of concepts across domains (celebrities, objects, pornographic content), the method clearly outperforms one-by-one erasure, suppressing target concepts while better preserving generation quality and non-erased content. Conclusion: By introducing a concept hierarchy and the SuPLoRA update scheme, the work achieves efficient, high-quality joint erasure of many concepts and offers a scalable approach to safety control in diffusion models. Abstract: The success of diffusion models has raised concerns about the generation of unsafe or harmful content, prompting concept erasure approaches that fine-tune modules to suppress specific concepts while preserving general generative capabilities. However, as the number of erased concepts grows, these methods often become inefficient and ineffective, since each concept requires a separate set of fine-tuned parameters and may degrade the overall generation quality. In this work, we propose a supertype-subtype concept hierarchy that organizes erased concepts into a parent-child structure. Each erased concept is treated as a child node, and semantically related concepts (e.g., macaw and bald eagle) are grouped under a shared parent node, referred to as a supertype concept (e.g., bird). Rather than erasing concepts individually, we introduce an effective and efficient group-wise suppression method, where semantically similar concepts are grouped and erased jointly by sharing a single set of learnable parameters. During the erasure phase, standard diffusion regularization is applied to preserve the denoising process in unmasked regions. To mitigate the degradation of supertype generation caused by excessive erasure of semantically related subtypes, we propose a novel method called Supertype-Preserving Low-Rank Adaptation (SuPLoRA), which encodes the supertype concept information in the frozen down-projection matrix and updates only the up-projection matrix during erasure. Theoretical analysis demonstrates the effectiveness of SuPLoRA in mitigating generation performance degradation. We construct a more challenging benchmark that requires simultaneous erasure of concepts across diverse domains, including celebrities, objects, and pornographic content.
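
The "frozen down-projection, trainable up-projection" arrangement mentioned above can be illustrated with a generic low-rank adapter. The sketch below is not SuPLoRA itself: how the paper encodes supertype information into the down-projection is not given in the abstract, so the matrices, the squared-error objective, and the single gradient step here are assumptions chosen only to show one general property of such a layer, namely that inputs the frozen down-projection maps to zero are unaffected however the up-projection is trained.

```python
def matvec(M, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x):
    """y = W x + B (A x): frozen base weight plus a low-rank update."""
    base = matvec(W, x)
    low = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low)]

def step_up_projection(W, A, B, x, target, lr):
    """One gradient step on 0.5 * ||y - target||^2 with respect to B only (W and A stay frozen)."""
    y = lora_forward(W, A, B, x)
    err = [yi - ti for yi, ti in zip(y, target)]
    z = matvec(A, x)  # rank-r code produced by the frozen down-projection
    return [[B[i][j] - lr * err[i] * z[j] for j in range(len(z))] for i in range(len(B))]

# Toy sizes: 3 inputs, 2 outputs, rank 1. All values are illustrative.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen base weight
A = [[1.0, 0.0, 0.0]]                    # frozen down-projection (reads only coordinate 0)
B = [[0.0], [0.0]]                       # trainable up-projection, zero-initialized

x_train = [1.0, 2.0, 3.0]                # input whose output is pushed toward the target
x_null = [0.0, 1.0, 1.0]                 # input with A @ x_null == 0

before_train = lora_forward(W, A, B, x_train)
before_null = lora_forward(W, A, B, x_null)

B_new = step_up_projection(W, A, B, x_train, target=[0.0, 0.0], lr=0.1)

after_train = lora_forward(W, A, B_new, x_train)   # moves toward the target
after_null = lora_forward(W, A, B_new, x_null)     # identical to before_null
print(before_train, after_train, before_null, after_null)
```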

[112] VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Jianke Zhang,Xiaoyu Chen,Qiuyue Wang,Mingsheng Li,Yanjiang Guo,Yucheng Hu,Jiajun Zhang,Shuai Bai,Junyang Lin,Jianyu Chen

Main category: cs.CV

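TL;DR: Proposes VLM4VLA, a minimal adaptation pipeline that converts general-purpose vision-language models (VLMs) into vision-language-action (VLA) policies. It systematically studies how VLM choice and competence affect downstream control performance and reveals a domain gap between current VLM pretraining objectives and the requirements of embodied action planning.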

Details

Motivation: To examine how VLM choice and competence translate into downstream VLA policy performance, challenging the common assumption that a VLM with strong general capabilities yields a strong VLA. Method: Proposes the VLM4VLA framework, which converts a VLM into a VLA policy using only a small number of learnable parameters; runs extensive experiments on three benchmarks; analyzes the contributions of the vision and language modules through modality-level ablations; and fine-tunes VLMs on seven auxiliary embodied tasks to probe the role of specific capabilities. Result: VLM initialization helps, but general capabilities do not predict downstream performance; improving specific embodied skills does not guarantee better control; the visual module is the main bottleneck, and injecting control-relevant supervision into the vision encoder yields consistent gains. Conclusion: The general capabilities of standard VLMs are necessary but not sufficient for embodied control. A persistent domain gap separates current VLM pretraining objectives from the needs of action planning, so improving the visual module and adding control-relevant supervision matter more. Abstract: Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.

[113] Deep Learning-Based Image Recognition for Soft-Shell Shrimp Classification

Yun-Hao Zhang,I-Hsien Ting,Dario Liberona,Yun-Hsiu Liu,Kazunori Minetaki

Main category: cs.CV

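TL;DR: Uses deep learning-based image recognition, specifically a convolutional neural network (CNN), to classify white shrimp automatically after harvest, improving classification accuracy, efficiency, and consistency so that freshness is maintained and the appearance of processed products improves.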

Details

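Motivation: Consumer demand for high-quality aquatic products is rising, shrimp freshness declines rapidly after harvest, and soft-shell shrimp are prone to head-body separation after cooking or freezing, which hurts product appearance and consumer perception. A more efficient classification method is needed to improve product quality and transport efficiency. Method: Applies a deep learning-based convolutional neural network (CNN) for image recognition to automate post-harvest classification of white shrimp, replacing manual sorting. Result: The CNN model improves the accuracy, efficiency, and consistency of white shrimp classification, shortens processing time, helps maintain freshness, and reduces appearance damage during processing. Conclusion: Deep learning-based image recognition can be applied effectively to shrimp sorting in aquaculture, raising the level of production automation and product competitiveness. Abstract: With the integration of information technology into aquaculture, production has become more stable and continues to grow annually. As consumer demand for high-quality aquatic products rises, freshness and appearance integrity are key concerns. In shrimp-based processed foods, freshness declines rapidly post-harvest, and soft-shell shrimp often suffer from head-body separation after cooking or freezing, affecting product appearance and consumer perception. To address these issues, this study leverages deep learning-based image recognition for automated classification of white shrimp immediately after harvest. A convolutional neural network (CNN) model replaces manual sorting, enhancing classification accuracy, efficiency, and consistency. By reducing processing time, this technology helps maintain freshness and ensures that shrimp transportation businesses meet customer demands more effectively.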

[114] Higher order PCA-like rotation-invariant features for detailed shape descriptors modulo rotation

Jarek Duda

Main category: cs.CV

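TL;DR: Proposes rotation-invariant features based on higher-order tensors and polynomial-times-Gaussian models, for describing complex shapes accurately and comparing shapes efficiently without optimizing over rotations.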

Details

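Motivation: Standard PCA extracts rotation-invariant features from the second-order covariance matrix, but this cannot accurately describe complex shapes, so it needs to be extended to higher-order statistics. Method: Generalizes PCA to tensors of order three and above (e.g., $p_{abc}$) to model higher-order central moments of a shape, and combines this with polynomial-times-Gaussian models to construct decodable, high-accuracy shape descriptors and their rotation invariants. Result: The approach can produce rotation-invariant shape descriptors of arbitrarily high accuracy, suitable for molecular shape description, 2D/3D object recognition, and shape similarity measurement, while avoiding the computational cost of optimizing over rotations. Conclusion: The method substantially increases the ability to describe complex shapes while retaining rotation invariance, providing an efficient and flexible tool for shape analysis. Abstract: PCA can be used for rotation-invariant features, describing a shape with its covariance matrix $p_{ab}=E[(x_a-E[x_a])(x_b-E[x_b])]$, which approximates the shape by an ellipsoid and allows rotation invariants such as traces of its powers. However, real shapes are usually much more complicated, hence an extension is proposed to, e.g., order-3 tensors $p_{abc}=E[(x_a-E[x_a])(x_b-E[x_b])(x_c-E[x_c])]$ or higher-order tensors describing central moments, or to polynomial-times-Gaussian models allowing decodable shape descriptors of arbitrarily high accuracy, together with their analogous rotation invariants. Practical applications could include rotation-invariant features describing shape modulo rotation, e.g. for molecular shape descriptors or for up-to-rotation object recognition in 2D images and 3D scans, or a shape similarity metric allowing inexpensive comparison (modulo rotation) without costly optimization over rotations.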
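
The invariants mentioned in this abstract can be checked numerically on a small 2D point cloud. The sketch below computes the order-2 tensor $p_{ab}$ and the order-3 tensor $p_{abc}$, then compares a few scalar contractions before and after a rotation. The abstract names only traces of powers of $p_{ab}$; the two order-3 contractions used here (the full contraction $p_{abc}p_{abc}$ and the squared norm of $v_a=\sum_b p_{abb}$) are standard tensor invariants picked for illustration and may differ from the ones the paper uses.

```python
import math
import random

def central_moments(points):
    """Return the order-2 tensor p_ab and the order-3 tensor p_abc of a point cloud."""
    n, d = len(points), len(points[0])
    mean = [sum(p[a] for p in points) / n for a in range(d)]
    centered = [[p[a] - mean[a] for a in range(d)] for p in points]
    p2 = [[sum(x[a] * x[b] for x in centered) / n for b in range(d)] for a in range(d)]
    p3 = [[[sum(x[a] * x[b] * x[c] for x in centered) / n for c in range(d)]
           for b in range(d)] for a in range(d)]
    return p2, p3

def invariants(points):
    """Scalar contractions that do not change when the point cloud is rotated."""
    p2, p3 = central_moments(points)
    idx = range(len(p2))
    trace_p = sum(p2[a][a] for a in idx)                               # Tr(p)
    trace_p_squared = sum(p2[a][b] * p2[b][a] for a in idx for b in idx)  # Tr(p^2)
    full_contraction = sum(p3[a][b][c] ** 2 for a in idx for b in idx for c in idx)
    v = [sum(p3[a][b][b] for b in idx) for a in idx]                   # v_a = sum_b p_abb
    v_norm_squared = sum(x * x for x in v)
    return [trace_p, trace_p_squared, full_contraction, v_norm_squared]

def rotate2d(points, theta):
    """Rotate every 2D point by angle theta about the origin."""
    co, si = math.cos(theta), math.sin(theta)
    return [(co * x - si * y, si * x + co * y) for x, y in points]

rng = random.Random(0)
# A deliberately skewed cloud, so that the third-order moments are not all zero.
points = [(3.0 * rng.random() ** 2, rng.random()) for _ in range(200)]

before = invariants(points)
after = invariants(rotate2d(points, 0.7))
for name, b, a in zip(["Tr(p)", "Tr(p^2)", "p_abc p_abc", "|v|^2"], before, after):
    print(f"{name}: {b:.6f} -> {a:.6f}")
```

Because each quantity is a full contraction of tensors that transform with the same rotation matrix on every index, two shapes can be compared through these numbers directly, without searching over relative rotations.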

[115] MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models

Yang Shi,Yifeng Xie,Minzhe Guo,Liangsi Lu,Mingxuan Huang,Jingchao Wang,Zhihong Zhu,Boyan Xu,Zhiqi Huang

Main category: cs.CV

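TL;DR: Presents MMErroR, a multi-modal benchmark of 2,013 samples for evaluating whether vision-language models can detect reasoning errors and identify their types, and shows that existing models remain limited at this task.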

Details

Motivation: To investigate whether vision-language models truly understand the content they process, in particular whether they can detect erroneous reasoning and identify the error type. Method: Builds MMErroR, a multi-modal erroneous-reasoning benchmark spanning six top-level domains and 24 subdomains, and evaluates 20 advanced VLMs, requiring them to detect and classify reasoning errors in visual and linguistic contexts. Result: Even the best-performing model, Gemini-3.0-Pro, classifies the error correctly in only 66.47% of cases, showing that identifying erroneous reasoning remains challenging for current models. Conclusion: Accurately identifying reasoning errors helps clarify the reasoning capabilities of multi-modal models, and MMErroR provides a useful evaluation tool for future research. Abstract: Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 2,013 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model (Gemini-3.0-Pro) classifies the error in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal reasoning models. Project Page: https://mmerror-benchmark.github.io

[116] RelightAnyone: A Generalized Relightable 3D Gaussian Head Model

Yingyan Xu,Pramod Rao,Sebastian Weiss,Gaspard Zoss,Markus Gross,Christian Theobalt,Marc Habermann,Derek Bradley

Main category: cs.CV

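TL;DR: Proposes a new relightable 3D Gaussian Splatting head model that can relight any subject seen in single- or multi-view images with high quality, without requiring OLAT data for that subject.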

Details

Motivation: Existing methods need complex time-multiplexed illumination (such as OLAT) for high-quality relighting, which limits where they can be applied. Method: Uses a two-stage design. The first stage models flat-lit 3DGS head avatars without OLAT lighting; the second stage learns a mapping to physically based reflectance parameters for high-quality relighting. The first stage is trained on multi-view datasets to ensure cross-subject generalization, learning a dataset-specific lighting code for self-supervised lighting alignment; the second stage is trained on a small OLAT dataset. Result: The model relights arbitrary subjects with high quality without OLAT data for them, can be fitted from a single image, and supports novel view synthesis and relighting of digital avatars. Conclusion: The method generalizes well to unseen subjects, substantially reduces the dependence on complex lighting setups, and broadens the practical use of 3D Gaussian Splatting. Abstract: 3D Gaussian Splatting (3DGS) has become a standard approach to reconstruct and render photorealistic 3D head avatars. A major challenge is to relight the avatars to match any scene illumination. For high quality relighting, existing methods require subjects to be captured under complex time-multiplexed illumination, such as one-light-at-a-time (OLAT). We propose a new generalized relightable 3D Gaussian head model that can relight any subject observed in single- or multi-view images without requiring OLAT data for that subject. Our core idea is to learn a mapping from flat-lit 3DGS avatars to corresponding relightable Gaussian parameters for that avatar. Our model consists of two stages: a first stage that models flat-lit 3DGS avatars without OLAT lighting, and a second stage that learns the mapping to physically-based reflectance parameters for high-quality relighting. This two-stage design allows us to train the first stage across diverse existing multi-view datasets without OLAT lighting ensuring cross-subject generalization, where we learn a dataset-specific lighting code for self-supervised lighting alignment. Subsequently, the second stage can be trained on a significantly smaller dataset of subjects captured under OLAT illumination. Together, this allows our method to generalize well and relight any subject from the first stage as if we had captured them under OLAT lighting. Furthermore, we can fit our model to unseen subjects from as little as a single image, allowing several applications in novel view synthesis and relighting for digital avatars.

[117] Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views

Xiang Zhang,Yang Zhang,Lukas Mehl,Markus Gross,Christopher Schroers

Main category: cs.CV

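TL;DR: Proposes HairGuard, a framework for recovering fine-grained soft-boundary details in 3D vision. Through data curation, a depth fixer network, and a generative scene painter, it achieves state-of-the-art performance on several 3D tasks.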

Details

Motivation: Soft boundaries such as thin hairs are common in natural and computer-generated images, but the ambiguous mixing of foreground and background cues makes them hard for 3D vision, and existing methods struggle to preserve these fine details. Method: Proposes a new data curation pipeline that uses image matting datasets for training; designs a depth fixer network with a gated residual module that refines depth in soft-boundary regions; uses depth-based forward warping and a generative scene painter to fill disoccluded regions and remove background artifacts; and adaptively combines the results with a color fuser. Result: Achieves state-of-the-art performance on monocular depth estimation, stereo image/video conversion, and novel view synthesis, with significant improvements in soft-boundary regions. Conclusion: HairGuard effectively recovers soft-boundary details in 3D vision, generalizes well, integrates in a plug-and-play manner, and markedly improves reconstruction quality in complex boundary regions. Abstract: Soft boundaries, like thin hairs, are commonly observed in natural and computer-generated imagery, but they remain challenging for 3D vision due to the ambiguous mixing of foreground and background cues. This paper introduces Guardians of the Hair (HairGuard), a framework designed to recover fine-grained soft boundary details in 3D vision tasks. Specifically, we first propose a novel data curation pipeline that leverages image matting datasets for training and design a depth fixer network to automatically identify soft boundary regions. With a gated residual module, the depth fixer refines depth precisely around soft boundaries while maintaining global depth quality, allowing plug-and-play integration with state-of-the-art depth models. For view synthesis, we perform depth-based forward warping to retain high-fidelity textures, followed by a generative scene painter that fills disoccluded regions and eliminates redundant background artifacts within soft boundaries. Finally, a color fuser adaptively combines warped and inpainted results to produce novel views with consistent geometry and fine-grained details. Extensive experiments demonstrate that HairGuard achieves state-of-the-art performance across monocular depth estimation, stereo image/video conversion, and novel view synthesis, with significant improvements in soft boundary regions.

[118] RiskCueBench: Benchmarking Anticipatory Reasoning from Early Risk Cues in Video-Language Models

Sha Luo,Yogesh Prabhu,Tim Ossowski,Kaiping Chen,Junjie Hu

Main category: cs.CV

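TL;DR: Introduces RiskCueBench, a new video understanding benchmark that evaluates potential safety risks in visual data by identifying the earliest risk signal clip, to better reflect real-world conditions.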

Details

Motivation: Existing risk assessment datasets usually include the whole accident, which makes the task easier and fails to reflect the real-world need to predict risk from early signals. Method: Builds a new benchmark, RiskCueBench, in which videos are carefully annotated to locate the earliest "risk signal clip" that indicates a potential safety concern, and uses it to evaluate models' ability to predict future risk from early visual cues. Result: Experiments show that current systems fall well short in understanding and anticipating risky events that develop from early visual signals. Conclusion: RiskCueBench exposes important challenges that existing video risk prediction models face in real deployments and encourages research on anticipating future risk. Abstract: With the rapid growth of video-centered social media, the ability to anticipate risky events from visual data is a promising direction for ensuring public safety and preventing real-world accidents. Prior work has extensively studied supervised video risk assessment across domains such as driving, protests, and natural disasters. However, many existing datasets provide models with access to the full video sequence, including the accident itself, which substantially reduces the difficulty of the task. To better reflect real-world conditions, we introduce a new video understanding benchmark, RiskCueBench, in which videos are carefully annotated to identify a risk signal clip, defined as the earliest moment that indicates a potential safety concern. Experimental results reveal a significant gap in current systems' ability to interpret evolving situations and anticipate future risky events from early visual signals, highlighting important challenges for deploying video risk prediction models in practice.

[119] A Novel Unified Approach to Deepfake Detection

Lord Sen,Shyamapada Mukherjee

Main category: cs.CV

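TL;DR: Proposes a new architecture for Deepfake detection in images and videos that combines cross attention between spatial and frequency domain features with a blood detection module, reporting results better than current state-of-the-art methods and good cross-dataset generalization.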

Details

Motivation: As AI advances, the synthesis and misuse of Deepfakes pose a serious threat, and effective detection and tagging methods are needed to sustain trust in the digital age. Method: Fuses spatial and frequency domain features with cross attention, adds a blood detection module, and classifies images as real or fake using models such as Swin Transformer, BERT, or EfficientNet-B4. Result: Reaches 99.80% and 99.88% AUC on FF++ and Celeb-DF with Swin Transformer and BERT, and 99.55% and 99.38% AUC with EfficientNet-B4 and BERT, and performs well in cross-dataset tests. Conclusion: The proposed unified architecture shows strong performance and generalization on Deepfake detection and offers an effective response to the deepfake threat. Abstract: Advancements in the field of AI are increasingly giving rise to various threats. One of the most prominent of them is the synthesis and misuse of Deepfakes. To sustain trust in this digital age, detection and tagging of deepfakes is very necessary. In this paper, a novel architecture for Deepfake detection in images and videos is presented. The architecture uses cross attention between spatial and frequency domain features along with a blood detection module to classify an image as real or fake. This paper aims to develop a unified architecture and provide insights into each step. Through this approach we achieve results better than SOTA, specifically 99.80% and 99.88% AUC on FF++ and Celeb-DF when using Swin Transformer and BERT, and 99.55% and 99.38% when using EfficientNet-B4 and BERT. The approach also generalizes very well, achieving strong cross-dataset results.

[120] Better, But Not Sufficient: Testing Video ANNs Against Macaque IT Dynamics

Matteo Dunnhofer,Christian Micheloni,Kohitij Kar

Main category: cs.CV

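TL;DR: Compares primate IT cortex responses to naturalistic videos with the responses of several classes of artificial neural network models, and finds that current video models, despite modest improvements, fail to capture the appearance-invariant dynamic computations present in IT.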

Details

Motivation: To determine whether primate IT cortex responses to dynamic visual stimuli go beyond static feedforward computation and embody richer dynamic processing. Method: Compares macaque IT neural responses during naturalistic video viewing with the responses of static, recurrent, and video-trained ANN models, and tests decoder generalization on "appearance-free" videos that remove appearance but preserve motion. Result: Video models yield modest gains at later response stages, but in the appearance-invariant test no ANN class generalizes, whereas IT population activity does. Conclusion: Current video-based ANN models mainly capture appearance-bound dynamics and do not reflect the appearance-invariant dynamic computations found in IT; new objectives that encode biological temporal statistics and invariances are needed. Abstract: Feedforward artificial neural networks (ANNs) trained on static images remain the dominant models of the primate ventral visual stream, yet they are intrinsically limited to static computations. The primate world is dynamic, and the macaque ventral visual pathway, specifically the inferior temporal (IT) cortex, not only supports object recognition but also encodes object motion velocity during naturalistic video viewing. Do IT's temporal responses reflect nothing more than time-unfolded feedforward transformations, framewise features with shallow temporal pooling, or do they embody richer dynamic computations? We tested this by comparing macaque IT responses during naturalistic videos against static, recurrent, and video-based ANN models. Video models provided modest improvements in neural predictivity, particularly at later response stages, raising the question of what kind of dynamics they capture. To probe this, we applied a stress test: decoders trained on naturalistic videos were evaluated on "appearance-free" variants that preserve motion but remove shape and texture. IT population activity generalized across this manipulation, but all ANN classes failed. Thus, current video models better capture appearance-bound dynamics rather than the appearance-invariant temporal computations expressed in IT, underscoring the need for new objectives that encode biological temporal statistics and invariances.

[121] Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning

Ali Najar,Alireza Mirrokni,Arshia Izadyari,Sadegh Mohammadian,Amir Homayoon Sharifizade,Asal Meskin,Mobin Bagherian,Ehsaneddin Asgari

Main category: cs.CV

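TL;DR: Introduces Eye-Q, a multilingual visual word puzzle benchmark for evaluating vision-language models on complex visual understanding tasks that require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts. Experiments show a substantial performance gap for existing models.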

Details

Motivation: Existing vision-language models perform well on standard benchmarks but usually rely on surface-level recognition rather than deeper reasoning. Challenging this limitation requires tasks that assess deeper visual understanding. Method: Proposes visual word puzzles as a more challenging alternative and builds Eye-Q, a multilingual benchmark of 1,343 puzzles covering English, Persian, Arabic, and cross-lingual puzzles. An open-ended, human-aligned evaluation protocol tests hypothesis formation and revision under lightweight assistance. Result: The best state-of-the-art vision-language model reaches only 60.27% accuracy on Eye-Q, with especially weak results on abstract and cross-lingual puzzles, exposing limits in constructing and searching over appropriate conceptual representations. Conclusion: Visual word puzzles are an effective way to evaluate deeper reasoning in vision-language models. Existing models still have major shortcomings in flexible image-to-phrase inference and need stronger abstraction and associative reasoning. Abstract: Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks, yet often rely on surface-level recognition rather than deeper reasoning. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts in ways that are difficult to solve via literal grounding, OCR-heavy shortcuts, or simple retrieval-style matching. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding. Eye-Q contains 1,343 puzzles in which a model observes a conceptually dense scene with a brief description and must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention, abstraction, and associative inference. The benchmark spans English, Persian, Arabic, and cross-lingual puzzles. We evaluate state-of-the-art VLMs using an open-ended, human-aligned protocol that probes hypothesis formation and revision under lightweight assistance. Results reveal substantial performance gaps, especially on abstract and cross-lingual puzzles, highlighting limitations in current models' ability to construct and search over appropriate conceptual representations for flexible image-to-phrase inference; maximum accuracy reaches only 60.27%.

[122] GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

Xiangdong Hu,Yangyang Jiang,Qin Hu,Xiaojun Jia

Main category: cs.CV

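TL;DR: Proposes GAMBIT, a new attack framework that builds gamified scenes to induce multimodal large language models to complete a jailbreak proactively during reasoning, significantly raising the attack success rate against models with chain-of-thought capabilities.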

Details

Motivation: Existing adversarial attacks are of limited effectiveness against multimodal large language models with reasoning ability because they do not exploit the model's own reasoning incentives. This work aims to achieve a more effective jailbreak by influencing the model's cognitive-stage decisions. Method: Proposes the GAMBIT framework, which decomposes and reassembles harmful visual semantics and constructs a guided, gamified scene that drives the model to reconstruct the malicious intent and answer while pursuing the game's goal, forming a structured reasoning chain that weakens its safety attention. Result: Experiments on several mainstream multimodal large language models show attack success rates of up to 92.13% (Gemini 2.5 Flash), 91.20% (QvQ-MAX), and 85.87% (GPT-4o), significantly above baseline methods. Conclusion: Gamified traps that exploit a model's reasoning mechanism can effectively break the safety alignment of multimodal large language models, revealing a cognitive-level weakness in current safety mechanisms. Abstract: Multimodal Large Language Models (MLLMs) have become widely deployed, yet their safety alignment remains fragile under adversarial inputs. Previous work has shown that increasing inference steps can disrupt safety mechanisms and lead MLLMs to generate attacker-desired harmful content. However, most existing attacks focus on increasing the complexity of the modified visual task itself and do not explicitly leverage the model's own reasoning incentives. This leads to them underperforming on reasoning models (Models with Chain-of-Thoughts) compared to non-reasoning ones (Models without Chain-of-Thoughts). If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBIT (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. The resulting structured reasoning chain increases task complexity in both vision and text, positioning the model as a participant whose goal pursuit reduces safety attention and induces it to answer the reconstructed malicious query. Extensive experiments on popular reasoning and non-reasoning MLLMs demonstrate that GAMBIT achieves high Attack Success Rates (ASR), reaching 92.13% on Gemini 2.5 Flash, 91.20% on QvQ-MAX, and 85.87% on GPT-4o, significantly outperforming baselines.

[123] WeedRepFormer: Reparameterizable Vision Transformers for Real-Time Waterhemp Segmentation and Gender Classification

Toqi Tahamid Sarker,Taminul Islam,Khaled R. Ahmed,Cristiana Bernardi Rankrape,Kaitlin E. Creager,Karla Gage

Main category: cs.CV

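TL;DR: WeedRepFormer is a lightweight multi-task Vision Transformer for simultaneous waterhemp segmentation and gender classification, achieving high accuracy and efficient inference on the authors' own dataset.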

Details

Motivation: Existing agricultural models struggle to balance the fine-grained feature extraction needed for biological attribute classification against the efficiency needed for real-time deployment. Method: Proposes WeedRepFormer, which systematically applies structural reparameterization across the entire architecture, including the Vision Transformer backbone, the Lite R-ASPP decoder, and a novel reparameterizable classification head, to decouple training-time capacity from inference-time latency. Result: On a waterhemp dataset of 10,264 annotated frames, the model reaches 92.18% mIoU (segmentation) and 81.91% accuracy (gender classification) with only 3.59M parameters and 3.80 GFLOPs, running at 108.95 FPS. Conclusion: WeedRepFormer beats the state-of-the-art iFormer-T by 4.40% in gender classification accuracy while cutting the parameter count by 1.9x, balancing performance and efficiency for real-time agricultural use. Abstract: We present WeedRepFormer, a lightweight multi-task Vision Transformer designed for simultaneous waterhemp segmentation and gender classification. Existing agricultural models often struggle to balance the fine-grained feature extraction required for biological attribute classification with the efficiency needed for real-time deployment. To address this, WeedRepFormer systematically integrates structural reparameterization across the entire architecture - comprising a Vision Transformer backbone, a Lite R-ASPP decoder, and a novel reparameterizable classification head - to decouple training-time capacity from inference-time latency. We also introduce a comprehensive waterhemp dataset containing 10,264 annotated frames from 23 plants. On this benchmark, WeedRepFormer achieves 92.18% mIoU for segmentation and 81.91% accuracy for gender classification using only 3.59M parameters and 3.80 GFLOPs. At 108.95 FPS, our model outperforms the state-of-the-art iFormer-T by 4.40% in classification accuracy while maintaining competitive segmentation performance and significantly reducing parameter count by 1.9x.
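
The abstract does not say which layers are merged or how, so the following is only a generic sketch of structural reparameterization, not WeedRepFormer's actual blocks. It assumes a hypothetical training-time block with two parallel linear branches plus an identity shortcut, and shows that the three can be folded into a single matrix for inference without changing the output. All names (`train_time_forward`, `reparameterize`) are illustrative.

```python
# Generic illustration of structural reparameterization (assumed block layout,
# not the paper's layers): parallel linear branches are folded into one matrix.

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def train_time_forward(w_a, w_b, x):
    """Training-time block: two linear branches plus an identity shortcut."""
    ya, yb = matvec(w_a, x), matvec(w_b, x)
    return [a + b + xi for a, b, xi in zip(ya, yb, x)]

def reparameterize(w_a, w_b):
    """Fold both branches and the identity shortcut into a single matrix."""
    n = len(w_a)
    return [[w_a[i][j] + w_b[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]

w_a = [[0.5, -1.0], [2.0, 0.25]]
w_b = [[1.5, 0.0], [-0.5, 1.0]]
x = [3.0, -2.0]

w_merged = reparameterize(w_a, w_b)
y_train = train_time_forward(w_a, w_b, x)  # three branches
y_infer = matvec(w_merged, x)              # one matrix multiply, same result
```

Because every branch is linear, the merge is exact; the extra branches add capacity during training but cost nothing at inference time.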

[124] FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder

Zeyu Dong,Yimin Zhu,Yu Wu,Yu Sun

Main category: cs.CV

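TL;DR: This paper proposes FROST-Drive, a new architecture that improves the generalization of end-to-end autonomous driving models by freezing a pretrained vision encoder; on the Waymo dataset it significantly outperforms full fine-tuning.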

Details

Motivation: Fine-tuning the vision encoder in existing end-to-end driving models can over-specialize it to the training data, weakening generalization in novel and complex scenarios; this paper challenges that training paradigm. Method: Proposes the FROST-Drive architecture, which freezes the weights of a vision encoder taken from a Vision-Language Model (VLM) to retain its general world knowledge, combines it with a transformer-based multimodal fusion adapter and a GRU-based decoder that generates smooth waypoints, and designs a loss function that directly optimizes the Rater Feedback Score (RFS). Result: Experiments on the large-scale Waymo Open E2E Dataset (focused on long-tail scenarios) show that the method significantly outperforms fully fine-tuned models, supporting the advantage of the frozen-encoder strategy for robust trajectory planning. Conclusion: Preserving the broad knowledge of a capable VLM serves robust, generalizable driving better than deep domain adaptation, offering a new direction for applying vision-driven models in complex real-world environments. Abstract: End-to-end (E2E) models in autonomous driving aim to directly map sensor inputs to control commands, but their ability to generalize to novel and complex scenarios remains a key challenge. The common practice of fully fine-tuning the vision encoder on driving datasets potentially limits its generalization by causing the model to specialize too heavily in the training data. This work challenges the necessity of this training paradigm. We propose FROST-Drive, a novel E2E architecture designed to preserve and leverage the powerful generalization capabilities of a pretrained vision encoder from a Vision-Language Model (VLM). By keeping the encoder's weights frozen, our approach directly transfers the rich, generalized world knowledge from the VLM to the driving task. Our model architecture combines this frozen encoder with a transformer-based adapter for multimodal fusion and a GRU-based decoder for smooth waypoint generation. Furthermore, we introduce a custom loss function designed to directly optimize for Rater Feedback Score (RFS), a metric that prioritizes robust trajectory planning. We conduct extensive experiments on the Waymo Open E2E Dataset, a large-scale dataset deliberately curated to capture long-tail scenarios, demonstrating that our frozen-encoder approach significantly outperforms models that employ full fine-tuning. Our results provide substantial evidence that preserving the broad knowledge of a capable VLM is a more effective strategy for achieving robust, generalizable driving performance than intensive domain-specific adaptation. This offers a new pathway for developing vision-based models that can better handle the complexities of real-world application domains.

[125] Experimental Comparison of Light-Weight and Deep CNN Models Across Diverse Datasets

Md. Hefzul Hossain Papon,Shadman Rabby

Main category: cs.CV

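TL;DR: A study showing that well-regularized shallow architectures perform strongly across many domains without large GPUs or pretrained models, making them suitable for practical deployment in low-resource settings.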

Details

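Motivation: To explore efficient vision solutions for low-resource settings that need neither complex models nor heavy hardware. Method: Uses well-regularized shallow convolutional neural networks and establishes a unified, reproducible benchmark over multiple Bangladeshi vision datasets. Result: The shallow architecture performs strongly across heterogeneous domains such as smart-city surveillance and crop variety classification. Conclusion: Lightweight CNNs have substantial practical value in low-resource settings and make strong baselines. Abstract: Our results reveal that a well-regularized shallow architecture can serve as a highly competitive baseline across heterogeneous domains - from smart-city surveillance to agricultural variety classification - without requiring large GPUs or specialized pre-trained models. This work establishes a unified, reproducible benchmark for multiple Bangladeshi vision datasets and highlights the practical value of lightweight CNNs for real-world deployment in low-resource settings.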

[126] Latent Geometry of Taste: Scalable Low-Rank Matrix Factorization

Joshua Salako

Main category: cs.CV

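TL;DR: This paper studies scalability and data sparsity in collaborative filtering on large-scale interaction data. Using the MovieLens 32M dataset, it builds a high-performance parallelized ALS framework, finds that low-rank models generalize better than high-dimensional ones, shows through embedding-space visualization that semantic genre clusters emerge on their own, and proposes a tunable parameter to handle cold start and popularity bias.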

Details

Motivation: To address the scalability and recommendation-quality bottlenecks of collaborative filtering on large, sparse data. Method: Performs matrix factorization with parallelized Alternating Least Squares (ALS), optimizes the hyperparameters, and visualizes the structure of the user-item embedding space. Result: Low-rank constrained models do better on RMSE and ranking precision; semantically related genre clusters form naturally in the embedding space; and in cold-start scenarios a tunable scoring parameter balances personalization against popularity bias. Conclusion: Starting from interaction data alone, low-rank ALS models efficiently capture the deep structure of user preferences and show good practical potential, with particular flexibility and effectiveness under data sparsity and cold start. Abstract: Scalability and data sparsity remain critical bottlenecks for collaborative filtering on massive interaction datasets. This work investigates the latent geometry of user preferences using the MovieLens 32M dataset, implementing a high-performance, parallelized Alternating Least Squares (ALS) framework. Through extensive hyperparameter optimization, we demonstrate that constrained low-rank models significantly outperform higher-dimensional counterparts in generalization, achieving an optimal balance between Root Mean Square Error (RMSE) and ranking precision. We visualize the learned embedding space to reveal the unsupervised emergence of semantic genre clusters, confirming that the model captures deep structural relationships solely from interaction data. Finally, we validate the system's practical utility in a cold-start scenario, introducing a tunable scoring parameter to manage the trade-off between popularity bias and personalized affinity effectively. The codebase for this research can be found here: https://github.com/joshsalako/recommender.git
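
To make the alternating updates concrete, here is a minimal rank-1 ALS sketch on a toy ratings dictionary. It is not the authors' parallelized implementation; the function names, the regularization value, and the toy data are all assumptions chosen for illustration. With rank 1, each user or item update has a scalar closed form, which keeps the example free of linear-algebra dependencies.

```python
# Minimal rank-1 ALS on a tiny ratings dict; illustrative only, not the
# authors' parallelized framework. Minimizes
#   sum (r_ij - u_i * v_j)^2 + reg * (sum u_i^2 + sum v_j^2).

def als_rank1(ratings, n_users, n_items, reg=0.1, n_iters=20):
    """ratings: dict {(user, item): rating}. Returns (user_factors, item_factors)."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(n_iters):
        # Fix item factors; each user's regularized least squares has a closed form.
        for i in range(n_users):
            num = sum(r * v[j] for (a, j), r in ratings.items() if a == i)
            den = reg + sum(v[j] ** 2 for (a, j), _r in ratings.items() if a == i)
            u[i] = num / den
        # Fix user factors; update each item factor the same way.
        for j in range(n_items):
            num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
            den = reg + sum(u[i] ** 2 for (i, b), _r in ratings.items() if b == j)
            v[j] = num / den
    return u, v

def rmse(ratings, u, v):
    """Root mean square error over the observed ratings."""
    errs = [(r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items()]
    return (sum(errs) / len(errs)) ** 0.5

ratings = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1, (2, 1): 2, (2, 2): 1}
rmse_before = rmse(ratings, [1.0] * 3, [1.0] * 3)
u, v = als_rank1(ratings, n_users=3, n_items=3)
rmse_after = rmse(ratings, u, v)
```

Each half-step solves its block exactly, so the regularized objective never increases; a higher-rank version replaces the scalar division with a small linear solve per user or item.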

[127] ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

Hengjia Li,Liming Jiang,Qing Yan,Yizhi Song,Hao Kang,Zichuan Liu,Xin Lu,Boxi Wu,Deng Cai

Main category: cs.CV

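TL;DR: This paper proposes ThinkRL-Edit, a reasoning-centric reinforcement learning framework that decouples visual reasoning from image generation and introduces chain-of-thought reasoning sampling and an unbiased reward grouping strategy, significantly improving edit quality on instruction-driven image editing tasks that require complex semantic reasoning.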

Details

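Motivation: Existing unified multimodal generative models have limited visual reasoning ability in instruction-driven image editing and therefore perform poorly on edits that depend on complex reasoning. Reinforcement learning has been used to improve edit quality, but it still faces three challenges: limited reasoning exploration, biased reward fusion, and unstable instruction rewards from vision-language models (VLMs). Method: Proposes the ThinkRL-Edit framework: 1) before generation in the online sampling stage, it introduces Chain-of-Thought (CoT) reasoning sampling with planning and reflection stages, extending reasoning exploration beyond the denoising process; 2) it adopts an unbiased chain preference grouping strategy across multiple reward dimensions, avoiding the bias of weighted aggregation; 3) it replaces interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable reward signals. Result: Experiments show that ThinkRL-Edit significantly outperforms prior methods on reasoning-intensive image editing, producing edits that follow instructions more closely, are more visually coherent, and are more semantically plausible. Conclusion: By decoupling reasoning from generation, strengthening reasoning exploration, and improving the reward mechanism, ThinkRL-Edit effectively improves multimodal generative models on image editing driven by complex semantic reasoning, pointing toward visual generation systems with stronger reasoning. Abstract: Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.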
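
The abstract does not specify how the checklist is built or scored. As a rough sketch of the idea, a binary checklist reward could be the fraction of yes/no items an edit satisfies, which also makes failures directly inspectable. The checklist items and the function name below are hypothetical.

```python
# Hypothetical checklist reward: each item is a yes/no judgment about the edit,
# and the reward is the fraction of items that pass. The items are invented
# for illustration; the paper's actual checklist is not given in the abstract.

def checklist_reward(answers):
    """Fraction of yes/no checklist items the edit satisfies."""
    return sum(1 for ok in answers.values() if ok) / len(answers)

checklist = {
    "object_moved_to_table": True,
    "shadow_consistent": True,
    "background_unchanged": False,
    "instruction_fully_followed": True,
}

reward = checklist_reward(checklist)                   # 3 of 4 items pass
failed = [k for k, ok in checklist.items() if not ok]  # interpretable failure list
```

Compared with a single score on an interval, each item here is a narrow question with two possible answers, which is one plausible reason such rewards would vary less between judge calls.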

[128] Understanding Reward Hacking in Text-to-Image Reinforcement Learning

Yunqi Hong,Kuei-Chun Kao,Hengguang Zhou,Cho-Jui Hsieh

Main category: cs.CV

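TL;DR: This paper systematically analyzes reward hacking in text-to-image reinforcement learning and proposes a lightweight, adaptive artifact reward model that reduces reward exploitation and improves the realism of generated images.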

Details

Motivation: Existing reward functions are often imperfect proxies for human judgment, which makes models prone to reward hacking and to generating low-quality or unrealistic images. Method: Analyzes how aesthetic rewards and prompt-consistency rewards each contribute to reward hacking, then proposes a lightweight artifact reward model trained on a small annotated dataset and integrates it into existing RL pipelines as a regularizer. Result: Experiments show that adding the artifact reward significantly reduces reward hacking across multiple T2I RL setups and improves visual realism and generation quality. Conclusion: Lightweight reward augmentation can serve as an effective safeguard against reward hacking and makes RL-driven image generation models more robust. Abstract: Reinforcement learning (RL) has become a standard approach for post-training large language models and, more recently, for improving image generation models, using reward functions to enhance generation quality and human preference alignment. However, existing reward designs are often imperfect proxies for true human judgment, making models prone to reward hacking--producing unrealistic or low-quality images that nevertheless achieve high reward scores. In this work, we systematically analyze reward hacking behaviors in text-to-image (T2I) RL post-training. We investigate how both aesthetic/human preference rewards and prompt-image consistency rewards individually contribute to reward hacking and further show that ensembling multiple rewards can only partially mitigate this issue. Across diverse reward models, we identify a common failure mode: the generation of artifact-prone images. To address this, we propose a lightweight and adaptive artifact reward model, trained on a small curated dataset of artifact-free and artifact-containing samples. This model can be integrated into existing RL pipelines as an effective regularizer for commonly used reward models. Experiments demonstrate that incorporating our artifact reward significantly improves visual realism and reduces reward hacking across multiple T2I RL setups, demonstrating the effectiveness of lightweight reward augmentation as a safeguard against reward hacking.
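
How the artifact reward is combined with the base reward is not stated in the abstract. One simple possibility, assumed here purely for illustration, is a subtractive penalty proportional to the artifact model's predicted probability. The weight, the scores, and the names below are all made up.

```python
# Assumed combination rule (not confirmed by the abstract): subtract a penalty
# proportional to the artifact model's predicted probability of artifacts.

def regularized_reward(base_reward, artifact_prob, weight=0.5):
    """Base reward minus a weighted artifact penalty."""
    return base_reward - weight * artifact_prob

samples = [
    {"name": "clean", "base": 0.80, "artifact_prob": 0.05},
    {"name": "hacked", "base": 0.90, "artifact_prob": 0.95},
]

# Without the penalty, "hacked" wins on base reward; with it, "clean" ranks first.
ranked = sorted(
    samples,
    key=lambda s: regularized_reward(s["base"], s["artifact_prob"]),
    reverse=True,
)
```

The toy numbers show the intended effect: an image that games the base reward but is flagged as artifact-heavy ends up with a lower total than a clean image with a slightly lower base score.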

[129] CroBIM-U: Uncertainty-Driven Referring Remote Sensing Image Segmentation

Yuzhe Sun,Zhe Dong,Haochen Jiang,Tianzhu Liu,Yanfeng Gu

Main category: cs.CV

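TL;DR: Proposes an uncertainty-guided framework that uses a pixel-wise referring uncertainty map to adaptively segment targets described in natural language in remote sensing images, improving robustness and geometric accuracy in complex scenes.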

Details

Motivation: Existing methods for referring remote sensing image segmentation apply globally uniform fusion and refinement strategies, which cope poorly with the spatially non-uniform reliability of cross-modal alignment, especially in regions with large scale variation, dense distractors, and complex boundaries. Method: Designs a plug-and-play Referring Uncertainty Scorer (RUS), trained with online error-consistency supervision to produce a pixel-wise uncertainty map, and builds two modules on top of that map: Uncertainty-Gated Fusion (UGF), which dynamically adjusts language injection strength, and Uncertainty-Driven Local Refinement (UDLR), which focuses fine-grained refinement on high-uncertainty regions. Result: Experiments show that, as a unified plug-and-play solution, the method significantly improves segmentation performance and boundary accuracy in complex remote sensing scenes without changing the backbone network. Conclusion: Introducing a spatially aware uncertainty prior enables more reliable cross-modal alignment and adaptive inference, providing an effective and general framework for referring remote sensing image segmentation. Abstract: Referring remote sensing image segmentation aims to localize specific targets described by natural language within complex overhead imagery. However, due to extreme scale variations, dense similar distractors, and intricate boundary structures, the reliability of cross-modal alignment exhibits significant spatial non-uniformity. Existing methods typically employ uniform fusion and refinement strategies across the entire image, which often introduces unnecessary linguistic perturbations in visually clear regions while failing to provide sufficient disambiguation in confused areas. To address this, we propose an uncertainty-guided framework that explicitly leverages a pixel-wise referring uncertainty map as a spatial prior to orchestrate adaptive inference. Specifically, we introduce a plug-and-play Referring Uncertainty Scorer (RUS), which is trained via an online error-consistency supervision strategy to interpretably predict the spatial distribution of referential ambiguity. Building on this prior, we design two plug-and-play modules: 1) Uncertainty-Gated Fusion (UGF), which dynamically modulates language injection strength to enhance constraints in high-uncertainty regions while suppressing noise in low-uncertainty ones; and 2) Uncertainty-Driven Local Refinement (UDLR), which utilizes uncertainty-derived soft masks to focus refinement on error-prone boundaries and fine details. Extensive experiments demonstrate that our method functions as a unified, plug-and-play solution that significantly improves robustness and geometric fidelity in complex remote sensing scenes without altering the backbone architecture.
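
The gating idea behind UGF can be sketched as a per-pixel blend in which the uncertainty value scales how much language signal is added to the visual feature. The additive form and the clamping below are assumptions; the abstract only says that language injection strength is modulated by uncertainty.

```python
# Assumed form of uncertainty-gated fusion: fused = visual + gate * language,
# where the gate is the per-pixel uncertainty clamped to [0, 1].

def gated_fusion(visual, language, uncertainty):
    """Per-pixel fusion of feature vectors, scaled by the uncertainty gate."""
    fused = []
    for v_px, l_px, u in zip(visual, language, uncertainty):
        gate = min(max(u, 0.0), 1.0)  # clamp uncertainty to [0, 1]
        fused.append([v + gate * l for v, l in zip(v_px, l_px)])
    return fused

# Two pixels with identical features but different uncertainty.
visual = [[1.0, 2.0], [1.0, 2.0]]
language = [[0.5, -0.5], [0.5, -0.5]]
uncertainty = [0.0, 1.0]

fused = gated_fusion(visual, language, uncertainty)
# Pixel 0 (certain) keeps its visual feature; pixel 1 (uncertain) gets full language injection.
```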

[130] SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models

Yuxuan Xia,Siheng Wang,Peng Li

Main category: cs.CV

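TL;DR: This paper proposes SDCD, a training-free algorithm that suppresses object hallucination in large vision-language models through structure-disrupted contrastive decoding, improving multimodal understanding.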

Details

Motivation: Large vision-language models have made notable progress in multimodal understanding and reasoning but still suffer from object hallucination. Existing research overlooks the internal complexity of the visual encoding process, in particular the effect of visual statistical bias. Method: Proposes the Structure-Disrupted Contrastive Decoding (SDCD) algorithm, which contrastively calibrates the output distribution using a shuffled, structure-disrupted view and penalizes tokens that stay highly confident under the structure-less view, thereby suppressing texture-driven bias. Result: Experiments show that SDCD significantly reduces hallucination on multiple benchmarks and strengthens the overall multimodal capabilities of LVLMs. Conclusion: SDCD effectively mitigates object hallucination caused by the Bag-of-Patches behavior of vision encoders and offers a new angle for improving multimodal models. Abstract: Large Vision-Language Models (LVLMs) demonstrate significant progress in multimodal understanding and reasoning, yet object hallucination remains a critical challenge. While existing research focuses on mitigating language priors or high-level statistical biases, it often overlooks the internal complexities of the visual encoding process. We identify that visual statistical bias, arising from the inherent Bag-of-Patches behavior of Vision Encoders under weak structural supervision, acts as a contributing factor of object hallucinations. Under this bias, models prioritize local texture features within individual patches over holistic geometric structures. This tendency may induce spurious visual confidence and result in hallucinations. To address this, we introduce a training-free algorithm called Structure-Disrupted Contrastive Decoding (SDCD), which performs contrastive calibration of the output distribution by introducing a shuffled structure-disrupted view. By penalizing tokens that maintain high confidence under this structure-less view, SDCD effectively suppresses the texture-driven bias. Experimental results demonstrate that SDCD significantly mitigates hallucinations across multiple benchmarks and enhances the overall multimodal capabilities of LVLMs.
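
The exact calibration rule is not given in the abstract. A common form for contrastive decoding, assumed here, combines the two sets of next-token logits as (1 + alpha) * original - alpha * disrupted, so tokens that remain likely even when image structure is destroyed get pushed down. The vocabulary, logits, and alpha value are toy values.

```python
import math

# Assumed contrastive rule (a common contrastive-decoding form, not confirmed
# as SDCD's exact formula): (1 + alpha) * original - alpha * disrupted logits.

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(logits_orig, logits_shuffled, alpha=1.0):
    """Penalize tokens that stay confident under the structure-disrupted view."""
    return [(1 + alpha) * o - alpha * s for o, s in zip(logits_orig, logits_shuffled)]

vocab = ["dog", "grass", "frisbee"]
logits_orig = [2.0, 1.0, 1.8]      # conditioned on the intact image
logits_shuffled = [0.5, 1.0, 1.9]  # conditioned on the patch-shuffled image

adjusted = contrastive_logits(logits_orig, logits_shuffled)
probs = softmax(adjusted)
best = vocab[probs.index(max(probs))]
```

In this toy case "frisbee" is almost as likely with shuffled patches as with the intact image, which suggests texture-driven confidence, so its probability drops after calibration while "dog", which depends on structure, gains.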

[131] REFA: Real-time Egocentric Facial Animations for Virtual Reality

Qiang Zhang,Tong Xiao,Haroun Habeeb,Larissa Laich,Sofien Bouaziz,Patrick Snape,Wenjing Zhang,Matthew Cioffi,Peizhao Zhang,Pavel Pidlypenskyi,Winnie Lin,Luming Ma,Mengjiao Wang,Kunpeng Li,Chengjiang Long,Steven Song,Martin Prazak,Alexander Sjoholm,Ajinkya Deogade,Jaebong Lee,Julio Delgado Mangas,Amaury Aubel

Main category: cs.CV

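TL;DR: Presents a real-time facial expression tracking system built on a VR headset, using infrared cameras and distillation-based learning to drive expressions non-intrusively and without calibration.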

Details

Motivation: To capture facial expressions in virtual environments naturally, in real time, and without complex calibration, improving how users express themselves and interact in VR. Method: Uses infrared cameras embedded in the VR headset to capture egocentric images, combines synthetic and real data from multiple sources, trains a machine learning model through knowledge distillation, and builds a differentiable rendering pipeline to generate expression labels automatically. Result: The authors collected a dataset covering 18k diverse individuals and achieved accurate real-time facial expression tracking with no user calibration, suitable for video conferencing, gaming, and similar scenarios. Conclusion: The system offers a practical, scalable solution for natural expression-driven interaction in virtual environments and advances non-intrusive expression capture. Abstract: We present a novel system for real-time tracking of facial expressions using egocentric views captured from a set of infrared cameras embedded in a virtual reality (VR) headset. Our technology enables any user to accurately drive the facial expressions of virtual characters in a non-intrusive manner and without the need for a lengthy calibration step. At the core of our system is a distillation-based approach to train a machine learning model on heterogeneous data and labels coming from multiple sources, e.g., synthetic and real images. As part of our dataset, we collected data from 18k diverse subjects using a lightweight capture setup consisting of a mobile phone and a custom VR headset with extra cameras. To process this data, we developed a robust differentiable rendering pipeline enabling us to automatically extract facial expression labels. Our system opens up new avenues for communication and expression in virtual environments, with applications in video conferencing, gaming, entertainment, and remote collaboration.

[132] G2P: Gaussian-to-Point Attribute Alignment for Boundary-Aware 3D Semantic Segmentation

Hojun Song,Chae-yeong Song,Jeong-hun Hong,Chaewon Moon,Dong-hwi Kim,Gahyeon Kim,Soo Ye Kim,Yiyi Liao,Jaehyup Lee,Sang-hyo Park

Main category: cs.CV

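TL;DR: Proposes G2P, which transfers appearance-aware attributes from 3D Gaussian Splatting into point cloud semantic segmentation, improving segmentation in geometrically complex scenes.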

Details

Motivation: Point clouds are sparse and irregular, so geometric features alone struggle to separate objects with similar shapes but different appearances. Method: Establishes point-wise correspondences to transfer the opacity and scale attributes of 3D Gaussian Splatting onto the point cloud, resolving geometric ambiguity and alignment problems. Result: Achieves strong performance on standard benchmarks, with notable gains on geometrically complex classes, without any 2D or language supervision. Conclusion: G2P effectively integrates appearance information and improves both the discriminative power and the boundary localization accuracy of point cloud semantic segmentation. Abstract: Semantic segmentation on point clouds is critical for 3D scene understanding. However, sparse and irregular point distributions provide limited appearance evidence, making geometry-only features insufficient to distinguish objects with similar shapes but distinct appearances (e.g., color, texture, material). We propose Gaussian-to-Point (G2P), which transfers appearance-aware attributes from 3D Gaussian Splatting to point clouds for more discriminative and appearance-consistent segmentation. G2P addresses the misalignment between optimized Gaussians and original point geometry by establishing point-wise correspondences. By leveraging Gaussian opacity attributes, we resolve the geometric ambiguity that limits existing models. Additionally, Gaussian scale attributes enable precise boundary localization in complex 3D scenes. Extensive experiments demonstrate that our approach achieves superior performance on standard benchmarks and shows significant improvements on geometrically challenging classes, all without any 2D or language supervision.
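
The abstract does not describe how point-wise correspondences are established. A minimal sketch, assuming a plain nearest-center match (which may well differ from the paper's procedure), attaches each point's closest Gaussian's opacity and scale as extra per-point features. The data layout and names are hypothetical.

```python
# Assumed correspondence rule: each point takes attributes from the Gaussian
# with the nearest center (brute force). The paper's actual matching is not
# described in the abstract.

def nearest_gaussian(point, gaussians):
    """Return the Gaussian whose center is closest to the point."""
    return min(
        gaussians,
        key=lambda g: sum((p - c) ** 2 for p, c in zip(point, g["center"])),
    )

def transfer_attributes(points, gaussians):
    """Attach nearest-Gaussian opacity and scale to every point."""
    out = []
    for p in points:
        g = nearest_gaussian(p, gaussians)
        out.append({"xyz": p, "opacity": g["opacity"], "scale": g["scale"]})
    return out

gaussians = [
    {"center": (0.0, 0.0, 0.0), "opacity": 0.9, "scale": 0.05},
    {"center": (1.0, 0.0, 0.0), "opacity": 0.2, "scale": 0.40},
]
points = [(0.1, 0.0, 0.0), (0.8, 0.1, 0.0)]

augmented = transfer_attributes(points, gaussians)
```

For real scenes a spatial index such as a k-d tree would replace the brute-force search, but the resulting per-point feature layout would be the same.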

[133] Semantic Belief-State World Model for 3D Human Motion Prediction

Sarim Chaudhry

Main category: cs.CV

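TL;DR: Proposes the Semantic Belief-State World Model (SBWM), which recasts human motion prediction as latent dynamical simulation on the human body manifold and, by separating observation reconstruction from dynamics modeling, achieves stable long-horizon motion prediction at low computational cost.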

Details

Motivation: Traditional methods treat motion prediction as sequence regression and cannot separate observation reconstruction from dynamics modeling, which leads to drift, mean-pose collapse, and poorly calibrated uncertainty in long-term prediction. Method: Proposes SBWM, which maintains a recurrent probabilistic belief state whose evolution is independent of pose reconstruction and explicitly aligned with the SMPL-X anatomical parameterization; it uses stochastic latent transitions and rollout-centric training that targets stable forward simulation rather than reconstruction accuracy. Result: Achieves coherent long-horizon rollouts with competitive prediction accuracy at substantially lower computational cost. Conclusion: Treating the human body as part of the world model's state space rather than as its output fundamentally changes how motion is simulated and predicted. Abstract: Human motion prediction has traditionally been framed as a sequence regression problem where models extrapolate future joint coordinates from observed pose histories. While effective over short horizons, this approach does not separate observation reconstruction from dynamics modeling and offers no explicit representation of the latent causes governing motion. As a result, existing methods exhibit compounding drift, mean-pose collapse, and poorly calibrated uncertainty when rolled forward beyond the training regime. Here we propose a Semantic Belief-State World Model (SBWM) that reframes human motion prediction as latent dynamical simulation on the human body manifold. Rather than predicting poses directly, SBWM maintains a recurrent probabilistic belief state whose evolution is learned independently of pose reconstruction and explicitly aligned with the SMPL-X anatomical parameterization. This alignment imposes a structural information bottleneck that prevents the latent state from encoding static geometry or sensor noise, forcing it to capture motion dynamics, intent, and control-relevant structure. Inspired by belief-state world models developed for model-based reinforcement learning, SBWM adapts stochastic latent transitions and rollout-centric training to the domain of human motion. In contrast to RSSM-based, transformer, and diffusion approaches optimized for reconstruction fidelity, SBWM prioritizes stable forward simulation. We demonstrate coherent long-horizon rollouts and competitive accuracy at substantially lower computational cost. These results suggest that treating the human body as part of the world model's state space rather than its output fundamentally changes how motion is simulated and predicted.

[134] Physics-Constrained Cross-Resolution Enhancement Network for Optics-Guided Thermal UAV Image Super-Resolution

Zhicheng Zhao,Fengjiao Peng,Jinquan Yan,Wei Lu,Chenglong Li,Jin Tang

Main category: cs.CV

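TL;DR: This paper proposes PCNet, which fuses optical and thermal modalities efficiently through cross-resolution mutual enhancement and a physics-guided heat conduction mechanism, improving thermal UAV image super-resolution.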

Details

Motivation: Existing optics-guided thermal image super-resolution methods lose high-frequency information and introduce physical inconsistencies between modalities, such as texture distortion and edge blurring, so a more robust cross-modal alignment method is needed. Method: Proposes PCNet, which includes a Cross-Resolution Mutual Enhancement Module (CRME) for bidirectional feature optimization between the two modalities, a Physics-Driven Thermal Conduction Module (PDTM) that incorporates a two-dimensional heat conduction model, and a temperature consistency loss that keeps generated results consistent with real thermal radiation behavior. Result: Experiments on the VGTSR2.0 and DroneVehicle datasets show that PCNet outperforms existing state-of-the-art methods in image reconstruction quality and in downstream tasks such as semantic segmentation and object detection. Conclusion: Through physical constraints and cross-resolution mutual enhancement, PCNet improves the accuracy and realism of thermal image super-resolution and shows good application potential. Abstract: Optics-guided thermal UAV image super-resolution has attracted significant research interest due to its potential in all-weather monitoring applications. However, existing methods typically compress optical features to match thermal feature dimensions for cross-modal alignment and fusion, which not only causes the loss of high-frequency information that is beneficial for thermal super-resolution, but also introduces physically inconsistent artifacts such as texture distortions and edge blurring by overlooking differences in the imaging physics between modalities. To address these challenges, we propose PCNet to achieve cross-resolution mutual enhancement between optical and thermal modalities, while physically constraining the optical guidance process via thermal conduction to enable robust thermal UAV image super-resolution. In particular, we design a Cross-Resolution Mutual Enhancement Module (CRME) to jointly optimize thermal image super-resolution and optical-to-thermal modality conversion, facilitating effective bidirectional feature interaction across resolutions while preserving high-frequency optical priors. Moreover, we propose a Physics-Driven Thermal Conduction Module (PDTM) that incorporates two-dimensional heat conduction into optical guidance, modeling spatially varying heat conduction properties to prevent inconsistent artifacts. In addition, we introduce a temperature consistency loss that enforces regional distribution consistency and boundary gradient smoothness to ensure generated thermal images align with real-world thermal radiation principles. Extensive experiments on the VGTSR2.0 and DroneVehicle datasets demonstrate that PCNet significantly outperforms state-of-the-art methods on both reconstruction quality and downstream tasks including semantic segmentation and object detection.
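
For readers unfamiliar with two-dimensional heat conduction, the sketch below runs one explicit finite-difference step of dT/dt = k * laplacian(T) on a small grid with a per-pixel conductivity map. This is a textbook discretization, not the PDTM module itself; how PCNet embeds conduction into its features is not described in the abstract. The simplification k * laplacian(T) (rather than the full divergence form for varying k) and the step size are assumptions.

```python
# Textbook explicit step for 2D heat conduction with replicated (insulated)
# borders. Uses the simplified form k * laplacian(T); stable for dt * k <= 0.25.
# This is not the paper's PDTM module, only the underlying physical model.

def heat_step(temp, conductivity, dt=0.2):
    """Advance the temperature grid by one explicit finite-difference step."""
    h, w = len(temp), len(temp[0])
    new = [row[:] for row in temp]
    for i in range(h):
        for j in range(w):
            up = temp[max(i - 1, 0)][j]
            down = temp[min(i + 1, h - 1)][j]
            left = temp[i][max(j - 1, 0)]
            right = temp[i][min(j + 1, w - 1)]
            lap = up + down + left + right - 4.0 * temp[i][j]
            new[i][j] = temp[i][j] + dt * conductivity[i][j] * lap
    return new

# A single hot pixel in the center of a cold 3x3 grid, uniform conductivity.
temp = [[0.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 0.0]]
k = [[1.0] * 3 for _ in range(3)]

after = heat_step(temp, k)
total_before = sum(sum(row) for row in temp)
total_after = sum(sum(row) for row in after)
```

With uniform conductivity the hot pixel spreads evenly to its four neighbors and the total heat is conserved, which is the kind of smooth, physically plausible behavior a conduction prior encourages in place of sharp optical textures.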

[135] CloudMatch: Weak-to-Strong Consistency Learning for Semi-Supervised Cloud Detection

Jiayi Zhao,Changlu Chen,Jingsheng Li,Tianxiang Xue,Kun Zhan

Main category: cs.CV

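TL;DR: Proposes CloudMatch, a semi-supervised framework that uses view-consistency learning and scene-mixing augmentation to make effective use of unlabeled remote sensing images for cloud detection.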

Details

Motivation: Accurate pixel-level labels are expensive to annotate, so cloud detection needs a semi-supervised method that uses unlabeled data effectively. Method: CloudMatch generates one weakly augmented view and two strongly augmented views (inter-scene and intra-scene mixing) to help the model learn the structural diversity and contextual richness of cloud patterns, and trains with a prediction-consistency constraint and pseudo-labels. Result: Extensive experiments show that CloudMatch performs well across multiple settings, uses unlabeled data efficiently, and significantly improves semi-supervised cloud detection. Conclusion: CloudMatch provides an effective framework for semi-supervised cloud detection and improves generalization through diverse augmentation strategies. Abstract: Due to the high cost of annotating accurate pixel-level labels, semi-supervised learning has emerged as a promising approach for cloud detection. In this paper, we propose CloudMatch, a semi-supervised framework that effectively leverages unlabeled remote sensing imagery through view-consistency learning combined with scene-mixing augmentations. An observation behind CloudMatch is that cloud patterns exhibit structural diversity and contextual variability across different scenes and within the same scene category. Our key insight is that enforcing prediction consistency across diversely augmented views, incorporating both inter-scene and intra-scene mixing, enables the model to capture the structural diversity and contextual richness of cloud patterns. Specifically, CloudMatch generates one weakly augmented view along with two complementary strongly augmented views for each unlabeled image: one integrates inter-scene patches to simulate contextual variety, while the other employs intra-scene mixing to preserve semantic coherence. This approach guides pseudo-label generation and enhances generalization. Extensive experiments show that CloudMatch achieves good performance, demonstrating its capability to utilize unlabeled data efficiently and advance semi-supervised cloud detection.
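
Weak-to-strong consistency is often implemented in the FixMatch style: confident predictions on the weak view become pseudo-labels that supervise the strong views. The sketch below assumes that style, a binary cloud/no-cloud output per pixel, and a confidence threshold of 0.9; none of these details are confirmed by the abstract, and the probabilities are toy values.

```python
import math

# FixMatch-style consistency (assumed, not confirmed as CloudMatch's exact
# loss): confident weak-view predictions supervise the strong views.

def consistency_loss(weak_probs, strong_probs, threshold=0.9):
    """Mean binary cross-entropy on pixels whose weak prediction is confident."""
    total, count = 0.0, 0
    for pw, ps in zip(weak_probs, strong_probs):
        if max(pw, 1.0 - pw) < threshold:
            continue  # skip pixels where the weak prediction is not confident
        label = 1.0 if pw >= 0.5 else 0.0       # hard pseudo-label
        ps = min(max(ps, 1e-7), 1.0 - 1e-7)     # avoid log(0)
        total += -(label * math.log(ps) + (1.0 - label) * math.log(1.0 - ps))
        count += 1
    return total / count if count else 0.0

weak = [0.95, 0.60, 0.02]          # per-pixel cloud probability, weak view
strong_inter = [0.70, 0.50, 0.10]  # strong view with inter-scene mixing
strong_intra = [0.90, 0.40, 0.05]  # strong view with intra-scene mixing

loss_inter = consistency_loss(weak, strong_inter)
loss_intra = consistency_loss(weak, strong_intra)
loss_total = 0.5 * (loss_inter + loss_intra)  # equal weighting is an assumption
```

The middle pixel (0.60) is below the threshold and contributes nothing, so only the confident pixels shape the gradient on the strong views.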

[136] EASLT: Emotion-Aware Sign Language Translation

Guobin Tu,Di Weng

Main category: cs.CV

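TL;DR: This paper proposes EASLT, an emotion-aware sign language translation framework that uses a dedicated emotion encoder and an emotion-aware fusion module, treating facial expressions as semantic anchors to resolve semantic ambiguity in gloss-free sign language translation.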

Details

Motivation: Existing gloss-free sign language translation methods usually ignore the semantic importance of facial expressions, so identical gestures become ambiguous under different emotions; non-manual signals, emotion in particular, need to be brought into translation to improve accuracy. Method: Proposes the EASLT framework, which includes a dedicated emotion encoder that captures continuous affective dynamics and an Emotion-Aware Fusion (EAF) module that adaptively adjusts spatio-temporal sign features according to affective context. Result: Reaches leading performance on the PHOENIX14T and CSL-Daily datasets, with BLEU-4 of 26.15 and 22.80 and BLEURT of 61.0 and 57.8, respectively; ablations show that explicitly modeling emotion decouples affective semantics from manual motion and significantly improves translation fidelity. Conclusion: Treating facial affect as a core semantic component rather than auxiliary information significantly improves the accuracy and robustness of gloss-free sign language translation, and EASLT offers an effective way to do so. Abstract: Sign Language Translation (SLT) is a complex cross-modal task requiring the integration of Manual Signals (MS) and Non-Manual Signals (NMS). While recent gloss-free SLT methods have made strides in translating manual gestures, they frequently overlook the semantic criticality of facial expressions, resulting in ambiguity when distinct concepts share identical manual articulations. To address this, we present EASLT (Emotion-Aware Sign Language Translation), a framework that treats facial affect not as auxiliary information, but as a robust semantic anchor. Unlike methods that relegate facial expressions to a secondary role, EASLT incorporates a dedicated emotional encoder to capture continuous affective dynamics. These representations are integrated via a novel Emotion-Aware Fusion (EAF) module, which adaptively recalibrates spatio-temporal sign features based on affective context to resolve semantic ambiguities. Extensive evaluations on the PHOENIX14T and CSL-Daily benchmarks demonstrate that EASLT establishes advanced performance among gloss-free methods, achieving BLEU-4 scores of 26.15 and 22.80, and BLEURT scores of 61.0 and 57.8, respectively. Ablation studies confirm that explicitly modeling emotion effectively decouples affective semantics from manual dynamics, significantly enhancing translation fidelity. Code is available at https://github.com/TuGuobin/EASLT.

Tianyi Shang,Pengjie Xu,Zhaojun Deng,Zhenyu Li,Zhicong Chen,Lijun Wu

Main category: cs.CV

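TL;DR: SpatiaLoc is a framework for cross-modal localization between text and point clouds that uses a coarse-to-fine strategy, emphasizes spatial relationship modeling at the instance and global levels, and significantly outperforms existing methods on KITTI360Pose.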

Details

Motivation: Because objects recur frequently across text and point clouds, spatial relationships become the most discriminative cue for localization, so a method that exploits this spatial information effectively is needed. Method: Proposes the SpatiaLoc framework: in the coarse stage, a Bezier Enhanced Object Spatial Encoder (BEOSE) models instance-level spatial relationships and a Frequency Aware Encoder (FAE) extracts global spatial representations in the frequency domain; in the fine stage, an Uncertainty Aware Gaussian Fine Localizer (UGFL) models predictions as Gaussian distributions and regresses 2D positions. Result: Extensive experiments on the KITTI360Pose dataset show that SpatiaLoc significantly outperforms existing state-of-the-art methods. Conclusion: By explicitly modeling spatial relationships at multiple levels, SpatiaLoc improves cross-modal localization from natural language descriptions and has potential applications in autonomous navigation and human-robot interaction. Abstract: Cross-modal localization using text and point clouds enables robots to localize themselves via natural language descriptions, with applications in autonomous navigation and interaction between humans and robots. In this task, objects often recur across text and point clouds, making spatial relationships the most discriminative cues for localization. Given this characteristic, we present SpatiaLoc, a framework utilizing a coarse-to-fine strategy that emphasizes spatial relationships at both the instance and global levels. In the coarse stage, we introduce a Bezier Enhanced Object Spatial Encoder (BEOSE) that models spatial relationships at the instance level using quadratic Bezier curves. Additionally, a Frequency Aware Encoder (FAE) generates spatial representations in the frequency domain at the global level. In the fine stage, an Uncertainty Aware Gaussian Fine Localizer (UGFL) regresses 2D positions by modeling predictions as Gaussian distributions with a loss function aware of uncertainty. Extensive experiments on KITTI360Pose demonstrate that SpatiaLoc significantly outperforms existing state-of-the-art (SOTA) methods.
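
An uncertainty-aware regression loss of the kind the fine stage describes is commonly the Gaussian negative log-likelihood. The sketch below assumes an isotropic 2D Gaussian with a single predicted standard deviation; the abstract does not state UGFL's exact parameterization, so treat the form and the names as illustrative.

```python
import math

# Assumed loss form: negative log-likelihood of a 2D target under an isotropic
# Gaussian N(pred_mean, sigma^2 * I). UGFL's exact parameterization is not
# given in the abstract.

def gaussian_nll_2d(pred_mean, pred_sigma, target):
    """-log p(target) = ||t - m||^2 / (2 sigma^2) + 2 log(sigma) + log(2 pi)."""
    sq_err = sum((t - m) ** 2 for t, m in zip(target, pred_mean))
    return (sq_err / (2.0 * pred_sigma ** 2)
            + 2.0 * math.log(pred_sigma)
            + math.log(2.0 * math.pi))

target = (10.0, 20.0)
far_pred = (7.0, 16.0)    # 5 units away from the target
near_pred = (10.0, 20.0)  # exactly on the target

# A poor prediction is penalized less when it admits high uncertainty ...
nll_far_confident = gaussian_nll_2d(far_pred, 1.0, target)
nll_far_uncertain = gaussian_nll_2d(far_pred, 3.5, target)
# ... while an accurate prediction is rewarded for being confident.
nll_near_confident = gaussian_nll_2d(near_pred, 1.0, target)
nll_near_uncertain = gaussian_nll_2d(near_pred, 3.5, target)
```

The log-sigma term stops the model from claiming large uncertainty everywhere, so the predicted spread ends up tracking the actual error.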

[138] Detecting AI-Generated Images via Distributional Deviations from Real Images

Yakun Niu,Yingjian Chen,Lei Zhang

Main category: cs.CV

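TL;DR: This paper proposes a Masking-based Pre-trained model Fine-Tuning (MPFT) method that introduces a Texture-Aware Masking (TAM) mechanism to improve the generalization of CLIP-ViT for AI-generated image detection, significantly outperforming existing methods with only a few fine-tuning images.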

Details

Motivation: Existing AI-generated image detectors built on frozen CLIP models do not fully exploit the image encoder and lack a genuine ability to tell real from fake images, so a new method with better generalization is needed. Method: Proposes the Masking-based Pre-trained model Fine-Tuning (MPFT) strategy with a Texture-Aware Masking (TAM) mechanism, which masks regions containing generator-specific textures during fine-tuning and forces CLIP-ViT to attend to distributional deviations from real images, improving detection. Result: On the GenImage and UniversalFakeDetect datasets, fine-tuning with very few images yields average accuracies of up to 98.2% and 94.6%, respectively, significantly better than existing methods. Conclusion: Through in-depth analysis of how CLIP-ViT behaves in feature space, MPFT improves generalization to unseen generative models and confirms that distributional deviation is a viable and strong basis for AI-generated image detection. Abstract: The rapid advancement of generative models has significantly enhanced the quality of AI-generated images, raising concerns about misinformation and the erosion of public trust. Detecting AI-generated images has thus become a critical challenge, particularly in terms of generalizing to unseen generative models. Existing methods using frozen pre-trained CLIP models show promise in generalization but treat the image encoder as a basic feature extractor, failing to fully exploit its potential. In this paper, we perform an in-depth analysis of the frozen CLIP image encoder (CLIP-ViT), revealing that it effectively clusters real images in a high-level, abstract feature space. However, it does not truly possess the ability to distinguish between real and AI-generated images. Based on this analysis, we propose a Masking-based Pre-trained model Fine-Tuning (MPFT) strategy, which introduces a Texture-Aware Masking (TAM) mechanism to mask textured areas containing generative model-specific patterns during fine-tuning. This approach compels CLIP-ViT to attend to the "distributional deviations" from authentic images for AI-generated image detection, thereby achieving enhanced generalization performance. Extensive experiments on the GenImage and UniversalFakeDetect datasets demonstrate that our method, fine-tuned with only a minimal number of images, significantly outperforms existing approaches, achieving up to 98.2% and 94.6% average accuracy on the two datasets, respectively.
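
The abstract does not define how TAM decides which areas count as textured. As a rough illustration only, the sketch below uses per-patch intensity variance as a stand-in texture score and zeroes out the highest-scoring patches. The variance proxy, the patch size, and the masking ratio are all assumptions, not the paper's procedure.

```python
# Stand-in for texture-aware masking: patch variance is used as a texture
# proxy (an assumption; the paper's texture measure is not in the abstract),
# and the most textured patches are zeroed out.

def patch_variance(patch):
    """Variance of all pixel values in a 2D patch."""
    vals = [v for row in patch for v in row]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def texture_aware_mask(image, patch=2, ratio=0.25):
    """Zero out the highest-variance patches; returns (masked_image, masked_coords)."""
    h, w = len(image), len(image[0])
    scores = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            block = [row[j:j + patch] for row in image[i:i + patch]]
            scores.append((patch_variance(block), (i, j)))
    scores.sort(reverse=True)
    n_mask = max(1, int(len(scores) * ratio))
    chosen = [pos for _, pos in scores[:n_mask]]
    masked = [row[:] for row in image]
    for i, j in chosen:
        for di in range(patch):
            for dj in range(patch):
                if i + di < h and j + dj < w:
                    masked[i + di][j + dj] = 0
    return masked, chosen

image = [
    [5, 5, 0, 9],
    [5, 5, 9, 0],
    [5, 5, 5, 6],
    [5, 5, 6, 5],
]
masked, chosen = texture_aware_mask(image)
```

Only the checkerboard-like top-right patch is masked here, so a model fine-tuned on such inputs cannot lean on that local texture and has to use the remaining, smoother content.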

[139] Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions

Zhongbin Guo,Zhen Yang,Yushan Li,Xinyue Zhang,Wenyu Gao,Jiacheng Wang,Chengzhi Li,Xiangrui Liu,Ping Jian

Main category: cs.CV

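TL;DR: This paper introduces SiT-Bench, a new benchmark for evaluating the spatial intelligence (SI) of large language models (LLMs). It converts visual scenes into coordinate-aware text descriptions to test spatial reasoning without pixel input. The results reveal a "spatial gap" in global consistency for current models and show that explicit spatial reasoning substantially improves performance, suggesting that LLMs have latent world-modeling ability.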

Details

Motivation: To investigate whether spatial understanding comes from the visual encoder or from the reasoning backbone, and to test whether large language models show spatial intelligence without pixel-level input. Method: Builds the SiT-Bench benchmark with more than 3,800 expert-annotated items across 5 primary categories and 17 subtasks; converts single-view and multi-view scenes into high-fidelity, coordinate-aware text descriptions; and evaluates LLMs on symbolic textual reasoning tasks. Result: Current state-of-the-art LLMs do well on localized semantic tasks but show a significant "spatial gap" in global consistency; adding explicit spatial reasoning significantly improves performance. Conclusion: Large language models have latent spatial world-modeling ability, and explicit spatial reasoning is key to improving spatial intelligence; SiT-Bench provides a foundational resource for developing embodied agents and next-generation vision-language models. Abstract: Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input, comprising over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results for state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset SiT-Bench serves as a foundational resource to foster the development of spatially-grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at https://github.com/binisalegend/SiT-Bench .

[140] Adaptive Attention Distillation for Robust Few-Shot Segmentation under Environmental Perturbations

Qianyu Guo,Jingrong Wu,Jieji Ren,Weifeng Ge,Wenqiang Zhang

Main category: cs.CV

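TL;DR: This paper proposes a new environment-robust few-shot segmentation (ER-FSS) setting and builds a benchmark covering eight real-world datasets across multiple scenarios. It also proposes Adaptive Attention Distillation (AAD), which repeatedly contrasts and distills key semantics between support and query images to improve segmentation in complex environments.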

Details

Motivation: Existing few-shot segmentation methods are mostly trained under laboratory conditions and overlook complex real-world factors such as illumination, background, and viewpoint, so models perform poorly in practice; robustness in complex environments therefore needs to improve. Method: Introduce an environment-robust FSS setting and the ER-FSS benchmark, and design Adaptive Attention Distillation (AAD), which repeatedly contrasts and distills shared semantics between support and query images to produce class-specific attention for novel categories. Result: Experiments on eight real-world datasets show that AAD improves mIoU by 3.3% to 8.5% across all settings, with stronger generalization and robustness. Conclusion: The proposed ER-FSS setting and AAD method significantly improve few-shot segmentation in complex real-world environments and move the technique closer to practical deployment. Abstract: Few-shot segmentation (FSS) aims to rapidly learn novel class concepts from limited examples to segment specific targets in unseen images, and has been widely applied in areas such as medical diagnosis and industrial inspection. However, existing studies largely overlook the complex environmental factors encountered in real-world scenarios, such as illumination, background, and camera viewpoint, which can substantially increase the difficulty of test images. As a result, models trained under laboratory conditions often fall short of practical deployment requirements. To bridge this gap, in this paper, an environment-robust FSS setting is introduced that explicitly incorporates challenging test cases arising from complex environments, such as motion blur, small objects, and camouflaged targets, to enhance the model's robustness under realistic, dynamic conditions. An environment-robust FSS benchmark (ER-FSS) is established, covering eight datasets across multiple real-world scenarios. In addition, an Adaptive Attention Distillation (AAD) method is proposed, which repeatedly contrasts and distills key shared semantics between known (support) and unknown (query) images to derive class-specific attention for novel categories. This strengthens the model's ability to focus on the correct targets in complex environments, thereby improving environmental robustness. Comparative experiments show that AAD improves mIoU by 3.3% to 8.5% across all datasets and settings, demonstrating superior performance and strong generalization. The source code and dataset are available at: https://github.com/guoqianyu-alberta/Adaptive-Attention-Distillation-for-FSS.
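
For readers unfamiliar with the reported metric, the following is a minimal sketch of mean intersection-over-union (mIoU) on label masks. It assumes integer class labels and averages over classes present in either mask; the paper's exact evaluation protocol (for example, per-episode or foreground-only averaging) is not given in the abstract and may differ.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in either the prediction or the target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both masks; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))


# Toy 2x2 masks: class 0 has IoU 1/2, class 1 has IoU 2/3.
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
score = mean_iou(pred, target, num_classes=2)
```

On this toy pair the two per-class IoUs average to 7/12, so a gain of a few points of mIoU corresponds to a noticeable change in mask overlap.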

[141] Unveiling Text in Challenging Stone Inscriptions: A Character-Context-Aware Patching Strategy for Binarization

Pratyush Jena,Amal Joseph,Arnav Sharma,Ravi Kiran Sarvadevabhatla

Main category: cs.CV

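TL;DR: Proposes a robust, adaptive patching strategy that, combined with an Attention U-Net, improves binarization of historical stone inscription images, and releases a finely annotated dataset of Indic inscriptions.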

Details

Motivation: Historical stone inscription images have low contrast, non-uniform surface degradation, many distracting artifacts, and complex layouts, so existing binarization methods struggle to segment character regions; a more robust method is needed. Method: Propose a dynamic sampling and adaptive patching strategy and use the resulting patches to train an Attention U-Net for binarization, where the attention mechanism focuses on subtle structural cues. Result: The method significantly improves binarization performance across classical and deep learning baselines; after training on a single-script dataset, it also shows strong zero-shot generalization to other Indic and non-Indic scripts. Conclusion: The method produces clean, structured representations of inscriptions and lays the foundation for downstream tasks such as script identification, OCR, and historical text analysis. Abstract: Binarization is a popular first step towards text extraction in historical artifacts. Stone inscription images pose severe challenges for binarization due to poor contrast between etched characters and the stone background, non-uniform surface degradation, distracting artifacts, and highly variable text density and layouts. These conditions frequently cause existing binarization techniques to fail or to struggle to isolate coherent character regions. Many approaches sub-divide the image into patches to improve text fragment resolution and binarization performance. With this in mind, we present a robust and adaptive patching strategy to binarize challenging Indic inscriptions. The patches from our approach are used to train an Attention U-Net for binarization. The attention mechanism allows the model to focus on subtle structural cues, while our dynamic sampling and patch selection method ensures that the model learns to overcome surface noise and layout irregularities. We also introduce a carefully annotated, pixel-precise dataset of Indic stone inscriptions at the character-fragment level. We demonstrate that our novel patching mechanism significantly boosts binarization performance across classical and deep learning baselines. Despite training only on a single-script Indic dataset, our model exhibits strong zero-shot generalization to other Indic and non-Indic scripts, highlighting its robustness and script-agnostic generalization capabilities. By producing clean, structured representations of inscription content, our method lays the foundation for downstream tasks such as script identification, OCR, and historical text analysis.
Project page: https://ihdia.iiit.ac.in/shilalekhya-binarization/

[142] Systematic Evaluation of Depth Backbones and Semantic Cues for Monocular Pseudo-LiDAR 3D Detection

Samson Oseiwe Ajadalu

Main category: cs.CV

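TL;DR: This paper systematically evaluates how depth backbones and feature engineering affect monocular pseudo-LiDAR 3D detection, finding that the geometric fidelity of depth estimation matters more than added semantic features.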

Details

Motivation: Monocular 3D object detection is promising because of its low cost, but it is limited by the difficulty of estimating accurate depth from a single image; the paper examines how to improve its performance. Method: Under an identical pseudo-LiDAR generation and PointRCNN detection pipeline, compare the supervised depth model NeWCRFs with Depth Anything V2, and test point-cloud augmentations based on appearance and semantic cues. Result: NeWCRFs reaches 10.50% AP$_{3D}$ (IoU = 0.7) on the Moderate split; adding semantic features yields limited gains, and mask-based sampling can harm contextual geometry; a depth-accuracy-versus-distance analysis shows that coarse depth correctness does not fully predict strict 3D IoU. Conclusion: With an off-the-shelf LiDAR detector, depth-backbone choice and geometric fidelity dominate detection performance, far outweighing additional feature injection. Abstract: Monocular 3D object detection offers a low-cost alternative to LiDAR, yet remains less accurate due to the difficulty of estimating metric depth from a single image. We systematically evaluate how depth backbones and feature engineering affect a monocular Pseudo-LiDAR pipeline on the KITTI validation split. Specifically, we compare NeWCRFs (supervised metric depth) against Depth Anything V2 Metric-Outdoor (Base) under an identical pseudo-LiDAR generation and PointRCNN detection protocol. NeWCRFs yields stronger downstream 3D detection, achieving 10.50% AP$_{3D}$ at IoU = 0.7 on the Moderate split using grayscale intensity (Exp. 2). We further test point-cloud augmentations using appearance cues (grayscale intensity) and semantic cues (instance segmentation confidence). Contrary to the expectation that semantics would substantially close the gap, these features provide only marginal gains, and mask-based sampling can degrade performance by removing contextual geometry. Finally, we report a depth-accuracy-versus-distance diagnostic using ground-truth 2D boxes (including Ped/Cyc), highlighting that coarse depth correctness does not fully predict strict 3D IoU. Overall, under an off-the-shelf LiDAR detector, depth-backbone choice and geometric fidelity dominate performance, outweighing secondary feature injection.
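
The pseudo-LiDAR step that this evaluation holds fixed can be sketched as a pinhole back-projection of a metric depth map into a camera-frame point cloud. This is the standard construction rather than the paper's code; the function name and the intrinsics below are made-up illustration values, and a real pipeline would additionally transform the points into the LiDAR frame expected by the detector.

```python
import numpy as np


def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) metric depth map to an (N, 3) camera-frame point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column (u) and row (v) indices
    z = depth
    x = (u - cx) * z / fx  # pinhole model: u = fx * x / z + cx
    y = (v - cy) * z / fy  # pinhole model: v = fy * y / z + cy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no valid depth


# Toy 2x2 depth map at a constant 10 m with hypothetical intrinsics.
depth = np.full((2, 2), 10.0)
points = depth_to_points(depth, fx=100.0, fy=100.0, cx=0.5, cy=0.5)
```

Because every 3D coordinate scales with the predicted depth, errors in depth shift and stretch whole objects, which is consistent with the finding that geometric fidelity dominates downstream 3D IoU.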

[143] Shape Classification using Approximately Convex Segment Features

Bimal Kumar Ray

Main category: cs.CV

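TL;DR: Proposes an object classification method that requires no object alignment: the boundary is split into segments, and the segments are sorted by their features to compute similarity.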

Details

Motivation: To avoid the complexity and restrictions that object alignment imposes on traditional methods. Method: Normalize the object boundary and segment it into approximately convex segments, sort the segments in descending order of length, and measure boundary similarity with a bag of features consisting of segment length, number of extreme points, area, base, and width. Result: The method was tested on several datasets and produced acceptable classification results. Conclusion: The method effectively replaces object alignment, simplifies the classification pipeline, and is practical. Abstract: Existing object classification techniques based on descriptive features rely on object alignment to compute the similarity of objects for classification. This paper replaces the need for object alignment by sorting features. The object boundary is normalized and segmented into approximately convex segments, and the segments are then sorted in descending order of their length. The segment length, the number of extreme points in each segment, the segment area, and the base and width of the segments, together forming a bag of features, are used to measure the similarity between image boundaries. The proposed method is tested on datasets and acceptable results are observed.
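
To make the sorting idea concrete, here is a minimal sketch in which each segment is a row of five features and a shape descriptor is obtained by sorting rows by length. The zero-padding to a fixed number of segments and the L1 distance are assumptions chosen for illustration; the abstract does not state the similarity measure the paper uses.

```python
import numpy as np


def shape_descriptor(segments, k):
    """Sort per-segment feature rows by length (column 0, descending) and keep k rows."""
    segs = np.asarray(segments, dtype=float)
    order = np.argsort(-segs[:, 0], kind="stable")
    segs = segs[order][:k]
    pad = np.zeros((k - len(segs), segs.shape[1]))  # zero-pad shapes with fewer segments
    return np.vstack([segs, pad])


def shape_distance(a, b, k=3):
    """L1 distance between two sorted descriptors (assumed similarity measure)."""
    return float(np.abs(shape_descriptor(a, k) - shape_descriptor(b, k)).sum())


# Columns: length, extreme points, area, base, width (normalized, made-up values).
shape_a = [[0.2, 1, 0.01, 0.15, 0.03], [0.5, 2, 0.05, 0.40, 0.10]]
shape_b = [[0.5, 2, 0.05, 0.40, 0.10], [0.2, 1, 0.01, 0.15, 0.03]]  # same segments, different start
d = shape_distance(shape_a, shape_b)
```

The two toy shapes list the same segments starting from different boundary points, and sorting makes their descriptors identical, which is the sense in which sorting stands in for alignment.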

[144] MFC-RFNet: A Multi-scale Guided Rectified Flow Network for Radar Sequence Prediction

Wenjie Luo,Chuanhu Deng,Chaorong Li,Rongyao Deng,Qiang Yang

Main category: cs.CV

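TL;DR: This paper proposes MFC-RFNet, a generative model for high-accuracy precipitation nowcasting from radar echo sequences. It combines multi-scale feature communication, spatial alignment, and wavelet-guided fusion of high-frequency information, and uses rectified flow training for fast, high-quality sampling, outperforming baselines on multiple datasets.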

Details

Motivation: Precipitation nowcasting is crucial for disaster mitigation and economic planning, but existing methods still struggle with modeling multi-scale evolution, correcting inter-frame feature misalignment, and capturing spatiotemporal context. Method: Propose MFC-RFNet: a Wavelet-Guided Skip Connection (WGSC) preserves high-frequency detail; a Feature Communication Module (FCM) strengthens cross-scale interaction; a Condition-Guided Spatial Transform Fusion (CGSTF) corrects inter-frame displacement; rectified flow training enables stable, consistent generation; and lightweight Vision-RWKV blocks at key positions capture long-range spatiotemporal dependencies. Result: The model outperforms strong baselines on four public datasets (SEVIR, MeteoNet, Shanghai, and CIKM), producing clearer echo morphology especially at high rain-rate thresholds and maintaining forecast skill at longer lead times. Conclusion: Combining rectified flow training with multi-scale communication, spatial alignment, and frequency-aware fusion provides an efficient and robust solution for radar echo nowcasting. Abstract: Accurate and high-resolution precipitation nowcasting from radar echo sequences is crucial for disaster mitigation and economic planning, yet it remains a significant challenge. Key difficulties include modeling complex multi-scale evolution, correcting inter-frame feature misalignment caused by displacement, and efficiently capturing long-range spatiotemporal context without sacrificing spatial fidelity. To address these issues, we present the Multi-scale Feature Communication Rectified Flow (RF) Network (MFC-RFNet), a generative framework that integrates multi-scale communication with guided feature fusion. To enhance multi-scale fusion while retaining fine detail, a Wavelet-Guided Skip Connection (WGSC) preserves high-frequency components, and a Feature Communication Module (FCM) promotes bidirectional cross-scale interaction. To correct inter-frame displacement, a Condition-Guided Spatial Transform Fusion (CGSTF) learns spatial transforms from conditioning echoes to align shallow features. The backbone adopts rectified flow training to learn near-linear probability-flow trajectories, enabling few-step sampling with stable fidelity. Additionally, lightweight Vision-RWKV (RWKV) blocks are placed at the encoder tail, the bottleneck, and the first decoder layer to capture long-range spatiotemporal dependencies at low spatial resolutions with moderate compute.
Evaluations on four public datasets (SEVIR, MeteoNet, Shanghai, and CIKM) demonstrate consistent improvements over strong baselines, yielding clearer echo morphology at higher rain-rate thresholds and sustained skill at longer lead times. These results suggest that the proposed synergy of RF training with scale-aware communication, spatial alignment, and frequency-aware fusion presents an effective and robust approach for radar-based nowcasting.
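
The link between rectified flow training and few-step sampling can be seen in a small numerical sketch. It shows the generic rectified-flow regression target on a straight interpolation path and a fixed-step Euler sampler; `toy_velocity` is a stand-in for a trained network and is an assumption for illustration, not part of MFC-RFNet.

```python
import numpy as np


def rf_training_pair(x0, x1, t):
    """Rectified-flow regression pair: point on the straight path and its velocity target."""
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = np.array(x0, dtype=float), 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x


shift = np.array([2.0, -1.0])
toy_velocity = lambda x, t: shift  # hypothetical perfectly straight flow, not a trained model
x_start = np.zeros(2)
one_step = euler_sample(toy_velocity, x_start, num_steps=1)
four_steps = euler_sample(toy_velocity, x_start, num_steps=4)
```

When the learned trajectories are straight, as in this idealized case, one Euler step and four Euler steps land on the same endpoint; near-linear trajectories are what make few-step sampling viable in practice.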

[145] CrackSegFlow: Controllable Flow-Matching Synthesis for Generalizable Crack Segmentation with the CSF-50K Benchmark

Babak Asadi,Peiyang Wu,Mani Golparvar-Fard,Ramez Hajj

Main category: cs.CV

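TL;DR: This paper proposes CrackSegFlow, a controllable flow-matching synthesis framework that generates high-quality paired crack images and masks to address label scarcity and severe domain shift in automated crack segmentation for pavements and infrastructure. The method uses topology-preserving mask injection and boundary-gated modulation to generate geometrically aligned, photorealistic crack images, and can control crack coverage to produce diverse training data. Experiments show significant segmentation gains on multiple benchmarks, especially in cross-domain settings, and the authors release CSF-50K, a public dataset of 50,000 pairs.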

Details

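Motivation: Practical deployment of automated crack segmentation is limited by scarce pixel-level annotations and severe domain shift across sensors, illumination, textures, and annotation conventions. A method is needed that generates high-quality, strictly aligned, and diverse paired data (images and masks) to improve model generalization. Method: Propose CrackSegFlow, a two-stage flow-matching synthesis framework. In the first stage, topology-preserving mask injection is combined with boundary-gated modulation to generate aligned, photorealistic crack images from binary masks. In the second stage, a class-conditional flow-matching model synthesizes masks with controllable crack coverage, yielding balanced, topology-diverse paired data without additional manual annotation. Crack masks are further injected into crack-free backgrounds to diversify illumination and surface variation and reduce false detections caused by shadows, joints, and similar artifacts. Result: On five benchmarks covering four asphalt datasets and the crack class of one concrete dataset, training on real plus synthesized data improves performance on average by 5.37 mIoU and 5.13 F1; cross-domain synthesis guided by target-domain mask statistics yields average gains of 13.12 mIoU and 14.82 F1. Compared with diffusion models, CrackSegFlow samples faster, with higher fidelity and better mask-image alignment, and is particularly suited to modeling thin-structure crack geometry. Conclusion: CrackSegFlow effectively mitigates label scarcity and domain shift in crack segmentation. By controllably generating high-fidelity, strictly aligned paired data, it significantly improves model performance, with a clear advantage in cross-domain settings. Its fast deterministic sampling and high-quality output surpass existing diffusion methods, and the released large-scale public dataset CSF-50K supports follow-up research. Abstract: Automated crack segmentation is essential for scalable condition assessment of pavements and civil infrastructure, yet practical deployment is limited by scarce pixel-level labels and severe domain shift across sensors, illumination, textures, and annotation conventions. This paper presents CrackSegFlow, a controllable flow-matching synthesis framework that generates photorealistic crack images conditioned on binary masks while preserving strict mask-image alignment. The generator combines topology-preserving mask injection with boundary-gated modulation to maintain thin-structure continuity and suppress texture-driven false positives. A second class-conditional flow-matching model synthesizes crack masks with explicit control over crack coverage, enabling balanced, topology-diverse paired data without additional manual annotation. We further inject crack masks into crack-free backgrounds to diversify illumination and surface artifacts and reduce false positives caused by shadows, joints, and pavement markings. Experiments on five benchmarks spanning four asphalt datasets and the crack class of a concrete-domain dataset demonstrate consistent improvements under an established hybrid CNN-Transformer segmentation backbone and a fixed training protocol.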
With real plus synthesized pairs, in-domain performance improves on average by 5.37 mIoU and 5.13 F1, and target-guided cross-domain synthesis yields average gains of 13.12 mIoU and 14.82 F1 using only limited target mask statistics. Compared with diffusion-based semantic synthesis, CrackSegFlow provides substantially faster deterministic sampling and improves fidelity and mask-image alignment for thin-structure crack geometry. Finally, we release CSF-50K, a public dataset of 50,000 paired crack images and pixel-accurate masks for large-scale benchmarking of generalizable crack segmentation.

[146] VideoMemory: Toward Consistent Video Generation via Memory Integration

Jinsong Zhou,Yihua Du,Xinli Xu,Luozhou Wang,Zijie Zhuang,Yehang Zhang,Shuaibo Li,Xiaojun Hu,Bolan Su,Ying-cong Chen

Main category: cs.CV

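TL;DR: This paper proposes VideoMemory, an entity-centric framework that uses a Dynamic Memory Bank to keep entities consistent across shots in narrative video generation.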

Details

Motivation: Existing video generation models struggle to preserve identity when scenes change or when entities reappear after long absences; this paper aims to solve that problem. Method: Design a multi-agent system that decomposes the narrative from a structured script and uses a Dynamic Memory Bank to store and update visual and semantic descriptors of characters, props, and backgrounds, generating coherent keyframes and videos through a retrieval-update mechanism. Result: Experiments on a 54-case multi-shot consistency benchmark show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences. Conclusion: VideoMemory effectively improves entity consistency in long-horizon narrative video generation and offers a scalable, consistent solution for multi-shot video generation. Abstract: Maintaining consistent characters, props, and environments across multiple shots is a central challenge in narrative video generation. Existing models can produce high-quality short clips but often fail to preserve entity identity and appearance when scenes change or when entities reappear after long temporal gaps. We present VideoMemory, an entity-centric framework that integrates narrative planning with visual generation through a Dynamic Memory Bank. Given a structured script, a multi-agent system decomposes the narrative into shots, retrieves entity representations from memory, and synthesizes keyframes and videos conditioned on these retrieved states. The Dynamic Memory Bank stores explicit visual and semantic descriptors for characters, props, and backgrounds, and is updated after each shot to reflect story-driven changes while preserving identity. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation. To evaluate this setting, we construct a 54-case multi-shot consistency benchmark covering character-, prop-, and background-persistent scenarios. Extensive experiments show that VideoMemory achieves strong entity-level coherence and high perceptual quality across diverse narrative sequences.
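
The retrieval-update mechanism can be pictured with a small text-only sketch. The `EntityState` and `MemoryBank` classes, their method names, and the example entity are hypothetical; the described system also stores visual descriptors, which this sketch omits.

```python
from dataclasses import dataclass, field


@dataclass
class EntityState:
    identity: str  # stable description that should not drift between shots
    attributes: dict = field(default_factory=dict)  # story-driven, may change per shot


class MemoryBank:
    """Hypothetical text-only stand-in for a dynamic entity memory."""

    def __init__(self):
        self._entities = {}

    def retrieve(self, name, identity=""):
        """Return the stored state, creating it on first appearance."""
        return self._entities.setdefault(name, EntityState(identity))

    def update(self, name, **changes):
        """Apply story-driven changes while leaving the identity untouched."""
        self._entities[name].attributes.update(changes)


bank = MemoryBank()
bank.retrieve("Mira", identity="woman in her 30s, short black hair, red scarf")
bank.update("Mira", clothing="raincoat", mood="tired")  # written after an early shot
state = bank.retrieve("Mira")  # read back for a much later shot
```

Separating the fixed identity from mutable attributes is one simple way to let the story change an entity's appearance without losing who the entity is.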

[147] MGPC: Multimodal Network for Generalizable Point Cloud Completion With Modality Dropout and Progressive Decoding

Jiangyuan Liu,Hongxuan Ma,Yuhao Zhao,Zhe Liu,Jian Wang,Wei Zou

Main category: cs.CV

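TL;DR: This paper proposes MGPC, a multimodal point cloud completion framework that combines point clouds, RGB images, and text. A modality dropout strategy, a Transformer-based fusion module, and a progressive generator improve generalization to real-world scenes, and a large-scale dataset, MGPC-1M, is built for validation.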

Details

Motivation: Existing learning-based point cloud completion methods perform well on synthetic data, but their generalization to novel objects and real-world scenes is limited by reliance on a single modality, poor scalability, and insufficient generative capacity. Method: Propose the MGPC framework, which fuses point clouds, RGB images, and text; introduce a modality dropout strategy, a Transformer-based fusion module, and a progressive generator; and build MGPC-1M, a large-scale dataset with one million samples. Result: Experiments on MGPC-1M and real-world data show that MGPC significantly outperforms prior methods and is more robust and generalizable under a range of conditions. Conclusion: Through multimodal fusion and architectural innovations, MGPC improves the applicability and generalization of point cloud completion in real-world scenes and offers a viable path to practical use. Abstract: Point cloud completion aims to recover complete 3D geometry from partial observations caused by limited viewpoints and occlusions. Existing learning-based works, including 3D Convolutional Neural Network (CNN)-based, point-based, and Transformer-based methods, have achieved strong performance on synthetic benchmarks. However, due to the limitations of modality, scalability, and generative capacity, their generalization to novel objects and real-world scenarios remains challenging. In this paper, we propose MGPC, a generalizable multimodal point cloud completion framework that integrates point clouds, RGB images, and text within a unified architecture. MGPC introduces an innovative modality dropout strategy, a Transformer-based fusion module, and a novel progressive generator to improve robustness, scalability, and geometric modeling capability. We further develop an automatic data generation pipeline and construct MGPC-1M, a large-scale benchmark with over 1,000 categories and one million training pairs. Extensive experiments on MGPC-1M and in-the-wild data demonstrate that the proposed method consistently outperforms prior baselines and exhibits strong generalization under real-world conditions.
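
Modality dropout is commonly implemented by randomly blanking optional inputs during training so the model learns to cope when they are missing. The sketch below shows that generic form; the function name, the choice to always keep the point cloud, and zeroing as the "missing" signal are assumptions, since the abstract does not describe how MGPC implements the strategy.

```python
import numpy as np


def modality_dropout(features, p_drop, rng, always_keep="points"):
    """Zero out optional modalities independently with probability p_drop."""
    out = {}
    for name, feat in features.items():
        drop = name != always_keep and rng.random() < p_drop
        out[name] = np.zeros_like(feat) if drop else feat
    return out


rng = np.random.default_rng(0)
features = {"points": np.ones((4, 8)), "image": np.ones((4, 8)), "text": np.ones((4, 8))}
# p_drop=1.0 removes every optional input, leaving only the partial point cloud.
dropped = modality_dropout(features, p_drop=1.0, rng=rng)
```

At inference time the same model can then be run with whichever modalities happen to be available.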

[148] PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance

Siddarth Nilol Kundur Satish,Devesh Jaiswal,Hongyu Chen,Abhishek Bakshi

Main category: cs.CV

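TL;DR: Proposes PhysVideoGenerator, a framework that embeds a learnable physics prior into video generation. A PredictorP network predicts physical features from the diffusion latent space and injects them into the temporal attention layers of a DiT generator, demonstrating that the joint training paradigm is feasible and stable.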

Details

Motivation: Existing video generation models fall short in modeling real physical dynamics, often producing unnatural object collisions, inconsistent gravity, and temporal flickering, and they lack an accurate representation of real-world physics. Method: Design PredictorP, a lightweight network that regresses high-level physical features extracted by a pretrained V-JEPA 2 model from noisy diffusion latents, and inject the predicted physics tokens through cross-attention into the temporal attention layers of the DiT-based Latte generator to achieve physics-aware video generation. Result: Experiments show that the diffusion latent space contains enough information to recover V-JEPA 2 physical representations, that multi-task joint training is stable, and that the physics prior and diffusion generation can be optimized jointly end to end. Conclusion: This work demonstrates the technical feasibility of explicitly introducing a learnable physics prior into video generation and lays the groundwork for future large-scale evaluation of physics-aware generative models. Abstract: Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics, resulting in artifacts such as unnatural object collisions, inconsistent gravity, and temporal flickering. In this work, we propose PhysVideoGenerator, a proof-of-concept framework that explicitly embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2) directly from noisy diffusion latents. These predicted physics tokens are injected into the temporal attention layers of a DiT-based generator (Latte) via a dedicated cross-attention mechanism. Our primary contribution is demonstrating the technical feasibility of this joint training paradigm: we show that diffusion latents contain sufficient information to recover V-JEPA 2 physical representations, and that multi-task optimization remains stable over training. This report documents the architectural design, technical challenges, and validation of training stability, establishing a foundation for future large-scale evaluation of physics-aware generative models.

[149] TRec: Egocentric Action Recognition using 2D Point Tracks

Dennis Holzmann,Sven Wachsmuth

Main category: cs.CV

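TL;DR: This paper proposes a new approach to egocentric action recognition that uses 2D point tracks as an additional motion cue. Tracking information for randomly sampled image points is combined with a Transformer model, substantially improving recognition accuracy without detecting hands, objects, or interaction regions.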

Details

Motivation: Existing methods mostly rely on RGB appearance, human pose estimation, or their combination, and simple yet effective motion cues remain underexplored; this paper tests the potential of 2D point tracks as a lightweight motion representation for action recognition. Method: Use CoTracker to track randomly initialized image points through the video and feed the resulting trajectories, together with the corresponding frames, into a Transformer-based action recognition model; even the first frame plus its point tracks alone yields gains. Result: Experiments show that adding 2D point tracks consistently improves performance over the same model trained without motion information, with notable gains even when the full video sequence is not used. Conclusion: 2D point tracks are a lightweight and effective motion representation that can substantially improve egocentric action recognition accuracy and open a new direction for follow-up work. Abstract: We present a novel approach for egocentric action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and its associated point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for egocentric action understanding.

[150] BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion

Qingyao Tian,Bingyu Yang,Huai Liao,Xinyan Huang,Junyong Li,Dong Yi,Hongbin Liu

Main category: cs.CV

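TL;DR: This paper proposes BREATH-VL, a hybrid framework that combines a vision-language model (VLM) with vision-based registration for 6-DoF endoscopic camera localization, and builds BREATH, the largest in-vivo endoscopic localization dataset to date, substantially improving localization accuracy and generalization.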

Details

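Motivation: Because large-scale, high-quality, densely annotated, localization-oriented medical vision-language datasets are lacking, existing VLMs applied to 6-DoF endoscopic localization suffer from insufficient semantic understanding, weak fine-grained pose regression, and high latency when computing temporal features; it is therefore necessary to combine semantic and geometric strengths and make temporal reasoning more efficient. Method: Build the BREATH dataset and propose the BREATH-VL framework, which combines the semantic understanding of VLMs with the geometric alignment of vision-based registration methods, and introduce a lightweight context-learning mechanism that encodes motion history as linguistic prompts for efficient temporal reasoning. Result: Experiments show that the vision-language module achieves robust semantic localization in challenging surgical scenes, and that BREATH-VL reduces translational error by 25.5% compared with the best vision-only baseline while maintaining competitive computational latency. Conclusion: By fusing the generalizable semantic understanding of VLMs with the precise localization of geometric registration, and by modeling temporal context through language, BREATH-VL achieves higher accuracy, better generalization, and real-time performance in endoscopic localization, confirming the value of combining semantics and geometry. Abstract: Vision-language models (VLMs) have recently shown remarkable performance in navigation and localization tasks by leveraging large-scale pretraining for semantic understanding. However, applying VLMs to 6-DoF endoscopic camera localization presents several challenges: 1) the lack of large-scale, high-quality, densely annotated, and localization-oriented vision-language datasets in real-world medical settings; 2) limited capability for fine-grained pose regression; and 3) high computational latency when extracting temporal features from past frames. To address these issues, we first construct the BREATH dataset, the largest in-vivo endoscopic localization dataset to date, collected in the complex human airway. Building on this dataset, we propose BREATH-VL, a hybrid framework that integrates semantic cues from VLMs with geometric information from vision-based registration methods for accurate 6-DoF pose estimation. Our motivation lies in the complementary strengths of both approaches: VLMs offer generalizable semantic understanding, while registration methods provide precise geometric alignment. To further enhance the VLM's ability to capture temporal context, we introduce a lightweight context-learning mechanism that encodes motion history as linguistic prompts, enabling efficient temporal reasoning without expensive video-level computation. Extensive experiments demonstrate that the vision-language module delivers robust semantic localization in challenging surgical scenes.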
Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline, while achieving competitive computational latency.
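
One way to picture "motion history as linguistic prompts" is a function that turns the last few relative camera motions into a sentence for the VLM. The field names, units, and wording below are hypothetical; the abstract does not give the actual prompt format.

```python
def motion_history_prompt(deltas, max_steps=3):
    """Turn recent relative camera motions into a short text prompt (assumed wording)."""
    recent = deltas[-max_steps:]
    parts = [
        f"step -{len(recent) - i}: moved {d['forward_mm']:+.1f} mm forward, "
        f"rotated {d['yaw_deg']:+.1f} deg"
        for i, d in enumerate(recent)
    ]
    return "Recent bronchoscope motion: " + "; ".join(parts) + "."


# Made-up motion history, oldest first.
history = [
    {"forward_mm": 2.0, "yaw_deg": -5.0},
    {"forward_mm": 1.5, "yaw_deg": 0.0},
]
prompt = motion_history_prompt(history)
```

A few dozen text tokens of this kind are far cheaper to process than re-encoding past video frames, which is the efficiency argument the abstract makes.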

[151] Towards Real-world Lens Active Alignment with Unlabeled Data via Domain Adaptation

Wenyong Lia,Qi Jiang,Weijian Hu,Kailun Yang,Zhanjun Zhang,Wenjun Tian,Kaiwei Wang,Jian Bai

Main category: cs.CV

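TL;DR: Proposes DA3, a domain-adaptive active alignment method that uses simulation data plus a small number of unlabeled real-world images. An autoregressive domain transformation generator and an adversarial feature alignment strategy substantially narrow the gap between simulation and reality, enabling high-precision optical assembly with 46% higher accuracy than a purely simulation-based approach and 98.7% less data collection time.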

Details

Motivation: Complex imaging conditions create a domain gap between simulated and real images, limiting how well simulation-trained models generalize in practice; a method is needed that adapts across domains using only a small amount of unlabeled real data. Method: Propose Domain Adaptive Active Alignment (DA3), which combines an autoregressive domain transformation generator with an adversarial feature alignment strategy and uses self-supervised learning to extract domain-invariant image degradation features from a few unlabeled real images, making misalignment prediction more robust in real scenes. Result: Experiments on two lens types show that DA3 improves accuracy by 46% over a purely simulation pipeline and approaches the performance obtained by training on precisely labeled real data from 3 lens samples, while reducing on-device data collection time by 98.7%. Conclusion: Domain adaptation gives simulation-trained models robust real-world performance, validating the digital-twin pipeline as a practical way to improve the efficiency of large-scale optical assembly. Abstract: Active Alignment (AA) is a key technology for the large-scale automated assembly of high-precision optical systems. Compared with labor-intensive per-model on-device calibration, a digital-twin pipeline built on optical simulation offers a substantial advantage in generating large-scale labeled data. However, complex imaging conditions induce a domain gap between simulation and real-world images, limiting the generalization of simulation-trained models. To address this, we propose augmenting a simulation baseline with minimal unlabeled real-world images captured at random misalignment positions, mitigating the gap from a domain adaptation perspective. We introduce Domain Adaptive Active Alignment (DA3), which utilizes an autoregressive domain transformation generator and an adversarial-based feature alignment strategy to distill real-world domain information via self-supervised learning. This enables the extraction of domain-invariant image degradation features to facilitate robust misalignment prediction. Experiments on two lens types reveal that DA3 improves accuracy by 46% over a purely simulation pipeline. Notably, it approaches the performance achieved with precisely labeled real-world data collected on 3 lens samples, while reducing on-device data collection time by 98.7%. The results demonstrate that domain adaptation effectively endows simulation-trained models with robust real-world performance, validating the digital-twin pipeline as a practical solution to significantly enhance the efficiency of large-scale optical assembly.

[152] CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval

Zhipeng Qian,Zihan Liang,Yufei Ma,Ben Chen,Huangyu Dai,Yiwei Ma,Jiayi Ji,Chenyi Lei,Han Li,Xiaoshuai Sun

Main category: cs.CV

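TL;DR: This paper proposes CSMCIR, a framework that uses multi-level chain-of-thought prompting, a symmetric dual-tower architecture, and a dynamic memory bank strategy to address representation space fragmentation in composed image retrieval and achieve efficient query-target alignment.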

Details

Motivation: In existing CIR methods, heterogeneous modalities fragment the representation space, making it hard to align queries with targets and limiting retrieval performance. Method: 1) Propose Multi-level Chain-of-Thought (MCoT) prompting to generate discriminative image captions; 2) design a symmetric dual-tower structure with shared parameters; 3) adopt an entropy-based dynamic memory bank that stores high-quality negative samples. Result: The method achieves state-of-the-art performance on four benchmark datasets with higher training efficiency, and ablation studies confirm the effectiveness of each component. Conclusion: Through a unified representation framework, CSMCIR narrows the cross-modal alignment gap and offers a more efficient, symmetric solution for composed image retrieval. Abstract: Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency.
Comprehensive ablation studies further validate the effectiveness of each proposed component.

[153] MATANet: A Multi-context Attention and Taxonomy-Aware Network for Fine-Grained Underwater Recognition of Marine Species

Donghwan Lee,Byeongjin Kim,Geunhee Kim,Hyukjin Kwon,Nahyeon Maeng,Wooju Kim

Main category: cs.CV

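TL;DR: Proposes MATANet, a model that combines environmental context with biological taxonomic hierarchy to improve fine-grained classification of marine life.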

Details

Motivation: Existing methods overlook contextual interactions with the environment and do not fully incorporate the hierarchical structure of marine biological taxonomy, which limits fine-grained classification. Method: Design the Multi-context Attention and Taxonomy-Aware Network (MATANet), comprising a Multi-Context Environmental Attention Module (MCEAM) that captures the relationship between target regions and their surroundings, and a Hierarchical Separation-Induced Learning Module (HSLM) that embeds the taxonomic hierarchy into the feature space. Result: State-of-the-art performance on the FathomNet2025, FAIR1M, and LifeCLEF2015-Fish datasets. Conclusion: By integrating environmental context and taxonomic prior knowledge, MATANet improves fine-grained classification accuracy for marine species and supports ecological monitoring and conservation decisions. Abstract: Fine-grained classification of marine animals supports ecology, biodiversity and habitat conservation, and evidence-based policy-making. However, existing methods often overlook contextual interactions from the surrounding environment and insufficiently incorporate the hierarchical structure of marine biological taxonomy. To address these challenges, we propose MATANet (Multi-context Attention and Taxonomy-Aware Network), a novel model designed for fine-grained marine species classification. MATANet mimics expert strategies by using taxonomy and environmental context to interpret ambiguous features of underwater animals. It consists of two key components: a Multi-Context Environmental Attention Module (MCEAM), which learns relationships between regions of interest (ROIs) and their surrounding environments, and a Hierarchical Separation-Induced Learning Module (HSLM), which encodes taxonomic hierarchy into the feature space. MATANet combines instance and environmental features with taxonomic structure to enhance fine-grained classification. Experiments on the FathomNet2025, FAIR1M, and LifeCLEF2015-Fish datasets demonstrate state-of-the-art performance. The source code is available at: https://github.com/dhlee-work/fathomnet-cvpr2025-ssl

[154] RadDiff: Describing Differences in Radiology Image Sets with Natural Language

Xiaoxian Shen,Yuhui Zhang,Sahithi Ankireddy,Xiaohan Wang,Maya Varma,Henry Guo,Curtis Langlotz,Serena Yeung-Levy

Main category: cs.CV

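TL;DR: RadDiff is a multimodal agentic system for comparing radiology imaging studies. It combines images with clinical reports, medical knowledge injection, iterative hypothesis refinement, and targeted visual search to perform radiologist-style difference reasoning, and it outperforms general-purpose methods on the newly built RadDiffBench benchmark.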

Details

Motivation: Accurately identifying clinical differences between radiology image sets is essential for generating medical insights and interpreting medical AI systems, yet existing methods lack deep reasoning tailored to the medical domain. Method: Building on the proposer-ranker framework of VisDiff, introduce four innovations: medical knowledge injection through domain-adapted vision-language models, multimodal reasoning that integrates images with clinical reports, iterative hypothesis refinement over multiple rounds, and targeted visual search to capture subtle findings. Result: On RadDiffBench, a new benchmark of 57 expert-validated study pairs, RadDiff reaches 47% accuracy, and 50% when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline; it also generalizes to tasks such as COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related features. Conclusion: Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data. Abstract: Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff's versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.

[155] HyperCOD: The First Challenging Benchmark and Baseline for Hyperspectral Camouflaged Object Detection

Shuyan Bai,Tingfa Xu,Peifu Liu,Yuhao Qiu,Huiyan Bai,Huan Chen,Yanyan Peng,Jianan Li

Main category: cs.CV

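TL;DR: This paper introduces HyperCOD, the first benchmark dataset for hyperspectral camouflaged object detection (HCOD), and designs HSC-SAM, a model that adapts the SAM framework by decoupling the input into a spatial map and a spectral saliency map, achieving state-of-the-art performance in complex scenes.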

Details

Motivation: Existing RGB-based camouflaged object detection performs poorly in real scenes where color and texture are ambiguous; hyperspectral images are promising, but the lack of a dedicated large-scale dataset has held back HCOD. Method: Build the HyperCOD dataset of 350 high-resolution hyperspectral images covering complex real-world scenes; propose HSC-SAM, which decomposes the hyperspectral image into a spatial map fed to SAM's image encoder and a spectral saliency map used as an adaptive prompt, bridging the modality gap. Result: Experiments show that HSC-SAM achieves state-of-the-art performance on HyperCOD and generalizes robustly to other public hyperspectral datasets. Conclusion: HyperCOD provides an important foundation for HCOD research, and HSC-SAM demonstrates an effective paradigm for adapting foundation models to hyperspectral tasks, advancing camouflaged object detection in complex real-world scenes. Abstract: RGB-based camouflaged object detection struggles in real-world scenarios where color and texture cues are ambiguous. While hyperspectral imaging offers a powerful alternative by capturing fine-grained spectral signatures, progress in hyperspectral camouflaged object detection (HCOD) has been critically hampered by the absence of a dedicated, large-scale benchmark. To spur innovation, we introduce HyperCOD, the first challenging benchmark for HCOD. Comprising 350 high-resolution hyperspectral images, it features complex real-world scenarios with minimal objects, intricate shapes, severe occlusions, and dynamic lighting to challenge current models. The advent of foundation models like the Segment Anything Model (SAM) presents a compelling opportunity. To adapt SAM for HCOD, we propose HyperSpectral Camouflage-aware SAM (HSC-SAM). HSC-SAM reformulates the hyperspectral image by decoupling it into a spatial map fed to SAM's image encoder and a spectral saliency map that serves as an adaptive prompt. This translation effectively bridges the modality gap. Extensive experiments show that HSC-SAM sets a new state-of-the-art on HyperCOD and generalizes robustly to other public HSI datasets. The HyperCOD dataset and our HSC-SAM baseline provide a robust foundation to foster future research in this emerging area.

[156] I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing

Jinghan Yu,Junhao Xiao,Chenyu Zhu,Jiaming Li,Jia Li,HanMing Deng,Xirui Wang,Guoli Jia,Jianjun Li,Zhiyuan Ma,Xiang Bai,Bowen Zhou

Main category: cs.CV

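TL;DR: This paper proposes I2E, a new paradigm for text-guided image editing built on a "Decompose-then-Action" framework: it decomposes an image into manipulable object layers and uses a vision-language-action agent to perform atomic operations, significantly improving performance on complex compositional editing tasks.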

Details

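Motivation: Existing image editing methods based on pixel-level inpainting are limited on complex edits that require precise local control and multi-object spatial reasoning, mainly because planning and execution are coupled, object-level control granularity is lacking, and modeling is unstructured. The paper therefore aims for a more structured and controllable paradigm. Method: Proposes the I2E framework, comprising a Decomposer that converts an image into discrete, manipulable object layers and a physics-aware vision-language-action agent that parses complex instructions into a series of atomic actions via chain-of-thought reasoning, carrying out edits in a structured environment. It also builds I2E-Bench, a benchmark for evaluating multi-instance spatial reasoning and high-precision editing. Result: Experiments on I2E-Bench and several public benchmarks show that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and keeping multi-turn editing stable. Conclusion: By decoupling planning from execution and introducing object-level operations and structured environment modeling, I2E offers a more powerful and controllable paradigm for text-guided image editing, particularly for complex multi-object editing scenarios. Abstract: Existing text-guided image editing methods primarily rely on an end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still significantly struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. This paradigm is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. We also construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.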

[157] MVP: Enhancing Video Large Language Models via Self-supervised Masked Video Prediction

Xiaokun Sun,Zezhong Wu,Zewen Ding,Linli Xu

Main category: cs.CV

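TL;DR: Proposes Masked Video Prediction (MVP), a new post-training objective for video large language models that strengthens temporal reasoning and causal understanding by having the model reconstruct masked video segments.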

Details

Motivation: Existing reinforcement-learning-based post-training methods for video large language models focus mainly on holistic content understanding and lack explicit supervision for temporal coherence and inter-frame correlation, which limits the model's ability to capture complex dynamics and fine-grained visual causality. Method: Proposes Masked Video Prediction (MVP) as a new post-training objective, builds a scalable data synthesis pipeline to generate training samples, and trains with Group Relative Policy Optimization (GRPO) and a fine-grained reward function. Result: Experiments show that MVP effectively improves video reasoning, temporal logic, and causal understanding. Conclusion: By explicitly modeling temporal structure, MVP significantly strengthens the temporal reasoning of video large language models and suggests a new direction for video understanding tasks. Abstract: Reinforcement learning based post-training paradigms for Video Large Language Models (VideoLLMs) have achieved significant success by optimizing for visual-semantic tasks such as captioning or VideoQA. However, while these approaches effectively enhance perception abilities, they primarily target holistic content understanding, often lacking explicit supervision for intrinsic temporal coherence and inter-frame correlations. This tendency limits the models' ability to capture intricate dynamics and fine-grained visual causality. To explicitly bridge this gap, we propose a novel post-training objective: Masked Video Prediction (MVP). By requiring the model to reconstruct a masked continuous segment from a set of challenging distractors, MVP forces the model to attend to the sequential logic and temporal context of events. To support training at scale, we introduce a data synthesis pipeline capable of transforming arbitrary video corpora into MVP training samples, and further employ Group Relative Policy Optimization (GRPO) with a fine-grained reward function to enhance the model's understanding of video context and temporal properties. Comprehensive evaluations demonstrate that MVP enhances video reasoning capabilities by directly reinforcing temporal reasoning and causal understanding.
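The group-relative step that gives GRPO its name can be illustrated with a minimal sketch. It assumes the commonly described formulation in which each sampled response's reward is normalized by the mean and standard deviation of its own group; the function name and the 0/1 rewards (1 when the model picks the true masked segment) are hypothetical and are not taken from the paper, whose reward is described only as fine-grained.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical group of four sampled answers to one MVP question:
# reward 1.0 if the answer selects the true masked segment, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative advantages, so no separate value model is needed to form the baseline.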

[158] A Comparative Study of 3D Model Acquisition Methods for Synthetic Data Generation of Agricultural Products

Steven Moonen,Rob Salaets,Kenneth Batstone,Abdellatif Bey-Temsamani,Nick Michiels

Main category: cs.CV

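TL;DR: This paper explores alternative 3D modeling techniques for generating synthetic data to train AI object detection models in the agricultural industry, where CAD models are unavailable, and shows that fine-tuning on a small amount of real data significantly improves model performance.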

Details

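Motivation: In high-variance, low-volume manufacturing environments, acquiring and annotating large amounts of real training data is costly, and the agricultural sector lacks ready-made CAD models, which makes synthetic data hard to use; an alternative is needed. Method: Presents and compares several techniques that substitute for CAD files when generating synthetic datasets, including scanning to obtain highly representative 3D models and image-to-3D methods, and evaluates them for training object detection models in a potato and stone sorting scenario, combined with fine-tuning on a small real dataset. Result: Experiments show that synthetic data generated from highly representative 3D models can effectively train object detection models, and that fine-tuning on a small amount of real data significantly improves performance and can even compensate for less representative models. Conclusion: In agricultural scenarios without CAD models, 3D models obtained by scanning or image-to-3D methods can be used to build effective synthetic datasets; combined with fine-tuning on a small amount of real data, they achieve good object detection performance and offer a viable solution for data-scarce settings. Abstract: In the manufacturing industry, computer vision systems based on artificial intelligence (AI) are widely used to reduce costs and increase production. Training these AI models requires a large amount of training data that is costly to acquire and annotate, especially in high-variance, low-volume manufacturing environments. A popular approach to reduce the need for real data is the use of synthetic data that is generated by leveraging computer-aided design (CAD) models available in the industry. However, in the agricultural industry these models are not readily available, increasing the difficulty in leveraging synthetic data. In this paper, we present different techniques for substituting CAD files to create synthetic datasets. We measure their relative performance when used to train an AI object detection model to separate stones and potatoes in a bin picking environment. We demonstrate that highly representative 3D models acquired by scanning or by image-to-3D approaches can be used to generate synthetic data for training object detection models. Finetuning on a small real dataset can significantly improve the performance of the models and even yield similar performance when less representative models are used.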

[159] From Brute Force to Semantic Insight: Performance-Guided Data Transformation Design with LLMs

Usha Shrestha,Dmitry Ignatov,Radu Timofte

Main category: cs.CV

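TL;DR: Proposes a performance-aware, closed-loop method that lets large language models autonomously optimize augmentation transformation code by internalizing empirical performance signals, without reinforcement learning or explicit reward mechanisms.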

Details

Motivation: In code synthesis, data augmentation currently relies on heuristic design or brute-force search and makes little effective use of actual performance feedback. Method: Builds a new dataset of more than 6,000 empirically evaluated PyTorch augmentation functions, annotated only with downstream model accuracy, and fine-tunes with pairwise performance ordering, combined with Low-Rank Adaptation (LoRA) and direct prompting. Result: Evaluates up to 600x fewer candidates than brute-force search while maintaining competitive peak accuracy; ablations show that chain-of-thought prompting introduces syntactic noise, whereas direct prompting is more stable. Conclusion: Large language models can achieve task-level reasoning through non-textual feedback loops without explicit symbolic rewards, moving code generation from random synthesis toward task-aligned design. Abstract: Large language models (LLMs) have achieved notable performance in code synthesis; however, data-aware augmentation remains a limiting factor, handled via heuristic design or brute-force approaches. We introduce a performance-aware, closed-loop solution in the NNGPT ecosystem of projects that enables LLMs to autonomously engineer optimal transformations by internalizing empirical performance cues. We fine-tune LLMs with Low-Rank Adaptation on a novel repository of more than 6,000 empirically evaluated PyTorch augmentation functions, each annotated solely by downstream model accuracy. Training uses pairwise performance ordering (better-worse transformations), enabling alignment through empirical feedback without reinforcement learning, reward models, or symbolic objectives. This reduces the need for exhaustive search, achieving up to 600x fewer evaluated candidates than brute-force discovery while maintaining competitive peak accuracy and shifting generation from random synthesis to task-aligned design. Ablation studies show that structured Chain-of-Thought prompting introduces syntactic noise and degrades performance, whereas direct prompting ensures stable optimization in performance-critical code tasks. Qualitative and quantitative analyses demonstrate that the model internalizes semantic performance cues rather than memorizing syntax. These results show that LLMs can exhibit task-level reasoning through non-textual feedback loops, bypassing explicit symbolic rewards.
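One way to picture the pairwise performance ordering is the sketch below, which turns accuracy-annotated augmentation functions into better-worse training pairs. The record format, the `min_gap` threshold, and the example accuracies are all hypothetical; the abstract does not say how pairs are actually selected.

```python
from itertools import combinations

def build_preference_pairs(records, min_gap=0.01):
    """Turn (code, accuracy) records into better/worse pairs.

    Pairs whose accuracy gap is below min_gap are skipped, on the
    assumption (ours) that near-ties carry little ordering signal.
    """
    pairs = []
    for (code_a, acc_a), (code_b, acc_b) in combinations(records, 2):
        if abs(acc_a - acc_b) < min_gap:
            continue
        if acc_a > acc_b:
            pairs.append({"better": code_a, "worse": code_b})
        else:
            pairs.append({"better": code_b, "worse": code_a})
    return pairs

# Hypothetical augmentation snippets with made-up downstream accuracies.
records = [("flip", 0.81), ("rotate", 0.78), ("noise", 0.805)]
pairs = build_preference_pairs(records)
```

Here "flip" and "noise" are too close to be ordered, so only the two pairs involving "rotate" as the worse transformation remain.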

[160] EvalBlocks: A Modular Pipeline for Rapidly Evaluating Foundation Models in Medical Imaging

Jan Tagscherer,Sarah de Boer,Lena Philipp,Fennie van der Graaf,Dré Peeters,Joeran Bosma,Lars Leijten,Bogdan Obreja,Ewoud Smit,Alessa Hering

Main category: cs.CV

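TL;DR: EvalBlocks is a modular, plug-and-play framework for efficiently evaluating foundation models in medical imaging, supporting reproducible and scalable experiment management.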

Details

Motivation: Developing foundation models in medical imaging requires continuous monitoring of downstream performance, yet researchers often rely on manual, error-prone workflows that are inefficient. Method: Builds the EvalBlocks framework on Snakemake, supporting seamless integration of new datasets, models, aggregation methods, and evaluation strategies, with centralized experiment tracking, single-command reproduction, caching, and parallel execution. Result: The framework is validated on five state-of-the-art foundation models and three medical imaging classification tasks, improving evaluation efficiency and scalability. Conclusion: EvalBlocks simplifies the evaluation of foundation models, letting researchers iterate faster and focus on model innovation rather than evaluation chores. Abstract: Developing foundation models in medical imaging requires continuous monitoring of downstream performance. Researchers are burdened with tracking numerous experiments, design choices, and their effects on performance, often relying on ad-hoc, manual workflows that are inherently slow and error-prone. We introduce EvalBlocks, a modular, plug-and-play framework for efficient evaluation of foundation models during development. Built on Snakemake, EvalBlocks supports seamless integration of new datasets, foundation models, aggregation methods, and evaluation strategies. All experiments and results are tracked centrally and are reproducible with a single command, while efficient caching and parallel execution enable scalable use on shared compute infrastructure. Demonstrated on five state-of-the-art foundation models and three medical imaging classification tasks, EvalBlocks streamlines model evaluation, enabling researchers to iterate faster and focus on model innovation rather than evaluation logistics. The framework is released as open source software at https://github.com/DIAGNijmegen/eval-blocks.

[161] IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting

Wei Long,Haifeng Wu,Shiyin Jiang,Jinhua Zhang,Xinchun Ji,Shuhang Gu

Main category: cs.CV

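TL;DR: This paper proposes IDESplat, which improves the accuracy of Gaussian mean prediction in 3D Gaussian Splatting by iteratively refining depth probability estimates; it progressively refines the depth map with cascaded warp operations and a multiplicative fusion mechanism, achieving real-time, state-of-the-art reconstruction with strong generalization on multiple datasets.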

Details

Motivation: Existing methods usually rely on a single warp operation for depth estimation and cannot fully exploit multi-view geometric information, which yields unstable, coarse depth maps and hurts the prediction of Gaussian means in 3D Gaussian Splatting. Method: Proposes IDESplat, which introduces a Depth Probability Boosting Unit (DPBU) that multiplicatively fuses the epipolar attention maps produced by cascaded warp operations to remove the instability of a single warp; stacking multiple DPBUs forms an iterative depth estimation process that progressively identifies high-likelihood depth candidates and keeps refining the depth probability estimate. Result: Experiments on RE10K, ACID, and DL3DV show that IDESplat achieves efficient real-time reconstruction; on RE10K it improves PSNR by 0.33 dB over DepthSplat with only 10.7% of the parameters and 70% of the memory, and in cross-dataset experiments on DTU it improves PSNR by 2.95 dB, showing strong generalization. Conclusion: Through its iterative depth probability boosting strategy, IDESplat markedly improves the accuracy and stability of Gaussian mean prediction, addresses the limitations of single-warp depth estimation, and delivers efficient, accurate 3D Gaussian Splatting reconstruction that generalizes well. Abstract: Generalizable 3D Gaussian Splatting aims to directly predict Gaussian parameters using a feed-forward network for scene reconstruction. Among these parameters, Gaussian means are particularly difficult to predict, so depth is usually estimated first and then unprojected to obtain the Gaussian sphere centers. Existing methods typically rely solely on a single warp to estimate depth probability, which hinders their ability to fully leverage cross-view geometric cues, resulting in unstable and coarse depth maps. To address this limitation, we propose IDESplat, which iteratively applies warp operations to boost depth probability estimation for accurate Gaussian mean prediction. First, to eliminate the inherent instability of a single warp, we introduce a Depth Probability Boosting Unit (DPBU) that integrates epipolar attention maps produced by cascading warp operations in a multiplicative manner. Next, we construct an iterative depth estimation process by stacking multiple DPBUs, progressively identifying potential depth candidates with high likelihood. As IDESplat iteratively boosts depth probability estimates and updates the depth candidates, the depth map is gradually refined, resulting in accurate Gaussian means. We conduct experiments on RealEstate10K, ACID, and DL3DV. IDESplat achieves outstanding reconstruction quality and state-of-the-art performance with real-time efficiency. On RE10K, it outperforms DepthSplat by 0.33 dB in PSNR, using only 10.7% of the parameters and 70% of the memory. Additionally, our IDESplat improves PSNR by 2.95 dB over DepthSplat on the DTU dataset in cross-dataset experiments, demonstrating its strong generalization ability.
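The intuition behind multiplicative fusion can be shown with a small sketch: several per-pixel probability distributions over depth candidates are multiplied and renormalized, so a candidate must score well under every warp to stay likely. This illustrates the general idea only; the array shapes, the softmax over matching scores, and the function names are assumptions and do not reproduce the paper's DPBU.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_depth_probabilities(score_maps):
    """Multiply per-warp depth distributions and renormalize.

    score_maps: list of (H, W, D) arrays of matching scores over D
    depth candidates, one array per warp operation (hypothetical layout).
    """
    fused = np.ones(score_maps[0].shape, dtype=np.float64)
    for scores in score_maps:
        fused = fused * softmax(scores, axis=-1)
    return fused / fused.sum(axis=-1, keepdims=True)

def expected_depth(prob, candidates):
    """Depth as the probability-weighted mean of the candidates."""
    return (prob * candidates).sum(axis=-1)

# One pixel, three depth candidates, two warps that agree on candidate 1.
scores = np.array([[[0.0, 1.0, 0.0]]])
single = softmax(scores)
fused = fuse_depth_probabilities([scores, scores])
depth = expected_depth(fused, np.array([1.0, 2.0, 3.0]))
```

When the two warps agree, the fused distribution is sharper than either one alone, which is the stabilizing effect the abstract attributes to the multiplicative integration.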

Arun Muthukkumar

Main category: cs.CV

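TL;DR: Proposes MDENeRF, a framework that iteratively refines monocular depth estimates using NeRF depth information, with Bayesian fusion combining global structure and high-frequency detail.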

Details

Motivation: Depth maps produced by existing monocular depth estimation methods are overly smooth and lack fine geometric detail, which limits accurate scene understanding. Method: Builds a framework with three components: an initial monocular estimate, a NeRF trained on perturbed viewpoints with per-pixel uncertainty, and Bayesian fusion of the monocular and NeRF depths; NeRF uncertainty is derived from the volume rendering process to iteratively inject high-frequency detail. Result: Shows superior performance on key metrics and experiments using indoor scenes from the SUN RGB-D dataset. Conclusion: MDENeRF improves the detail accuracy of monocular depth estimation while preserving good global structure, making it suitable for vision tasks that need high-fidelity depth. Abstract: Monocular depth estimation has applications in many fields, such as autonomous navigation and extended reality, making it an essential computer vision task. However, current methods often produce smooth depth maps that lack the fine geometric detail needed for accurate scene understanding. We propose MDENeRF, an iterative framework that refines monocular depth estimates using depth information from Neural Radiance Fields (NeRFs). MDENeRF consists of three components: (1) an initial monocular estimate for global structure, (2) a NeRF trained on perturbed viewpoints, with per-pixel uncertainty, and (3) Bayesian fusion of the noisy monocular and NeRF depths. We derive NeRF uncertainty from the volume rendering process to iteratively inject high-frequency fine details. Meanwhile, our monocular prior maintains global structure. We demonstrate superior performance on key metrics and experiments using indoor scenes from the SUN RGB-D dataset.
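A standard form of Bayesian fusion for two noisy estimates is inverse-variance weighting, sketched below per pixel. It assumes independent Gaussian noise on both depth sources, which is our simplification; the abstract does not specify the paper's noise model, and all names and numbers here are illustrative.

```python
import numpy as np

def fuse_depths(d_mono, var_mono, d_nerf, var_nerf):
    """Per-pixel Gaussian fusion of two depth estimates.

    Each estimate is weighted by its precision (1 / variance), so the
    more certain source dominates. Returns fused depth and variance.
    """
    w_mono = 1.0 / var_mono
    w_nerf = 1.0 / var_nerf
    fused_var = 1.0 / (w_mono + w_nerf)
    fused = fused_var * (w_mono * d_mono + w_nerf * d_nerf)
    return fused, fused_var

# Two pixels: the NeRF depth is confident at the first, uncertain at the second.
d_mono = np.array([2.0, 2.0])
var_mono = np.array([0.04, 0.04])
d_nerf = np.array([2.2, 2.2])
var_nerf = np.array([0.01, 0.16])
fused, fused_var = fuse_depths(d_mono, var_mono, d_nerf, var_nerf)
```

Where the NeRF is confident the fused depth moves toward it (2.16), and where it is uncertain the monocular prior dominates (2.04), matching the division of labor described in the abstract.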

[163] FLNet: Flood-Induced Agriculture Damage Assessment using Super Resolution of Satellite Images

Sanidhya Ghosal,Anurag Sharma,Sushil Ghildiyal,Mukesh Saini

Main category: cs.CV

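TL;DR: This paper proposes FLNet, a deep-learning-based super-resolution model that enhances the resolution of Sentinel-2 satellite imagery for more accurate assessment of flood damage to crops, and markedly improves the F1-score of the "Full Damage" class on a Bihar dataset.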

Details

Motivation: Traditional manual surveys are slow and biased, and existing satellite-based methods are limited by cloud cover and low spatial resolution, so they cannot meet the need for fast, accurate agricultural disaster assessment. Method: Proposes FLNet, a deep-learning-based architecture that uses super-resolution to enhance Sentinel-2 imagery from 10 m to 3 m resolution before classifying damage. Result: Tested on the BFCD-22 dataset, FLNet raises the F1-score of the "Full Damage" class from 0.83 to 0.89, close to the performance of commercial high-resolution imagery. Conclusion: FLNet offers an economical, scalable solution that could help India move nationwide from manual to automated, high-fidelity damage assessment. Abstract: Distributing government relief efforts after a flood is challenging. In India, crops are widely affected by floods, so rapid and accurate crop damage assessment is crucial for effective post-disaster agricultural management. Traditional manual surveys are slow and biased, while current satellite-based methods face challenges like cloud cover and low spatial resolution. To bridge this gap, this paper introduces FLNet, a novel deep-learning-based architecture that uses super-resolution to enhance Sentinel-2 satellite images from 10 m to 3 m spatial resolution before classifying damage. We test our model on the Bihar Flood Impacted Croplands Dataset (BFCD-22), and the results show an improvement in the critical "Full Damage" F1-score from 0.83 to 0.89, nearly matching the 0.89 score of commercial high-resolution imagery. This work presents a cost-effective and scalable solution, paving the way for a nationwide shift from manual to automated, high-fidelity damage assessment.

[164] HemBLIP: A Vision-Language Model for Interpretable Leukemia Cell Morphology Analysis

Julie van Logtestijn,Petru Manescu

Main category: cs.CV

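TL;DR: HemBLIP is a vision-language model that generates interpretable descriptions of peripheral blood cell morphology, supporting transparent and scalable hematological diagnostics.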

Details

Motivation: Existing deep learning models for leukemia diagnosis lack interpretability, which limits clinical trust and adoption. Method: Using a new dataset of 14k healthy and leukemic cells, adapts a vision-language model with both full fine-tuning and LoRA parameter-efficient fine-tuning, and compares against MedGEMMA. Result: HemBLIP outperforms the comparison model in caption quality and morphological accuracy, and LoRA further improves performance at lower computational cost. Conclusion: Vision-language models hold significant promise for interpretable hematological diagnostics, especially for improving clinical trust and scalability. Abstract: Microscopic evaluation of white blood cell morphology is central to leukemia diagnosis, yet current deep learning models often act as black boxes, limiting clinical trust and adoption. We introduce HemBLIP, a vision-language model designed to generate interpretable, morphology-aware descriptions of peripheral blood cells. Using a newly constructed dataset of 14k healthy and leukemic cells paired with expert-derived attribute captions, we adapt a general-purpose VLM via both full fine-tuning and LoRA-based parameter-efficient training, and benchmark against the biomedical foundation model MedGEMMA. HemBLIP achieves higher caption quality and morphological accuracy, while LoRA adaptation provides further gains with significantly reduced computational cost. These results highlight the promise of vision-language models for transparent and scalable hematological diagnostics.

[165] FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Mingyu Ouyang,Kevin Qinghong Lin,Mike Zheng Shou,Hwee Tou Ng

Main category: cs.CV

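TL;DR: This paper proposes FocusUI, an efficient user interface (UI) visual grounding framework that reduces the number of visual tokens by selecting the image patches most relevant to the instruction while preserving positional continuity, substantially cutting computational overhead while maintaining high performance.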

Details

Motivation: Existing vision-language models generate large numbers of visual tokens when processing high-resolution UI screenshots, which is computationally expensive and dilutes attention, whereas humans attend only to key regions when interacting; a more efficient, human-like UI grounding method is needed. Method: Proposes the FocusUI framework: 1) a fusion strategy that combines an instruction-conditioned score with a rule-based UI-graph score to supervise the filtering of redundant image patches; 2) a PosPad strategy that inserts a special marker at the position of each run of consecutively dropped patches to preserve positional continuity and thereby improve grounding accuracy. Result: Experiments on four benchmarks show that FocusUI outperforms existing GUI-specific models; on ScreenSpot-Pro, FocusUI-7B improves on GUI-Actor-7B by 3.7%; even when only 30% of visual tokens are retained, performance drops by only 3.2%, with up to 1.44x faster inference and 17% lower peak GPU memory. Conclusion: By selectively retaining key visual tokens and structurally preserving positional continuity, FocusUI achieves efficient and precise UI visual grounding and offers an effective route to lightweight deployment of VLMs on UI tasks. Abstract: Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
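The PosPad rule stated in the abstract (each contiguous run of dropped tokens becomes one marker at the run's last index) can be sketched on a flat token list. This is a toy illustration: the marker string, the `(position, token)` output format, and the one-dimensional layout are assumptions, whereas the real model operates on visual tokens with its own position encoding.

```python
def pospad(tokens, keep_mask, marker="<pos_pad>"):
    """Drop unselected tokens, leaving one marker per dropped run.

    Kept tokens retain their original position index. Each contiguous
    run of dropped tokens is replaced by a single marker that takes
    the position index of the run's last token.
    """
    out = []
    n = len(tokens)
    for i, (tok, keep) in enumerate(zip(tokens, keep_mask)):
        if keep:
            out.append((i, tok))
        else:
            run_ends_here = (i == n - 1) or keep_mask[i + 1]
            if run_ends_here:
                out.append((i, marker))
    return out

# Hypothetical sequence of six patch tokens; patches 1, 2 and 4 are dropped.
tokens = ["p0", "p1", "p2", "p3", "p4", "p5"]
keep = [True, False, False, True, False, True]
compressed = pospad(tokens, keep)
```

The six-token sequence shrinks to five entries, and the marker positions record where the dropped patches were, which is the positional information that plain pruning discards.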

[166] ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation

Xu Zhang,Cheng Da,Huan Yang,Kun Gai,Ming Lu,Zhan Ma

Main category: cs.CV

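TL;DR: Proposes the Residual Tokenizer (ResTok), a 1D visual tokenizer with a hierarchical residual structure that improves autoregressive image generation through cross-level feature fusion and semantic residuals, reaching a gFID of 2.34 on ImageNet-256 with only 9 sampling steps.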

Details

Motivation: Existing 1D visual tokenizers follow language-model design and treat visual data as a flat sequence, ignoring the hierarchical and residual structure that is central to vision, which limits representational capacity and generation efficiency. Method: Designs ResTok, which builds hierarchical residual representations for both image tokens and latent tokens; progressive merging enables cross-level feature fusion, and semantic residuals between hierarchies reduce information redundancy; a hierarchical autoregressive generator predicts an entire level of latent tokens at once to speed up generation. Result: Achieves a gFID of 2.34 on ImageNet-256 with only 9 sampling steps, clearly ahead of prior methods; the model shows stronger representational capacity and more concentrated latent distributions. Conclusion: Reintroducing vision-specific hierarchical and residual priors into visual tokenizer design effectively improves the quality and efficiency of autoregressive image generation, and ResTok offers a new approach to unifying language-style generation with visual model architectures. Abstract: Existing 1D visual tokenizers for autoregressive (AR) generation largely follow the design principles of language modeling, as they are built directly upon transformers whose priors originate in language, yielding single-hierarchy latent tokens and treating visual data as flat sequential token streams. However, this language-like formulation overlooks key properties of vision, particularly the hierarchical and residual network designs that have long been essential for convergence and efficiency in visual models. To bring "vision" back to vision, we propose the Residual Tokenizer (ResTok), a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. The hierarchical representations obtained through progressive merging enable cross-level feature fusion at each layer, substantially enhancing representational capacity. Meanwhile, the semantic residuals between hierarchies prevent information overlap, yielding more concentrated latent distributions that are easier for AR modeling. Cross-level bindings consequently emerge without any explicit constraints. To accelerate the generation process, we further introduce a hierarchical AR generator that substantially reduces sampling steps by predicting an entire level of latent tokens at once rather than generating them strictly token-by-token. Extensive experiments demonstrate that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps. Code is available at https://github.com/Kwai-Kolors/ResTok.

[167] FUSION: Full-Body Unified Motion Prior for Body and Hands via Diffusion

Enes Duran,Nikos Athanasiou,Muhammed Kocabas,Michael J. Black,Omid Taheri

Main category: cs.CV

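TL;DR: This paper proposes FUSION, the first diffusion-based unconditional full-body motion prior. By unifying existing hand and body motion datasets, it generates full-body motion with detailed hand articulation and shows superior naturalness and control precision across several tasks.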

Details

Motivation: Existing full-body motion synthesis methods often ignore hand motion or are confined to specific scenarios, and there is no large-scale dataset that captures both body motion and detailed hand motion, which limits realism and diversity. A method that jointly models body and hand motion is therefore needed. Method: Unifies existing hand motion datasets with large-scale body motion data to build full-body motion sequences; proposes FUSION, a diffusion-based unconditional full-body motion prior that uses a pose-based representation; and designs an optimization pipeline that refines the latent space to generate task-specific motion, supporting object-driven hand-body coordination and language-guided self-interaction motion. Result: FUSION surpasses state-of-the-art skeletal control models on the Keypoint Tracking task on HumanML3D and produces more natural motion; in new applications such as object interaction and self-interaction, it achieves precise control over hand motion while maintaining full-body coordination. Conclusion: FUSION is the first diffusion prior to jointly model body and hand motion; through data unification and latent-space optimization it significantly improves the quality and scope of full-body motion synthesis, particularly in hand detail and full-body coordination. Abstract: Hands are central to interacting with our surroundings and conveying gestures, making their inclusion essential for full-body motion synthesis. Despite this, existing human motion synthesis methods fall short: some ignore hand motions entirely, while others generate full-body motions only for narrowly scoped tasks under highly constrained settings. A key obstacle is the lack of large-scale datasets that jointly capture diverse full-body motion with detailed hand articulation. While some datasets capture both, they are limited in scale and diversity. Conversely, large-scale datasets typically focus either on body motion without hands or on hand motions without the body. To overcome this, we curate and unify existing hand motion datasets with large-scale body motion data to generate full-body sequences that capture both hand and body. We then propose the first diffusion-based unconditional full-body motion prior, FUSION, which jointly models body and hand motion. Despite using a pose-based motion representation, FUSION surpasses state-of-the-art skeletal control models on the Keypoint Tracking task in the HumanML3D dataset and achieves superior motion naturalness. Beyond standard benchmarks, we demonstrate that FUSION can go beyond typical uses of motion priors through two applications: (1) generating detailed full-body motion including fingers during interaction given the motion of an object, and (2) generating Self-Interaction motions using an LLM to transform natural language cues into actionable motion constraints. For these applications, we develop an optimization pipeline that refines the latent space of our diffusion model to generate task-specific motions. Experiments on these tasks highlight precise control over hand motion while maintaining plausible full-body coordination. The code will be public.

[168] PosterVerse: A Full-Workflow Framework for Commercial-Grade Poster Generation with HTML-Based Scalable Typography

Junle Liu,Peirong Zhang,Yuyi Zhang,Pengyu Yan,Hui Zhou,Xinyue Zhou,Fengjun Guo,Lianwen Jin

Main category: cs.CV

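TL;DR: This paper proposes PosterVerse, a full-workflow method for commercial-grade poster generation that automates the process from design blueprint to high-accuracy text rendering using fine-tuned language models, customized diffusion models, and an MLLM-powered HTML engine, and releases PosterDNA, the first Chinese poster dataset to include HTML typography files.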

Details

Motivation: Existing automated poster generation systems suffer from incomplete design workflows, inaccurate text rendering, and insufficient flexibility for commercial use, so they fall short of the aesthetic and information-density demands of commercial-grade posters. Method: PosterVerse uses a three-stage pipeline: 1) a fine-tuned LLM extracts key design elements from user requirements to produce a blueprint; 2) a customized diffusion model generates the graphical background; 3) an MLLM-powered HTML engine performs unified layout and text rendering. A commercial-grade HTML-based dataset, PosterDNA, is also built for training and evaluation. Result: Experiments show that PosterVerse reliably generates commercial-grade posters that are visually appealing, with accurate text alignment and customizable layouts, and performs especially well on small and high-density text. Conclusion: PosterVerse offers an efficient, practical solution for automated commercial poster design and, together with the PosterDNA dataset, advances poster generation with high-accuracy text rendering. Abstract: Commercial-grade poster design demands the seamless integration of aesthetic appeal with precise, informative content delivery. Current automated poster generation systems face significant limitations, including incomplete design workflows, poor text rendering accuracy, and insufficient flexibility for commercial applications. To address these challenges, we propose PosterVerse, a full-workflow, commercial-grade poster generation method that seamlessly automates the entire design process while delivering high-density and scalable text rendering. PosterVerse replicates professional design through three key stages: (1) blueprint creation using fine-tuned LLMs to extract key design elements from user requirements, (2) graphical background generation via customized diffusion models to create visually appealing imagery, and (3) unified layout-text rendering with an MLLM-powered HTML engine to guarantee high text accuracy and flexible customization. In addition, we introduce PosterDNA, a commercial-grade, HTML-based dataset tailored for training and validating poster design models. To the best of our knowledge, PosterDNA is the first Chinese poster generation dataset to introduce HTML typography files, enabling scalable text rendering and fundamentally solving the challenges of rendering small and high-density text. Experimental results demonstrate that PosterVerse consistently produces commercial-grade posters with appealing visuals, accurate text alignment, and customizable layouts, making it a promising solution for automating commercial poster design. The code and model are available at https://github.com/wuhaer/PosterVerse.

[169] Padé Neurons for Efficient Neural Models

Onur Keleş,A. Murat Tekalp

Main category: cs.CV

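TL;DR: This paper proposes Padé neurons (Paons), a new non-linear neuron model inspired by Padé approximants that offers stronger non-linear expressiveness, needs fewer layers, and is compatible with existing neuron models; experiments show performance better than or equal to conventional models on image super-resolution, compression, and classification.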

Details

Motivation: To strengthen the non-linear expressiveness of neurons in neural networks and overcome the limitations of conventional point-wise activation functions and existing non-linear neuron models, the paper proposes a more general and efficient neuron model. Method: Designs a new neuron structure, the Paon, based on Padé approximation theory; each Paon can learn a different non-linear function of its inputs and provides strong non-linearity with fewer layers; the model unifies several existing neuron forms and can directly replace classic neurons in existing networks. Result: In ResNet-based image super-resolution, compression, and classification models, replacing conventional neurons with Paons gives equal or better performance with fewer layers. Conclusion: Paons are a more powerful and flexible neuron model that improves model efficiency and expressiveness, with broad application potential and extensibility. Abstract: Neural networks commonly employ the McCulloch-Pitts neuron model, which is a linear model followed by a point-wise non-linear activation. Various researchers have already advanced inherently non-linear neuron models, such as quadratic neurons, generalized operational neurons, generative neurons, and super neurons, which offer stronger non-linearity compared to point-wise activation functions. In this paper, we introduce a novel and better non-linear neuron model called Padé neurons (Paons), inspired by Padé approximants. Paons offer several advantages, such as diversity of non-linearity, since each Paon learns a different non-linear function of its inputs, and layer efficiency, since Paons provide stronger non-linearity in far fewer layers compared to piecewise linear approximation. Furthermore, Paons include all previously proposed neuron models as special cases, thus any neuron model in any network can be replaced by Paons. We note that there has been a proposal to employ the Padé approximation as a generalized point-wise activation function, which is fundamentally different from our model. To validate the efficacy of Paons, in our experiments, we replace classic neurons in some well-known neural image super-resolution, compression, and classification models based on the ResNet architecture with Paons. Our comprehensive experimental results and analyses demonstrate that neural models built with Paons provide performance equal to or better than their classic counterparts with a smaller number of layers. The PyTorch implementation code for Paon is open-sourced at https://github.com/onur-keles/Paon.
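A Padé approximant is a ratio of two polynomials, so a neuron in that spirit can be sketched as a learned rational function of its inputs. The form below, with element-wise input powers and a denominator of 1 plus an absolute value to avoid division by zero, is our own illustrative choice and should not be read as the paper's definition; the weights and shapes are hypothetical.

```python
import numpy as np

def pade_neuron(x, p_weights, q_weights, bias=0.0):
    """Rational neuron: weighted input powers in numerator and denominator.

    x:         input vector of shape (d,)
    p_weights: (m, d) numerator weights for powers 1..m of x
    q_weights: (n, d) denominator weights for powers 1..n of x
    The denominator is 1 + |.|, an assumed safeguard that keeps it >= 1.
    """
    numerator = bias + sum(w @ x ** (k + 1) for k, w in enumerate(p_weights))
    denominator = 1.0 + abs(sum(v @ x ** (k + 1) for k, v in enumerate(q_weights)))
    return numerator / denominator

x = np.array([1.0, 2.0])
p = np.array([[0.5, -1.0]])

# With zero denominator weights the neuron reduces to a plain linear neuron.
linear_case = pade_neuron(x, p, np.zeros((1, 2)), bias=0.1)

# A non-zero denominator makes the response a rational function of x.
rational_case = pade_neuron(x, p, np.array([[1.0, 0.0]]), bias=0.1)
```

The first call shows how a classic linear neuron appears as a special case of the rational form, in line with the abstract's claim that earlier neuron models are recovered as special cases.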

[170] Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model

Yuan Wang,Borui Liao,Huijuan Huang,Jinda Lu,Ouxiang Li,Kuien Liu,Meng Wang,Xiang Wang

Main category: cs.CV

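TL;DR: This paper proposes REACT, a frame-level reward model for evaluating structural distortions in generated videos; it identifies abnormal object appearances and interactions through reasoning, and introduces a dynamic sampling mechanism and a two-stage training framework to improve evaluation accuracy and interpretability.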

Details

Motivation: Existing video reward models mostly assess visual quality, motion quality, and text alignment, but overlook structural distortions such as abnormal object appearances and interactions, which strongly affect the overall quality of generated videos; a dedicated evaluation model is needed. Method: Proposes REACT with a two-stage training framework: the first stage uses supervised fine-tuning with masked loss to inject domain knowledge, and the second uses reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards; a large-scale human preference dataset is built and an efficient chain-of-thought (CoT) synthesis pipeline generates additional data; at inference, a dynamic sampling mechanism focuses on the frames most likely to be distorted. Result: Experiments show that REACT effectively evaluates structural distortions in generated videos, combining accurate quantitative assessment with interpretable attribution analysis, and complements other reward models; REACT-Bench is released as a new benchmark. Conclusion: REACT provides an effective way to evaluate structural distortions in generated videos; with a fine-grained annotation taxonomy, a sound training strategy, and its inference mechanism, it improves both judgment quality and interpretability, advancing quality evaluation for T2V generation. Abstract: Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generative video. To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortion evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.

[171] Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation

Raül Pérez-Gonzalo,Riccardo Magro,Andreas Espersen,Antonio Agudo

Main category: cs.CV

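TL;DR: Proposes a wind turbine blade segmentation method that needs no dense annotation; using region growing and merging plus a new data augmentation strategy, RegionMix, it achieves efficient, accurate segmentation with good cross-site generalization.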

Details

Motivation: Conventional pixel-wise deep learning methods for blade segmentation depend on large amounts of annotated data and are hard to scale, so an annotation-efficient segmentation method is needed. Method: Recasts pixel-level segmentation as a binary region classification problem, generates image regions with an unsupervised Modular Adaptive Region Growing technique, and refines the result with adaptive thresholding and region merging; RegionMix augments the training samples to improve generalization. Result: The method achieves state-of-the-art segmentation accuracy on data from multiple wind farms and shows strong cross-site generalization. Conclusion: The proposed method greatly reduces the dependence on annotated data while maintaining high accuracy and good generalization, making it suitable for automating real-world wind turbine inspection. Abstract: Reliable operation of wind turbines requires frequent inspections, as even minor surface damages can degrade aerodynamic performance, reduce energy output, and accelerate blade wear. Central to automating these inspections is the accurate segmentation of turbine blades from visual data. This task is traditionally addressed through dense, pixel-wise deep learning models. However, such methods demand extensive annotated datasets, posing scalability challenges. In this work, we introduce an annotation-efficient segmentation approach that reframes the pixel-level task into a binary region classification problem. Image regions are generated using a fully unsupervised, interpretable Modular Adaptive Region Growing technique, guided by image-specific Adaptive Thresholding and enhanced by a Region Merging process that consolidates fragmented areas into coherent segments. To improve generalization and classification robustness, we introduce RegionMix, an augmentation strategy that synthesizes new training samples by combining distinct regions. Our framework demonstrates state-of-the-art segmentation accuracy and strong cross-site generalization by consistently segmenting turbine blades across distinct wind farms.

[172] Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

Zitong Huang,Kaidong Zhang,Yukang Ding,Chao Gao,Rui Ding,Ying Chen,Wangmeng Zuo

Main category: cs.CV

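TL;DR: Proposes LocalDPO, a post-training alignment framework for video generation based on localized preference pairs; an automated pipeline builds the preference data efficiently, and optimization is performed at the spatio-temporal region level, improving video quality and agreement with human preferences.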

Details

Motivation: Existing preference-based optimization methods rely on multi-sample ranking and task-specific critic models; they are inefficient and give ambiguous supervision, which makes it hard to align text-to-video diffusion models with human preferences. Method: An automated pipeline takes high-quality real videos as positive samples and creates negatives by locally corrupting them with random spatio-temporal masks, restoring only the masked regions with the base model. A region-aware DPO loss restricts preference learning to the corrupted regions, enabling fine-grained alignment. Result: Experiments on Wan2.1 and CogVideoX show that LocalDPO outperforms other post-training methods in video fidelity, temporal coherence, and human preference scores. Conclusion: LocalDPO offers a more efficient, fine-grained alignment paradigm for video generation models that needs no external critic or manual annotation and markedly improves generation quality. Abstract: Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that efficiently collects preference data, generating preference pairs with a single inference per prompt and eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.
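
The abstract does not give the form of the region-aware DPO loss. The sketch below is one plausible reading, assuming a Diffusion-DPO style objective in which denoising errors of the trained policy and a frozen reference are compared, with the errors averaged only over masked (corrupted) positions. The flat-list tensors, the dictionary keys, and `beta` are hypothetical.

```python
import math


def masked_mse(pred, target, mask):
    """Mean squared error over positions where mask == 1 only."""
    count = sum(mask)
    if count == 0:
        return 0.0
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / count


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def region_aware_dpo_loss(pos, neg, mask, beta=1.0):
    """`pos` and `neg` are dicts with flat lists under the assumed keys
    'target' (true noise), 'policy' and 'ref' (predicted noise)."""
    def gap(sample):
        # How much worse the policy is than the reference, inside the mask.
        return (masked_mse(sample["policy"], sample["target"], mask)
                - masked_mse(sample["ref"], sample["target"], mask))

    # Positive margin: the policy improved more on the real (positive)
    # video than on the locally corrupted (negative) one.
    margin = gap(neg) - gap(pos)
    return neg_log_sigmoid(beta * margin)
```

Because positions outside the mask contribute nothing, the gradient signal is confined to the regions where the positive and negative videos actually differ.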

Zhihao Zhu,Jiafeng Liang,Shixin Jiang,Jinlan Fu,Ming Liu,Guanglu Sun,See-Kiong Ng,Bing Qin

Main category: cs.CV

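TL;DR: Identifies "textual inertia" as a key failure mode of large models in video reasoning and proposes a training-free Active Visual-Context Refinement method to make reasoning more robust.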

Details

Motivation: Chain-of-thought reasoning in large multimodal models is fragile under textual hallucination; the paper studies how models ignore visual evidence and blindly follow erroneous text. Method: The LogicGraph Perturbation Protocol systematically evaluates the self-reflection ability of LMMs with different architectures, and Active Visual-Context Refinement combines active visual re-grounding with an adaptive context refinement strategy. Result: Experiments show that existing models self-correct in fewer than 10% of cases and commonly propagate textual errors; the proposed method markedly suppresses hallucination propagation and improves reasoning accuracy and robustness. Conclusion: Textual inertia is one of the main bottlenecks of current LMMs on complex reasoning tasks, and Active Visual-Context Refinement offers an effective, general way to make reasoning chains more reliable. Abstract: Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode termed textual inertia, where once a textual hallucination occurs in the thinking process, models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. To systematically investigate this, we propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs spanning both native reasoning architectures and prompt-driven paradigms to evaluate their self-reflection capabilities. The results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. To mitigate this, we introduce Active Visual-Context Refinement, a training-free inference paradigm which orchestrates an active visual re-grounding mechanism to enforce fine-grained verification coupled with an adaptive context refinement strategy to summarize and denoise the reasoning history. Experiments demonstrate that our approach significantly stifles hallucination propagation and enhances reasoning robustness.

[174] Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction

Jiaxin Huang,Yuanbo Yang,Bangbang Yang,Lin Ma,Yuewen Ma,Yiyi Liao

Main category: cs.CV

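TL;DR: Proposes Gen3R, which combines the strong priors of a foundational reconstruction model and a video diffusion model for scene-level 3D generation. The VGGT reconstruction model is used to produce geometric latents, and an adapter is trained so that they align with the appearance latents of a pre-trained video diffusion model. Jointly generating these disentangled yet aligned latents yields RGB videos with the corresponding 3D geometry (camera poses, depth maps, and a global point cloud). Experiments show state-of-the-art 3D scene generation under single- and multi-image conditioning, and generative priors make reconstruction more robust, illustrating the mutual benefit of tightly coupling reconstruction and generative models.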

Details

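Motivation: Existing 3D generation methods often struggle to deliver geometric accuracy and realistic appearance at the same time, and lack effective modeling of complex scene-level structure. The paper aims to improve the quality and robustness of 3D scene generation by fusing the geometric priors of a reconstruction model with the appearance priors of a generative model. Method: Gen3R uses the VGGT reconstruction model to extract geometric latents and trains an adapter that aligns its tokens with the appearance latents of a pre-trained video diffusion model. By aligning and jointly generating these disentangled latents, it outputs RGB video and 3D geometry (camera poses, depth maps, and point clouds) together. Result: The method reaches state-of-the-art results in single- and multi-image conditioned 3D scene generation; the generated 3D structure has higher geometric consistency and visual quality; and the generative prior in turn makes reconstruction more robust. Conclusion: Gen3R tightly couples reconstruction and generative models, shows that the two complement each other in 3D scene generation, and points to a new direction for 3D content generation. Abstract: We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.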

[175] GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning

Wenshuai Li,Xiantai Xiang,Zixiao Wen,Guangyao Zhou,Ben Niu,Feng Wang,Lijia Huang,Qiantong Wang,Yuxin Hu

Main category: cs.CV

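TL;DR: Proposes GeoReason, a framework that improves the cognitive reliability of remote sensing vision-language models (RS-VLMs) on complex spatial tasks; a logic-driven dataset and a two-stage training strategy effectively reduce logical hallucinations.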

Details

Motivation: Existing RS-VLMs often exhibit logical hallucinations: the answer is correct, but the reasoning is flawed or relies on positional shortcuts rather than spatial logic, which undermines the reliability of high-level reasoning. Method: The GeoReason framework first builds GeoReason-Bench, a dataset of 4,000 reasoning trajectories, then trains in two stages, supervised knowledge initialization followed by consistency-aware reinforcement learning, with a logical consistency reward based on option permutation that penalizes reasoning drift. Result: Experiments show that the method significantly improves the cognitive reliability and interpretability of RS-VLMs and outperforms existing advanced methods on multiple metrics. Conclusion: By synchronizing internal reasoning with the final decision, GeoReason strengthens the logical consistency and spatial reasoning of RS-VLMs and provides a reliable framework for high-level geospatial understanding. Abstract: The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.
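
The option permutation idea can be illustrated with a small sketch. It is an assumption about how such a reward might be computed, not the paper's Logical Consistency Reward: the reward here is the fraction of option orderings under which the chosen option's content is still the correct one. `answer_fn`, the toy models, and the example question are hypothetical.

```python
from itertools import permutations


def consistency_reward(answer_fn, question, options, correct):
    """Fraction of option orderings for which the model still selects
    the correct content. `answer_fn(question, options)` returns an index."""
    orderings = list(permutations(options))
    hits = 0
    for ordering in orderings:
        choice = answer_fn(question, list(ordering))
        hits += ordering[choice] == correct
    return hits / len(orderings)


def positional_model(question, options):
    """Toy model with a positional shortcut: always picks the first option."""
    return 0


def content_model(question, options):
    """Toy model that tracks content regardless of option order."""
    return options.index("bridge")


question = "Which structure crosses the river?"
options = ["bridge", "runway", "stadium", "harbor"]
shortcut_score = consistency_reward(positional_model, question, options, "bridge")
grounded_score = consistency_reward(content_model, question, options, "bridge")
```

A model that answers by position is right only when the correct option happens to land in its favored slot, so its reward falls well below that of a model whose answer follows the content.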

[176] Pixel-Wise Multimodal Contrastive Learning for Remote Sensing Images

Leandro Stival,Ricardo da Silva Torres,Helio Pedrini

Main category: cs.CV

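TL;DR: Proposes PIxel-wise Multimodal Contrastive (PIMC), a new multimodal self-supervised method that uses two-dimensional pixel-level representations (such as recurrence plots) to improve feature extraction from satellite image time series (SITS) and combines them with remote sensing imagery for downstream tasks. Experiments show that it outperforms state-of-the-art models on several Earth observation tasks.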

Details

Motivation: Existing deep learning models usually process SITS data directly from raw pixel values or complete time series and struggle to capture pixel-level variation; the paper aims to improve feature representations through more informative 2D representations and multimodal contrastive learning. Method: Recurrence plots are generated from pixel-based vegetation index time series (NDVI, EVI, SAVI) as 2D representations, and the PIMC framework jointly optimizes encoders for SITS and remote sensing imagery with multimodal self-supervised contrastive learning. Result: The method surpasses SOTA methods on pixel-level forecasting and classification on the PASTIS dataset and on land cover classification on the EuroSAT dataset, confirming that 2D representations and contrastive learning improve feature quality. Conclusion: The PIMC framework effectively improves representation learning for SITS and RSI and is a robust multimodal self-supervised method suited to a range of Earth observation tasks. Abstract: Satellites continuously generate massive volumes of data, particularly for Earth observation, including satellite image time series (SITS). However, most deep learning models are designed to process either entire images or complete time series sequences to extract meaningful features for downstream tasks. In this study, we propose a novel multimodal approach that leverages pixel-wise two-dimensional (2D) representations to encode visual property variations from SITS more effectively. Specifically, we generate recurrence plots from pixel-based vegetation index time series (NDVI, EVI, and SAVI) as an alternative to using raw pixel values, creating more informative representations. Additionally, we introduce PIxel-wise Multimodal Contrastive (PIMC), a new multimodal self-supervision approach that produces effective encoders based on two-dimensional pixel time series representations and remote sensing imagery (RSI). To validate our approach, we assess its performance on three downstream tasks: pixel-level forecasting and classification using the PASTIS dataset, and land cover classification on the EuroSAT dataset. Moreover, we compare our results to state-of-the-art (SOTA) methods on all downstream tasks. Our experimental results show that the use of 2D representations significantly enhances feature extraction from SITS, while contrastive learning improves the quality of representations for both pixel time series and RSI. These findings suggest that our multimodal method outperforms existing models in various Earth observation tasks, establishing it as a robust self-supervision framework for processing both SITS and RSI.
Code available on
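
For readers unfamiliar with recurrence plots, the sketch below computes NDVI from near-infrared and red reflectance using its standard definition, then turns a single pixel's NDVI series into a binary recurrence matrix. The thresholded form and the `eps` value are assumptions; the abstract does not say whether PIMC uses thresholded or unthresholded plots, and the reflectance values are made up.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    denom = nir + red
    return (nir - red) / denom if denom else 0.0


def recurrence_plot(series, eps):
    """Binary recurrence matrix: entry (i, j) is 1 when the values at
    time steps i and j differ by at most `eps`."""
    n = len(series)
    return [[1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
            for i in range(n)]


# Hypothetical (NIR, Red) reflectance pairs for one pixel over four dates.
observations = [(0.30, 0.20), (0.35, 0.21), (0.72, 0.08), (0.31, 0.20)]
pixel_series = [ndvi(nir, red) for nir, red in observations]
plot = recurrence_plot(pixel_series, eps=0.1)
```

The resulting matrix is a 2D image per pixel, so it can be fed to an image encoder alongside the remote sensing imagery branch.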

[177] Klear: Unified Multi-Task Audio-Video Joint Generation

Jun Wang,Chunyu Qiang,Yuxin Guo,Yiran Wang,Xijuan Zeng,Chen Zhang,Pengfei Wan

Main category: cs.CV

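TL;DR: Proposes Klear, a model for joint audio-video generation; improvements to model architecture, training strategy, and data construction make it significantly better than existing methods in audio-visual alignment, multimodal fusion, and generalization, with performance comparable to Veo 3.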

Details

Motivation: Existing non-commercial audio-video generation methods suffer from audio-visual asynchrony, lip-speech mismatch, and unimodal degradation, mainly because of weak audio-visual correspondence modeling, limited generalization, and scarce high-quality densely captioned data. Method: A single-tower architecture with unified DiT blocks and an Omni-Full Attention mechanism strengthens audio-visual alignment; progressive multi-task training and a multi-stage curriculum prevent unimodal collapse; the authors build the first large-scale audio-video dataset with dense captions and propose an automated data-construction pipeline that filters high-quality triplets. Result: Klear substantially surpasses prior methods across tasks, producing high-fidelity generation that is semantically and temporally aligned, supports instruction following, and is robust in out-of-distribution scenarios. Conclusion: Through innovations in architecture, training, and data, Klear provides a unified, scalable path toward next-generation audio-video synthesis. Abstract: Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which can stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Klear and delve into three axes--model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we adopt a progressive multitask regime--random modality masking to joint optimization across tasks, and a multistage curriculum, yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal collapse. For datasets, we present the first large-scale audio-video dataset with dense captions, and introduce a novel automated data-construction pipeline which annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Klear scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings while generalizing robustly to out-of-distribution scenarios. Across tasks, it outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.

[178] Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning

Yifan Wang,Yanyu Li,Sergey Tulyakov,Yun Fu,Anil Kag

Main category: cs.CV

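TL;DR: Proposes Diffusion-DRF, a differentiable reward flow based on a frozen, off-the-shelf vision-language model for fine-tuning video diffusion models; it needs no additional reward model or preference dataset and mitigates reward hacking and training collapse.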

Details

Motivation: Existing DPO-based text-to-video methods rely on non-differentiable preference signals, which makes training label-intensive, bias-prone, and easy to game, leading to reward hacking and unstable training. Method: A frozen, off-the-shelf vision-language model (VLM) serves as a training-free critic; VLM feedback is backpropagated directly through the diffusion denoising chain, turning logit-level responses into token-aware gradients. An automated, aspect-structured prompting pipeline is designed, and gradient checkpointing enables efficient updates. Result: Without additional reward models or preference datasets, the method improves video quality and semantic alignment while mitigating reward hacking and training collapse. Conclusion: Diffusion-DRF is a model-agnostic, generalizable method that improves the quality and alignment of videos generated by diffusion models while avoiding key weaknesses of conventional preference learning. Abstract: Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy-to-game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and collapse -- without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.

[179] ToTMNet: FFT-Accelerated Toeplitz Temporal Mixing Network for Lightweight Remote Photoplethysmography

Vladimir Frants,Sos Agaian,Karen Panetta

Main category: cs.CV

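TL;DR: Proposes ToTMNet, a lightweight remote photoplethysmography (rPPG) model that replaces attention with an FFT-accelerated Toeplitz structure for efficient long-sequence modeling and achieves strong heart-rate estimation accuracy on multiple datasets.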

Details

Motivation: Existing attention-based temporal modeling has high computational complexity and many parameters, making it hard to deploy on resource-constrained devices; a more efficient temporal modeling approach is needed for rPPG signal extraction. Method: ToTMNet performs global temporal mixing with an FFT-accelerated Toeplitz matrix and uses a compact gated temporal mixer that combines a local depthwise convolution with gated global Toeplitz mixing. Result: The model reaches 1.055 bpm MAE (Pearson correlation 0.996) on UBFC-rPPG and 1.582 bpm MAE (Pearson correlation 0.994) in the cross-domain setting (SCAMPS to UBFC-rPPG), with only 63k parameters. Ablations show that the gating mechanism is essential to performance, especially under domain shift. Conclusion: Toeplitz-structured temporal mixing is a more efficient and practical alternative to attention for rPPG temporal modeling and is suited to low-power, real-time applications. Abstract: Remote photoplethysmography (rPPG) estimates a blood volume pulse (BVP) waveform from facial videos captured by commodity cameras. Although recent deep models improve robustness compared to classical signal-processing approaches, many methods increase computational cost and parameter count, and attention-based temporal modeling introduces quadratic scaling with respect to the temporal length. This paper proposes ToTMNet, a lightweight rPPG architecture that replaces temporal attention with an FFT-accelerated Toeplitz temporal mixing layer. The Toeplitz operator provides a full-sequence temporal receptive field using a linear number of parameters in the clip length and can be applied in near-linear time using circulant embedding and FFT-based convolution. ToTMNet integrates the global Toeplitz temporal operator into a compact gated temporal mixer that combines a local depthwise temporal convolution branch with gated global Toeplitz mixing, enabling efficient long-range temporal filtering while only having 63k parameters. Experiments on two datasets, UBFC-rPPG (real videos) and SCAMPS (synthetic videos), show that ToTMNet achieves strong heart-rate estimation accuracy with a compact design. On UBFC-rPPG intra-dataset evaluation, ToTMNet reaches 1.055 bpm MAE with Pearson correlation 0.996. In a synthetic-to-real setting (SCAMPS to UBFC-rPPG), ToTMNet reaches 1.582 bpm MAE with Pearson correlation 0.994. Ablation results confirm that the gating mechanism is important for effectively using global Toeplitz mixing, especially under domain shift.
The main limitation of this preprint study is the use of only two datasets; nevertheless, the results indicate that Toeplitz-structured temporal mixing is a practical and efficient alternative to attention for rPPG.
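
The circulant-embedding trick named in the abstract can be sketched in a few lines. An n x n Toeplitz matrix is fixed by its first column and first row (2n - 1 numbers, hence linear in the clip length); embedding it in a larger circulant matrix lets a matrix-vector product be computed with FFTs. The code below is a pure-Python illustration with a simple recursive radix-2 FFT, not ToTMNet's layer; all function names are assumptions.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + tw, even[k] - tw
    return out


def toeplitz_matvec_fft(col, row, x):
    """Multiply the Toeplitz matrix with first column `col` and first row
    `row` (row[0] == col[0]) by `x` using a circulant embedding."""
    n = len(x)
    m = 1
    while m < 2 * n - 1:  # circulant size: power of two, at least 2n - 1
        m *= 2
    # First column of the circulant: col, zero padding, then reversed row tail.
    v = list(col) + [0.0] * (m - 2 * n + 1) + list(reversed(row[1:]))
    xp = list(x) + [0.0] * (m - n)
    product = [a * b for a, b in zip(fft(v), fft(xp))]
    y = fft(product, invert=True)
    return [val.real / m for val in y[:n]]  # divide by m to finish the inverse


def toeplitz_matvec_dense(col, row, x):
    """Quadratic-time reference: T[i][j] = col[i-j] if i >= j else row[j-i]."""
    n = len(x)
    return [sum((col[i - j] if i >= j else row[j - i]) * x[j] for j in range(n))
            for i in range(n)]
```

The FFT route costs O(n log n) per sequence instead of the O(n^2) of a dense product, which is the "near-linear time" the abstract refers to.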

[180] ImLoc: Revisiting Visual Localization with Image-based Representation

Xudong Jiang,Fangjinhua Wang,Silvano Galliani,Christoph Vogel,Marc Pollefeys

Main category: cs.CV

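TL;DR: Proposes a visual localization method based on a 2D image representation augmented with estimated depth maps; combined with dense matchers, it achieves high accuracy with efficient storage and computation and reaches state-of-the-art performance on multiple benchmarks.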

Details

Motivation: Existing visual localization methods trade off geometric reasoning ability, the need for centralized reconstruction, and ease of updating; a method is needed that is easy to build and maintain yet highly accurate. Method: The method uses a 2D image representation and augments each image with an estimated depth map to capture geometric structure, matches with dense matchers, and combines compact compression with a GPU-accelerated LO-RANSAC for efficient computation and storage. Result: It sets a new state-of-the-art accuracy on several standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes. Conclusion: While remaining easy to build and maintain, the method achieves high-accuracy visual localization in challenging conditions through depth-map augmentation and an efficient implementation, and supports a flexible trade-off between accuracy and memory efficiency. Abstract: Existing visual localization methods are typically either 2D image-based, which are easy to build and maintain but limited in effective geometric reasoning, or 3D structure-based, which achieve high accuracy but require a centralized reconstruction and are difficult to update. In this work, we revisit visual localization with a 2D image-based representation and propose to augment each image with estimated depth maps to capture the geometric structure. Supported by the effective use of dense matchers, this representation is not only easy to build and maintain, but also achieves the highest accuracy in challenging conditions. With compact compression and a GPU-accelerated LO-RANSAC implementation, the whole pipeline is efficient in both storage and computation and allows for a flexible trade-off between accuracy and memory efficiency. Our method achieves a new state-of-the-art accuracy on various standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes. Code will be available at https://github.com/cvg/Hierarchical-Localization.

[181] Choreographing a World of Dynamic Objects

Yanzhe Lyu,Chen Geng,Karthik Dharmarajan,Yunzhi Zhang,Hadi Alzayer,Shangzhe Wu,Jiajun Wu

Main category: cs.CV

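TL;DR: Proposes CHORD, a universal generative pipeline that synthesizes 4D dynamics of dynamic objects and scenes by extracting Lagrangian motion information from 2D videos; it is universal, versatile, and category-agnostic.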

Details

Motivation: Creating 4D dynamic phenomena with traditional rule-based graphics pipelines is time-consuming and does not scale, while existing learning-based methods need large-scale datasets that may not cover every object category of interest. Method: A distillation-based pipeline extracts rich Lagrangian motion information from the Eulerian representations of 2D videos, thereby inheriting the universality of video generative models. Result: Experiments demonstrate the method's effectiveness in generating a variety of multi-body 4D dynamics and show its usefulness for generating robot manipulation policies. Conclusion: CHORD is universal, flexible, and independent of object category, with clear advantages over existing methods. Abstract: Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we present a universal generative pipeline, CHORD, for CHOReographing Dynamic objects and scenes and synthesizing this type of phenomenon. Traditional rule-based graphics pipelines to create these dynamics are based on category-specific heuristics, yet are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories of interest. Our approach instead inherits the universality from the video generative models by proposing a distillation-based pipeline to extract the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a diverse range of multi-body 4D dynamics, show its advantage compared to existing methods, and demonstrate its applicability in generating robotics manipulation policies. Project page: https://yanzhelyu.github.io/chord

eess.IV [Back]

[182] Staged Voxel-Level Deep Reinforcement Learning for 3D Medical Image Segmentation with Noisy Annotations

Yuyang Fu,Xiuzhen Guo,Ji Shi

Main category: eess.IV

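TL;DR: Proposes a Staged Voxel-Level Deep Reinforcement Learning (SVL-DRL) framework for robust medical image segmentation under noisy annotations; each voxel is modeled as an autonomous agent with a composite reward function, and the method achieves state-of-the-art performance on multiple public datasets.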

Details

Motivation: Medical image segmentation is often affected by noisy annotations: complex organ morphology and differences among annotators cause labeling errors that degrade model performance. Human annotators can correct such errors using prior knowledge, which inspires the paper to emulate this process for robustness. Method: The SVL-DRL framework formulates noisy annotation as a voxel-dependent problem and adopts a staged reinforcement learning strategy. A voxel-level asynchronous advantage actor-critic (vA3C) module treats each voxel as an independent agent that dynamically refines its state representation. A new action space and a composite reward function combining the Dice coefficient with spatial continuity reduce the influence of erroneous labels. Result: On three public medical image datasets, SVL-DRL reaches state-of-the-art performance under various experimental settings, with average improvements of over 3% in Dice and IoU. Conclusion: SVL-DRL effectively mitigates the impact of noisy annotations on medical image segmentation, shows good convergence and preserves semantics, and offers a new approach to robust segmentation under weak supervision. Abstract: Deep learning has achieved significant advancements in medical image segmentation. Currently, obtaining accurate segmentation outcomes is critically reliant on large-scale datasets with high-quality annotations. However, noisy annotations are frequently encountered owing to the complex morphological structures of organs in medical images and variations among different annotators, which can substantially limit the efficacy of segmentation models. Motivated by the fact that medical imaging annotators can correct labeling errors during segmentation based on prior knowledge, we propose an end-to-end Staged Voxel-Level Deep Reinforcement Learning (SVL-DRL) framework for robust medical image segmentation under noisy annotations. This framework employs a dynamic iterative update strategy to automatically mitigate the impact of erroneous labels without requiring manual intervention.
The key advancements of SVL-DRL over existing works include: i) formulating noisy annotations as a voxel-dependent problem and addressing it through a novel staged reinforcement learning framework which guarantees robust model convergence; ii) incorporating a voxel-level asynchronous advantage actor-critic (vA3C) module that conceptualizes each voxel as an autonomous agent, which allows each agent to dynamically refine its own state representation during training, thereby directly mitigating the influence of erroneous labels; iii) designing a novel action space for the agents, along with a composite reward function that strategically combines the Dice value and a spatial continuity metric to significantly boost segmentation accuracy while maintaining semantic integrity. Experiments on three public medical image datasets demonstrate state-of-the-art (SoTA) performance under various experimental settings, with an average improvement of over 3% in both Dice and IoU scores.
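
As a hedged illustration of a composite reward that mixes Dice with spatial continuity, the sketch below computes Dice and IoU from binary masks using their standard definitions and pairs Dice with a simple continuity score. The continuity measure (the fraction of face-adjacent voxel pairs that share a label) and the `weight` parameter are assumptions; the abstract does not define the paper's metric or how the terms are balanced.

```python
def flatten(volume):
    """Flatten a depth x height x width nested list into one list."""
    return [v for plane in volume for row in plane for v in row]


def dice(pred, target):
    """Dice coefficient of two flat binary masks: 2|A∩B| / (|A| + |B|)."""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2 * inter / total


def iou(pred, target):
    """Intersection over union of two flat binary masks."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return 1.0 if union == 0 else inter / union


def continuity(volume):
    """Assumed metric: share of face-adjacent voxel pairs with equal labels."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    same = pairs = 0
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                for dz, dy, dx in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    nz, ny, nx = z + dz, y + dy, x + dx
                    if nz < depth and ny < height and nx < width:
                        pairs += 1
                        same += volume[z][y][x] == volume[nz][ny][nx]
    return same / pairs if pairs else 1.0


def composite_reward(pred_volume, target_volume, weight=0.5):
    """Weighted mix of Dice agreement and smoothness of the prediction."""
    overlap = dice(flatten(pred_volume), flatten(target_volume))
    return weight * overlap + (1 - weight) * continuity(pred_volume)
```

In a voxel-level setting, a reward of this kind would discourage isolated label flips that raise overlap locally but fragment the segmented structure.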

Table of Contents

cs.CL [Back]

[1] DeepResearch-Slice: Bridging the Retrieval-Utilization Gap via Explicit Text Slicing

[2] Internal Reasoning vs. External Control: A Thermodynamic Analysis of Sycophancy in Large Language Models

[3] Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

[4] Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support

[5] OpenAI GPT-5 System Card

[6] WRAVAL -- WRiting Assist eVALuation

[7] The Instruction Gap: LLMs get lost in Following Instruction

[8] Advances and Challenges in Semantic Textual Similarity: A Comprehensive Survey

[9] Less is more: Not all samples are effective for evaluation

[10] GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators

[11] LLM_annotate: A Python package for annotating and analyzing fiction characters

[12] Topic Segmentation Using Generative Language Models

[13] Bare-Metal Tensor Virtualization: Overcoming the Memory Wall in Edge-AI Inference on ARM64

[14] A path to natural language through tokenisation and transformers

[15] Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models

[16] Breaking the Assistant Mold: Modeling Behavioral Variation in LLM Based Procedural Character Generation

[17] Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms

[18] Tigrinya Number Verbalization: Rules, Algorithm, and Implementation

[19] Implicit Graph, Explicit Retrieval: Towards Efficient and Interpretable Long-horizon Memory for Large Language Models

[20] PCoA: A New Benchmark for Medical Aspect-Based Summarization With Phrase-Level Context Attribution

[21] Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models

[22] The Critical Role of Aspects in Measuring Document Similarity

[23] Grading Scale Impact on LLM-as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale

[24] Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks

[25] Prompting Underestimates LLM Capability for Time Series Classification

[26] EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

[27] SegNSP: Revisiting Next Sentence Prediction for Linear Text Segmentation

[28] Self-Explaining Hate Speech Detection with Moral Rationales

[29] CALM: Culturally Self-Aware Language Models

[30] Submodular Evaluation Subset Selection in Automatic Prompt Optimization

[31] Beyond Perplexity: A Lightweight Benchmark for Knowledge Retention in Supervised Fine-Tuning

[32] Reasoning Pattern Alignment Merging for Adaptive Reasoning

[33] IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation

[34] Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents

[35] PALM-Bench: A Comprehensive Benchmark for Personalized Audio-Language Models

[36] Persona-aware and Explainable Bikeability Assessment: A Vision-Language Model Approach

[37] DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing

[38] Layer-Order Inversion: Rethinking Latent Multi-Hop Reasoning in Large Language Models

[39] EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory

[40] Value-Action Alignment in Large Language Models under Privacy-Prosocial Conflict

[41] Evaluating LLMs for Police Decision-Making: A Framework Based on Police Action Scenarios

[42] DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs

[43] How Do Large Language Models Learn Concepts During Continual Pre-Training?

[44] PsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health Ethics

[45] OLA: Output Language Alignment in Code-Switched LLM Interactions

[46] From Chains to Graphs: Self-Structured Reasoning for General-Domain LLMs

[47] DiVA: Fine-grained Factuality Verification with Agentic-Discriminative Verifier

[48] Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation

[49] Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines

[50] Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases

[51] Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning

[52] LLM-MC-Affect: LLM-Based Monte Carlo Modeling of Affective Trajectories and Latent Ambiguity for Interpersonal Dynamic Insight

[53] ELO: Efficient Layer-Specific Optimization for Continual Pretraining of Multilingual LLMs

[54] SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation

[55] e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings

[56] eTracer: Towards Traceable Text Generation via Claim-Level Grounding

[57] DisastQA: A Comprehensive Benchmark for Evaluating Question Answering in Disaster Management

[58] NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models

[59] Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis

[60] From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs

[61] Evaluation Framework for AI Creativity: A Case Study Based on Story Generation

[62] RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

[63] ADEPT: Adaptive Dynamic Early-Exit Process for Transformers

[64] AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions

[65] Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR

[66] MIND: From Passive Mimicry to Active Reasoning through Capability-Aware Multi-Perspective CoT Distillation

[67] Stuttering-Aware Automatic Speech Recognition for Indonesian Language

[68] O-Researcher: An Open Ended Deep Research Model via Multi-Agent Distillation and Agentic RL

[69] Whose Facts Win? LLM Source Preferences under Knowledge Conflicts

[70] Evaluation of Multilingual LLMs Personalized Text Generation Capabilities Targeting Groups and Social-Media Platforms

[71] Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations

[72] Tracing the complexity profiles of different linguistic phenomena through the intrinsic dimension of LLM representations

[73] HearSay Benchmark: Do Audio LLMs Leak What They Hear?

[74] Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents

[75] InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

[76] Compact Example-Based Explanations for Language Models

[77] NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

[78] Do LLMs Really Memorize Personally Identifiable Information? Revisiting PII Leakage with a Cue-Controlled Memorization Framework