2025 04 08

A Unified Virtual Mixture-of-Experts Framework: Enhanced Inference and Hallucination Mitigation in Single-Model System

Mingyan Liu

Task: Propose a unified virtual Mixture-of-Experts (MoE) fusion strategy that improves the inference performance of a small model (Qwen 1.5 0.5B) and reduces hallucinations.

Motivation: Generative models (such as GPT and BERT) perform well on tasks like text generation and summarization, but hallucinations (non-factual or misleading output) in small models limit their practical use.

Details

Method: Guide the model with multiple domain-specific expert prompts, combined with a statistical outlier truncation strategy and noise injection in the embedding space, using a fixed voting mechanism rather than a dynamic gating network.

Result: On dialogue generation tasks, the method significantly improves the inference accuracy and robustness of small models.

Conclusion: By reducing output variance and suppressing hallucinations, the method offers an effective route to optimizing small models; the authors also discuss the potential of dynamic expert weight allocation.

Abstract: Generative models, such as GPT and BERT, have significantly improved performance in tasks like text generation and summarization. However, hallucinations (where models generate non-factual or misleading content) are especially problematic in smaller-scale architectures, limiting their real-world applicability. In this paper, we propose a unified Virtual Mixture-of-Experts (MoE) fusion strategy that enhances inference performance and mitigates hallucinations in a single Qwen 1.5 0.5B model without increasing the parameter count. Our method leverages multiple domain-specific expert prompts (with the number of experts being adjustable) to guide the model from different perspectives. We apply a statistical outlier truncation strategy based on the mean and standard deviation to filter out abnormally high probability predictions, and we inject noise into the embedding space to promote output diversity. To clearly assess the contribution of each module, we adopt a fixed voting mechanism rather than a dynamic gating network, thereby avoiding additional confounding factors. We provide detailed theoretical derivations from both statistical and ensemble learning perspectives to demonstrate how our method reduces output variance and suppresses hallucinations. Extensive ablation experiments on dialogue generation tasks show that our approach significantly improves inference accuracy and robustness in small models. Additionally, we discuss methods for evaluating the orthogonality of virtual experts and outline the potential for future work involving dynamic expert weight allocation using gating networks.
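The truncation and fixed-voting steps named in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the clip-then-renormalize rule, the threshold multiplier `k`, and the function names `truncate_outliers` and `fixed_vote` are assumptions made here for illustration, and the toy distributions are invented.

```python
import statistics

def truncate_outliers(probs, k=1.0):
    """Clip probabilities above mean + k * std, then renormalize (assumed rule)."""
    mean = statistics.fmean(probs)
    std = statistics.pstdev(probs)
    cap = mean + k * std
    clipped = [min(p, cap) for p in probs]
    total = sum(clipped)
    return [p / total for p in clipped]

def fixed_vote(expert_probs, k=1.0):
    """Average the truncated per-expert distributions with equal, fixed weights."""
    truncated = [truncate_outliers(p, k) for p in expert_probs]
    n_experts = len(truncated)
    vocab_size = len(truncated[0])
    return [sum(t[i] for t in truncated) / n_experts for i in range(vocab_size)]

# Toy next-token distributions from three hypothetical expert prompts.
experts = [
    [0.90, 0.04, 0.03, 0.03],  # one expert is abnormally confident in token 0
    [0.30, 0.40, 0.20, 0.10],
    [0.25, 0.45, 0.20, 0.10],
]
fused = fixed_vote(experts, k=1.0)
print([round(p, 3) for p in fused])
```

Equal weights keep the fusion step free of learned parameters, which matches the abstract's stated reason for avoiding a gating network: the effect of each module can then be ablated in isolation.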

Do "New Snow Tablets" Contain Snow? Large Language Models Over-Rely on Names to Identify Ingredients of Chinese Drugs

Sifan Li,Yujun Cai,Bryan Hooi,Nanyun Peng,Yiwei Wang

Task: Evaluate how general-purpose and TCM-specialized large language models perform at identifying the ingredients of Chinese drugs.

Motivation: Traditional Chinese Medicine is increasingly used in healthcare, but existing large language models show clear deficiencies in identifying drug ingredients, which undermines clinical reliability.

Details

Method: A Retrieval Augmented Generation (RAG) approach focused on ingredient names.

Result: Experiments show the method raises accuracy on the ingredient verification task from about 50% to 82%.

Conclusion: The study exposes key weaknesses of current TCM-specific large language models and proposes a practical way to improve their clinical reliability.

Abstract: Traditional Chinese Medicine (TCM) has seen increasing adoption in healthcare, with specialized Large Language Models (LLMs) emerging to support clinical applications. A fundamental requirement for these models is accurate identification of TCM drug ingredients. In this paper, we evaluate how general and TCM-specialized LLMs perform when identifying ingredients of Chinese drugs. Our systematic analysis reveals consistent failure patterns: models often interpret drug names literally, overuse common herbs regardless of relevance, and exhibit erratic behaviors when faced with unfamiliar formulations. LLMs also fail to understand the verification task. These findings demonstrate that current LLMs rely primarily on drug names rather than possessing systematic pharmacological knowledge. To address these limitations, we propose a Retrieval Augmented Generation (RAG) approach focused on ingredient names. Experiments across 220 TCM formulations show our method significantly improves accuracy from approximately 50% to 82% in ingredient verification tasks. Our work highlights critical weaknesses in current TCM-specific LLMs and offers a practical solution for enhancing their clinical reliability.

Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

Gonçalo Faria,Noah A. Smith

Task: Propose QAlign, a new test-time alignment method that improves language model performance at test time.

Motivation: Existing test-time search methods degrade as compute scales because they over-optimize imperfect reward proxies.

Details

Method: Use Markov chain Monte Carlo for text generation, without modifying the underlying model or requiring logit access.

Result: QAlign outperforms existing methods on mathematical reasoning benchmarks and on a diverse range of datasets.

Conclusion: QAlign is a practical solution that improves off-the-shelf language models without further training.

Abstract: Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical solution to aligning language models at test time using additional computation without degradation, our approach expands the limits of the capability that can be obtained from off-the-shelf language models without further training.
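To make the "sample, don't search" idea concrete, here is a toy Metropolis-Hastings sketch. It is a simplification, not QAlign itself: it assumes the aligned target has the standard KL-regularized form π(y) ∝ p(y)·exp(r(y)/β), it proposes whole responses independently from the base model rather than using the text-editing proposals the paper builds on, and the candidate set, base probabilities, and rewards are invented. With independent proposals drawn from p, the acceptance ratio reduces to exp((r(y') − r(y))/β), so only sampling from the base model is needed and no logits are read.

```python
import math
import random

def mh_align(candidates, base_probs, reward, beta=1.0, steps=20000, seed=0):
    """Independence Metropolis-Hastings targeting p(y) * exp(reward(y) / beta)."""
    rng = random.Random(seed)
    current = rng.choices(candidates, weights=base_probs)[0]
    counts = {c: 0 for c in candidates}
    for _ in range(steps):
        # Proposal: a fresh sample from the base model (stand-in for an LM call).
        proposal = rng.choices(candidates, weights=base_probs)[0]
        log_ratio = (reward(proposal) - reward(current)) / beta
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        counts[current] += 1
    return {c: n / steps for c, n in counts.items()}

# Hypothetical toy problem: three candidate responses.
rewards = {"A": 0.0, "B": 1.0, "C": 2.0}
freqs = mh_align(["A", "B", "C"], [0.6, 0.3, 0.1], rewards.get)
print(freqs)
```

Unlike best-of-n, which pushes ever harder toward the reward model's maximum as n grows, the chain settles on a distribution that still weights responses by the base model, which is the behavior the abstract contrasts with over-optimization.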

Entropy-Based Block Pruning for Efficient Large Language Models

Liangwei Yang,Yuhui Xu,Juntao Tan,Doyen Sahoo,Silvio Savarese,Caiming Xiong,Huan Wang,Shelby Heinecke

Task: Investigate redundancy in Transformer-based models and propose an entropy-based pruning strategy to improve efficiency.

Motivation: The growing size and storage demands of large language models make real-world deployment difficult, calling for more efficient model compression methods.

Details

Method: Propose an entropy-based pruning strategy that uses the trend in the entropy of hidden representations as a measure of information richness.

Result: Experiments show that entropy-based pruning preserves accuracy while reducing model size and outperforms cosine similarity-based methods.

Conclusion: Entropy-based pruning is a promising direction for efficient model deployment.

Abstract: As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.

Semi-supervised learning for marine anomaly detection on board satellites

Luca Marini

Task: Apply a semi-supervised learning method (FixMatch) to semantic segmentation in order to detect marine anomalies.

Motivation: Marine anomalies (such as marine debris, harmful algal blooms, and illegal fishing) seriously threaten ecosystems, but deep learning models need large amounts of labeled data, which is costly and time-consuming to obtain. Semi-supervised learning can exploit unlabeled data and lower that cost.

Details

Method: Use the FixMatch algorithm for semantic segmentation with a U-Net architecture, and compare semi-supervised and fully-supervised models under varying amounts of labeled data.

Result: Semi-supervised models outperform fully-supervised models when labeled data is limited, while fully-supervised models are slightly better when labeled data is plentiful.

Conclusion: Semi-supervised learning shows promise for marine anomaly detection, especially when labeled data is limited.

Abstract: Aquatic bodies face numerous environmental threats caused by several marine anomalies. Marine debris can devastate habitats and endanger marine life through entanglement, while harmful algal blooms can produce toxins that negatively affect marine ecosystems. Additionally, ships may discharge oil or engage in illegal fishing and overfishing, causing further harm. These marine anomalies can be identified by applying trained deep learning (DL) models on multispectral satellite imagery. Furthermore, the detection of other anomalies, such as clouds, could be beneficial in filtering out irrelevant images. However, DL models often require a large volume of labeled data for training, which can be both costly and time-consuming, particularly for marine anomaly detection where expert annotation is needed. A potential solution is the use of semi-supervised learning methods, which can also utilize unlabeled data. In this project, we implement and study the performance of FixMatch for Semantic Segmentation, a semi-supervised algorithm for semantic segmentation. Firstly, we found that semi-supervised models perform best with a high confidence threshold of 0.9 when there is a limited amount of labeled data. Secondly, we compare the performance of semi-supervised models with fully-supervised models under varying amounts of labeled data. Our findings suggest that semi-supervised models outperform fully-supervised models with limited labeled data, while fully-supervised models have a slightly better performance with larger volumes of labeled data. We propose two hypotheses to explain why fully-supervised models surpass semi-supervised ones when a high volume of labeled data is used. All of our experiments were conducted using a U-Net model architecture with a limited number of parameters to ensure compatibility with space-rated hardware.
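The role of the 0.9 confidence threshold mentioned above can be sketched as follows. This is a minimal pure-Python illustration of the generic FixMatch unlabeled loss applied per pixel, not the project's code; the function names and the toy probabilities are assumptions. Pseudo-labels come from the prediction on a weakly augmented view, and only pixels whose top probability reaches the threshold contribute to the cross-entropy on the strongly augmented view.

```python
import math

def pseudo_label_mask(weak_probs, threshold=0.9):
    """Per-pixel argmax pseudo-labels and a keep-mask from the weak-view prediction."""
    labels, mask = [], []
    for pixel in weak_probs:
        confidence = max(pixel)
        labels.append(pixel.index(confidence))
        mask.append(confidence >= threshold)
    return labels, mask

def masked_cross_entropy(strong_probs, labels, mask):
    """Cross-entropy on the strong view, counting only confident pixels.

    The sum is divided by the total number of pixels (masked ones included),
    as in the original FixMatch formulation.
    """
    if not mask:
        return 0.0
    kept = [-math.log(p[y]) for p, y, keep in zip(strong_probs, labels, mask) if keep]
    return sum(kept) / len(mask)

# Three pixels, two classes (hypothetical values).
weak = [[0.95, 0.05], [0.60, 0.40], [0.08, 0.92]]
strong = [[0.80, 0.20], [0.50, 0.50], [0.30, 0.70]]
labels, mask = pseudo_label_mask(weak, threshold=0.9)
loss = masked_cross_entropy(strong, labels, mask)
print(labels, mask, round(loss, 4))
```

A high threshold discards uncertain pixels such as the second one here, trading fewer training signals for cleaner pseudo-labels, which is consistent with the finding that 0.9 works best when labeled data is scarce.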

What Large Language Models Do Not Talk About: An Empirical Study of Moderation and Censorship Practices

Sander Noels,Guillaume Bied,Maarten Buyl,Alexander Rogiers,Yousra Fettach,Jefrey Lijffijt,Tijl De Bie

Task: Study the content moderation practices of large language models (LLMs) on political topics, distinguishing hard censorship from soft censorship.

Motivation: LLMs are increasingly deployed as gateways to information, yet their content moderation practices remain underexplored.

Details

Method: Analyze the responses of 14 state-of-the-art models from Western countries, China, and Russia in all six official United Nations languages, distinguishing hard censorship from soft censorship.

Result: Censorship is widespread but is mainly tailored to the LLM provider's domestic audience, and it appears as either hard or soft censorship (rarely both at once).

Conclusion: Publicly available LLMs need ideological and geographic diversity, along with more transparent moderation strategies, to support informed user choice.

Abstract: Large Language Models (LLMs) are increasingly deployed as gateways to information, yet their content moderation practices remain underexplored. This work investigates the extent to which LLMs refuse to answer or omit information when prompted on political topics. To do so, we distinguish between hard censorship (i.e., generated refusals, error messages, or canned denial responses) and soft censorship (i.e., selective omission or downplaying of key elements), which we identify in LLMs' responses when asked to provide information on a broad range of political figures. Our analysis covers 14 state-of-the-art models from Western countries, China, and Russia, prompted in all six official United Nations (UN) languages. Our analysis suggests that although censorship is observed across the board, it is predominantly tailored to an LLM provider's domestic audience and typically manifests as either hard censorship or soft censorship (though rarely both concurrently). These findings underscore the need for ideological and geographic diversity among publicly available LLMs, and greater transparency in LLM moderation strategies to facilitate informed user choices. All data are made freely available.

Scalable heliostat surface predictions from focal spots: Sim-to-Real transfer of inverse Deep Learning Raytracing

Jan Lewen,Max Pargmann,Jenia Jitsev,Mehdi Cherti,Robert Pitz-Paal,Daniel Maldonado Quinto

Task: Use inverse Deep Learning Raytracing (iDLR) to infer heliostat surface profiles from real target images, improving the efficiency and safety of Concentrating Solar Power (CSP) systems.

Motivation: Heliostat surface imperfections affect the performance and safety of CSP systems, but they are hard to measure in real deployments, so control systems assume ideal surfaces and performance suffers.

Details

Method: Apply inverse Deep Learning Raytracing (iDLR), which infers heliostat surface profiles from target images recorded during standard calibration, and demonstrate the first Sim-to-Real transfer of the method.

Result: Validated on 63 heliostats, iDLR reaches a median mean absolute error of 0.17 mm and agrees well with the ground truth in 84% of cases; in raytracing simulations it yields flux density predictions with a mean accuracy of 90%, 26% better than the ideal-surface assumption.

Conclusion: iDLR is a scalable, automated, and cost-effective way to integrate realistic heliostat surface models into digital twins, improving the efficiency and safety of future CSP systems.

Abstract: Concentrating Solar Power (CSP) plants are a key technology in the transition toward sustainable energy. A critical factor for their safe and efficient operation is the distribution of concentrated solar flux on the receiver. However, flux distributions from individual heliostats are sensitive to surface imperfections. Measuring these surfaces across many heliostats remains impractical in real-world deployments. As a result, control systems often assume idealized heliostat surfaces, leading to suboptimal performance and potential safety risks. To address this, inverse Deep Learning Raytracing (iDLR) has been introduced as a novel method for inferring heliostat surface profiles from target images recorded during standard calibration procedures. In this work, we present the first successful Sim-to-Real transfer of iDLR, enabling accurate surface predictions directly from real-world target images. We evaluate our method on 63 heliostats under real operational conditions. iDLR surface predictions achieve a median mean absolute error (MAE) of 0.17 mm and show good agreement with deflectometry ground truth in 84% of cases. When used in raytracing simulations, it enables flux density predictions with a mean accuracy of 90% compared to deflectometry over our dataset, and outperforms the commonly used ideal heliostat surface assumption by 26%. We tested this approach in a challenging double-extrapolation scenario (involving unseen sun positions and receiver projection) and found that iDLR maintains high predictive accuracy, highlighting its generalization capabilities. Our results demonstrate that iDLR is a scalable, automated, and cost-effective solution for integrating realistic heliostat surface models into digital twins. This opens the door to improved flux control, more precise performance modeling, and ultimately, enhanced efficiency and safety in future CSP plants.

Do LLM Evaluators Prefer Themselves for a Reason?

Wei-Lin Chen,Zhepei Wei,Xinyu Zhu,Shi Feng,Yu Meng

Task: Study the self-preference bias of large language models (LLMs) used as automatic evaluators, and its consequences.

Motivation: Determine whether self-preference harms evaluation results or merely reflects the objectively better outputs of more capable models.

Details

Method: Run large-scale experiments on verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) to separate harmful from legitimate self-preference.

Result: Better generators are also better evaluators; harmful self-preference is more pronounced when the evaluator model performs poorly; inference-time strategies reduce the harmful bias.

Conclusion: The work gives a more nuanced understanding of LLM-based evaluation and practical advice for improving its reliability.

Abstract: Large language models (LLMs) are increasingly used as automatic evaluators in applications such as benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses, a tendency often intensifying with model size and capability. This raises a critical question: Is self-preference detrimental, or does it simply reflect objectively superior outputs from more capable models? Disentangling these has been challenging due to the usage of subjective tasks in previous studies. To address this, we investigate self-preference using verifiable benchmarks (mathematical reasoning, factual knowledge, code generation) that allow objective ground-truth assessment. This enables us to distinguish harmful self-preference (favoring objectively worse responses) from legitimate self-preference (favoring genuinely superior ones). We conduct large-scale experiments under controlled evaluation conditions across diverse model families (e.g., Llama, Qwen, Gemma, Mistral, Phi, GPT, DeepSeek). Our findings reveal three key insights: (1) Better generators are better judges: LLM evaluators' accuracy strongly correlates with their task performance, and much of the self-preference in capable models is legitimate. (2) Harmful self-preference persists, particularly when evaluator models perform poorly as generators on specific task instances. Stronger models exhibit more pronounced harmful bias when they err, though such incorrect generations are less frequent. (3) Inference-time scaling strategies, such as generating a long Chain-of-Thought before evaluation, effectively reduce the harmful self-preference. These results provide a more nuanced understanding of LLM-based evaluation and practical insights for improving its reliability.
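The distinction between harmful and legitimate self-preference drawn in the abstract above can be operationalized once ground truth is available. The sketch below is one possible reading, not the paper's metric definitions: the record fields (`chose_own`, `own_correct`, `other_correct`) and the function name are hypothetical, and ties where both responses are correct or both are wrong are simply ignored.

```python
def self_preference_rates(records):
    """Split self-preference by whether the evaluator's own response was objectively better.

    legitimate: share of cases where the own response is correct and the other
                is wrong, in which the evaluator picked its own response.
    harmful:    share of cases where the own response is wrong and the other
                is correct, in which the evaluator still picked its own response.
    """
    own_better = [r for r in records if r["own_correct"] and not r["other_correct"]]
    own_worse = [r for r in records if not r["own_correct"] and r["other_correct"]]

    def rate(rows):
        return sum(r["chose_own"] for r in rows) / len(rows) if rows else float("nan")

    return {"legitimate": rate(own_better), "harmful": rate(own_worse)}

# Hypothetical pairwise judgments with verifiable correctness labels.
records = [
    {"chose_own": True, "own_correct": True, "other_correct": False},
    {"chose_own": True, "own_correct": True, "other_correct": False},
    {"chose_own": True, "own_correct": False, "other_correct": True},
    {"chose_own": False, "own_correct": False, "other_correct": True},
]
print(self_preference_rates(records))
```

Conditioning on which response is actually correct is what verifiable benchmarks make possible; on subjective tasks the two rates cannot be separated.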

CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward

Zhiqiang Wang,Pengbin Feng,Yanbin Lin,Shuzhang Cai,Zongao Bian,Jinghua Yan,Xingquan Zhu

Task: Propose Fuzzy Group Relative Policy Reward (FGRPR), a new framework that combines Group Relative Policy Optimization (GRPO) with a fuzzy reward function to improve learning efficiency.

Motivation: The conventional binary 0/1 accuracy reward is too coarse to encourage more precise outputs, so a more flexible reward mechanism is needed.

Details

Method: Combine GRPO with a fuzzy reward function that provides more nuanced incentives to optimize model outputs.

Result: FGRPR surpasses the baseline models (including GPT4o, LLaMA2, and SFT) across several datasets, and performs especially well when target values are large.

Conclusion: FGRPR suits tasks where the precision of the answer matters, and its fuzzy reward mechanism significantly improves model performance.

Abstract: We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms compared to supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL (3B and 7B), surpasses all baseline models, including GPT4o, LLaMA2 (90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
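Why a graded reward helps inside GRPO can be seen in a few lines. The sketch below is illustrative only: the abstract does not give the functional form of the fuzzy reward, so the relative-error form in `fuzzy_reward` is an assumption, as are the function names and toy counts. The group-relative advantage (reward minus group mean, divided by group standard deviation) is the standard GRPO normalization.

```python
import statistics

def binary_reward(pred, target):
    """Conventional 0/1 accuracy reward: exact match only."""
    return 1.0 if pred == target else 0.0

def fuzzy_reward(pred, target):
    """Assumed graded reward: 1 minus the relative counting error, floored at 0."""
    return max(0.0, 1.0 - abs(pred - target) / max(target, 1))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One group of sampled crowd counts for an image whose true count is 100.
target = 100
preds = [90, 105, 140]
binary_adv = group_relative_advantages([binary_reward(p, target) for p in preds])
fuzzy_adv = group_relative_advantages([fuzzy_reward(p, target) for p in preds])
print(binary_adv)  # every sample missed exactly, so no sample is preferred
print(fuzzy_adv)   # closer counts receive larger advantages
```

With large target values an exact match is rare, so a 0/1 reward often gives the whole group the same score and therefore no learning signal; a graded reward still ranks the samples, which fits the abstract's observation that the method excels when targets are larger.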

Abhilekh Borah,Hasnat Md Abdullah,Kangda Wei,Ruihong Huang

Task: Evaluate the ability of large language models (LLMs) to understand and process climate-related content.

Motivation: Climate issues spread widely on social media, but tools for analyzing their multimodal expression are lacking, and it remains unclear whether LLMs amplify credible solutions or spread unsubstantiated claims.

Details

Method: Introduce the CliME dataset and the Climate Alignment Quotient (CAQ) metric, which scores LLM performance along five dimensions, and analyze the results systematically through three lenses (Actionability, Criticality, and Justice).

Result: Most LLMs do well on Criticality and Justice but poorly on Actionability; Claude 3.7 Sonnet performs best.

Conclusion: The CliME dataset and the CAQ metric provide new tools for evaluating LLMs on climate-related content and reveal a weakness in actionability.

Abstract: The rise of Large Language Models (LLMs) has raised questions about their ability to understand climate-related contexts. Though climate change dominates social media, analyzing its multimodal expressions is understudied, and current tools have failed to determine whether LLMs amplify credible solutions or spread unsubstantiated claims. To address this, we introduce CliME (Climate Change Multimodal Evaluation), a first-of-its-kind multimodal dataset, comprising 2579 Twitter and Reddit posts. The benchmark features a diverse collection of humorous memes and skeptical posts, capturing how these formats distill complex issues into viral narratives that shape public opinion and policy discussions. To systematically evaluate LLM performance, we present the Climate Alignment Quotient (CAQ), a novel metric comprising five distinct dimensions: Articulation, Evidence, Resonance, Transition, and Specificity. Additionally, we propose three analytical lenses: Actionability, Criticality, and Justice, to guide the assessment of LLM-generated climate discourse using CAQ. Our findings, based on the CAQ metric, indicate that while most evaluated LLMs perform relatively well in Criticality and Justice, they consistently underperform on the Actionability axis. Among the models evaluated, Claude 3.7 Sonnet achieves the highest overall performance. We publicly release our CliME dataset and code to foster further research in this domain.

From Keypoints to Realism: A Realistic and Accurate Virtual Try-on Network from 2D Images

Maliheh Toozandehjani,Ali Mousavi,Reza Taheri

Task: Generate realistic images of people wearing a target garment while accurately preserving pose, body shape, and garment characteristics.

Motivation: Existing methods struggle to reproduce garment details and generalize poorly.

Details

Method: Remove the initial garment completely, warp the target garment precisely using predicted keypoints, and generate a high-quality image with alignment-aware segment normalization.

Result: The method reconstructs the precise characteristics of the garment, including shape and texture, and adapts better to different poses.

Conclusion: The method performs well at preserving garment characteristics and at generalization.

Abstract: The aim of image-based virtual try-on is to generate realistic images of individuals wearing target garments, ensuring that the pose, body shape and characteristics of the target garment are accurately preserved. Existing methods often fail to reproduce the fine details of target garments effectively and lack generalizability to new scenarios. In the proposed method, the person's initial garment is completely removed. Subsequently, a precise warping is performed using the predicted keypoints to fully align the target garment with the body structure and pose of the individual. Based on the warped garment, a body segmentation map is more accurately predicted. Then, using an alignment-aware segment normalization, the misaligned areas between the warped garment and the predicted garment region in the segmentation map are removed. Finally, the generator produces the final image with high visual quality, reconstructing the precise characteristics of the target garment, including its overall shape and texture. This approach emphasizes preserving garment characteristics and improving adaptability to various poses, providing better generalization for diverse applications.

Adaptation of Large Language Models

Zixuan Ke,Yifei Ming,Shafiq Joty

Task: Give an overview of dynamic, domain-specific, and task-adaptive LLM adaptation techniques.

Motivation: General-purpose LLMs perform poorly in specialized domains and their static nature limits their ability to adapt, so adaptation techniques are needed.

Details

Method: Divide adaptation techniques into parametric knowledge adaptation (updating the LLM's internal parameters) and semi-parametric knowledge adaptation (leveraging external knowledge or tools).

Result: An overview of the taxonomy and applications of LLM adaptation techniques, including real-time adaptation and semi-parametric methods.

Conclusion: LLM adaptation matters to both industry and academia and can improve model performance and usefulness in specialized domains.

Abstract: This tutorial on adaptation of LLMs is designed to address the growing demand for models that go beyond the static capabilities of generic LLMs by providing an overview of dynamic, domain-specific, and task-adaptive LLM adaptation techniques. While general LLMs have demonstrated strong generalization across a variety of tasks, they often struggle to perform well in specialized domains such as finance, healthcare, and code generation for underrepresented languages. Additionally, their static nature limits their ability to evolve with the changing world, and they are often extremely large in size, making them impractical and costly to deploy at scale. As a result, the adaptation of LLMs has drawn much attention since the birth of LLMs and is of core importance, both for industry, which focuses on serving its targeted users, and academia, which can greatly benefit from small but powerful LLMs. To address this gap, this tutorial aims to provide an overview of the LLM adaptation techniques. We start with an introduction to LLM adaptation, from both the data perspective and the model perspective. We then emphasize how the evaluation metrics and benchmarks are different from other techniques. After establishing the problems, we explore various adaptation techniques. We categorize adaptation techniques into two main families. The first is parametric knowledge adaptation, which focuses on updating the parametric knowledge within LLMs. Additionally, we will discuss real-time adaptation techniques, including model editing, which allows LLMs to be updated dynamically in production environments. The second kind of adaptation is semi-parametric knowledge adaptation, where the goal is to update LLM parameters to better leverage external knowledge or tools through techniques like retrieval-augmented generation (RAG) and agent-based systems.

A Hybrid Wavelet-Fourier Method for Next-Generation Conditional Diffusion Models

Andrew Kiruluta,Andreas Lemos

Task: Propose Wavelet-Fourier-Diffusion, a novel generative modeling framework for synthesizing high-quality, high-fidelity images.

Motivation: Conventional diffusion models rely only on additive noise in pixel space and cannot capture global structure and fine-grained features effectively.

Details

Method: A multi-transform strategy that combines wavelet sub-band decomposition with partial Fourier steps, progressively degrading and then reconstructing images in a hybrid spectral domain.

Result: Competitive or superior performance relative to baseline diffusion models and GANs on CIFAR-10, CelebA-HQ, and a conditional ImageNet subset.

Conclusion: Hybrid frequency representations open new directions for multi-scale generative modeling.

Abstract: We present a novel generative modeling framework, Wavelet-Fourier-Diffusion, which adapts the diffusion paradigm to hybrid frequency representations in order to synthesize high-quality, high-fidelity images with improved spatial localization. In contrast to conventional diffusion models that rely exclusively on additive noise in pixel space, our approach leverages a multi-transform that combines wavelet sub-band decomposition with partial Fourier steps. This strategy progressively degrades and then reconstructs images in a hybrid spectral domain during the forward and reverse diffusion processes. By supplementing traditional Fourier-based analysis with the spatial localization capabilities of wavelets, our model can capture both global structures and fine-grained features more effectively. We further extend the approach to conditional image generation by integrating embeddings or conditional features via cross-attention. Experimental evaluations on CIFAR-10, CelebA-HQ, and a conditional ImageNet subset illustrate that our method achieves competitive or superior performance relative to baseline diffusion models and state-of-the-art GANs, as measured by Fréchet Inception Distance (FID) and Inception Score (IS). We also show how the hybrid frequency-based representation improves control over global coherence and fine texture synthesis, paving the way for new directions in multi-scale generative modeling.

YaleNLP @ PerAnsSumm 2025: Multi-Perspective Integration via Mixture-of-Agents for Enhanced Healthcare QA Summarization

Dongsuk Jang,Alan Li,Arman Cohan

Task: Address automated summarization of healthcare community question-answering forums by identifying the perspectives in different answers and generating a comprehensive answer.

Motivation: The diversity of answers in healthcare community QA forums makes automated summarization difficult, which is why the PerAnsSumm Shared Task was proposed.

Details

Method: Two complementary approaches: (i) a training-based approach, QLoRA fine-tuning of LLaMA-3.3-70B-Instruct; (ii) agentic approaches, including zero-shot and few-shot prompting (with LLaMA-3.3-70B-Instruct and GPT-4o) and a Mixture-of-Agents (MoA) framework.

Result: For perspective identification/classification, GPT-4o scores 0.57, well above the LLaMA baseline (0.40); the MoA framework raises LLaMA performance by 28% to 0.51. For perspective-based summarization, GPT-4o zero-shot scores 0.42 versus 0.28 for LLaMA, and MoA raises LLaMA performance by 32% to 0.37. Few-shot prompting is more effective for LLaMA models.

Conclusion: The YaleNLP team's approach ranked second in the shared task, demonstrating its effectiveness.

Abstract: Automated summarization of healthcare community question-answering forums is challenging due to diverse perspectives presented across multiple user responses to each question. The PerAnsSumm Shared Task was therefore proposed to tackle this challenge by identifying perspectives from different answers and then generating a comprehensive answer to the question. In this study, we address the PerAnsSumm Shared Task using two complementary paradigms: (i) a training-based approach through QLoRA fine-tuning of LLaMA-3.3-70B-Instruct, and (ii) agentic approaches including zero- and few-shot prompting with frontier LLMs (LLaMA-3.3-70B-Instruct and GPT-4o) and a Mixture-of-Agents (MoA) framework that leverages a diverse set of LLMs by combining outputs from multi-layer feedback aggregation. For perspective span identification/classification, GPT-4o zero-shot achieves an overall score of 0.57, substantially outperforming the 0.40 score of the LLaMA baseline. With a 2-layer MoA configuration, we were able to improve LLaMA performance by 28 percent to 0.51. For perspective-based summarization, GPT-4o zero-shot attains an overall score of 0.42 compared to 0.28 for the best LLaMA zero-shot, and our 2-layer MoA approach boosts LLaMA performance by 32 percent to 0.37. Furthermore, in the few-shot setting, our results show that sentence-transformer embedding-based exemplar selection provides more gain than manually selected exemplars on LLaMA models, although few-shot prompting is not always helpful for GPT-4o. The YaleNLP team's approach ranked second overall in the shared task.
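The reported percentage gains can be checked against the scores quoted above. The snippet below is a plain arithmetic check, with the helper name `relative_gain` chosen here for illustration; it assumes the percentages are relative improvements over the LLaMA baseline score and that the published scores are rounded to two decimals.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0

# Scores as quoted in the abstract (rounded to two decimals there).
span_gain = relative_gain(0.40, 0.51)     # perspective span identification
summary_gain = relative_gain(0.28, 0.37)  # perspective-based summarization
print(f"{span_gain:.1f}% {summary_gain:.1f}%")
```

From the rounded scores the gains come out at about 27.5% and 32.1%, consistent with the reported 28 and 32 percent once rounding of the underlying scores is taken into account.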

Detection Limits and Statistical Separability of Tree Ring Watermarks in Rectified Flow-based Text-to-Image Generation Models

Ved Umrajkar,Aakash Kumar Singh

Task: Evaluate and compare the detection and separability of Tree-Ring watermarks in the SD 2.1 and FLUX.1-dev models.

Motivation: Explore how effective Tree-Ring Watermarking is in rectified flow-based models, given the inherent difficulty these models have with noise latent inversion.

Details

Method: Through extensive experiments, analyze how different text guidance configurations and augmentation attacks affect watermark recovery and statistical separation.

Result: The study reveals the limitations of Tree-Ring watermarks in current SOTA models and stresses the need for better inversion methods to achieve reliable watermark detection and separability.

Conclusion: Current Tree-Ring Watermarking has limitations, and improved inversion methods are needed to make it reliable.

Abstract: Tree-Ring Watermarking is a significant technique for authenticating AI-generated images. However, its effectiveness in rectified flow-based models remains unexplored, particularly given the inherent challenges of these models with noise latent inversion. Through extensive experimentation, we evaluated and compared the detection and separability of watermarks between SD 2.1 and FLUX.1-dev models. By analyzing various text guidance configurations and augmentation attacks, we demonstrate how inversion limitations affect both watermark recovery and the statistical separation between watermarked and unwatermarked images. Our findings provide valuable insights into the current limitations of Tree-Ring Watermarking in the current SOTA models and highlight the critical need for improved inversion methods to achieve reliable watermark detection and separability. The official implementation, dataset release and all experimental results are available at https://github.com/dsgiitr/flux-watermarking.

Language Models Are Implicitly Continuous

Samuele Marro,Davide Evangelista,X. Angelo Huang,Emanuele La Malfa,Michele Lombardi,Michael Wooldridge

Task: Study how Transformer-based language models implicitly represent sentences as continuous-time functions.

Motivation: Examine how neural networks (Transformers in particular) behave as continuous, smooth function approximators when processing language, in contrast to the discrete sequences with which language is usually modelled.

Details

Method: Analyze mainstream large language models (such as Llama2 and Llama3) and formally extend Transformers to capture continuity in time and space.

Result: LLMs represent sentences as continuous functions, in a way that differs fundamentally from how humans process language.

Conclusion: The study challenges the traditional interpretation of how LLMs understand language and offers new linguistic and engineering insights.

Abstract: Language is typically modelled with discrete sequences. However, the most successful approaches to language modelling, namely neural networks, are continuous and smooth function approximators. In this work, we show that Transformer-based language models implicitly learn to represent sentences as continuous-time functions defined over a continuous input space. This phenomenon occurs in most state-of-the-art Large Language Models (LLMs), including Llama2, Llama3, Phi3, Gemma, Gemma2, and Mistral, and suggests that LLMs reason about language in ways that fundamentally differ from humans. Our work formally extends Transformers to capture the nuances of time and space continuity in both input and output space. Our results challenge the traditional interpretation of how LLMs understand language, with several linguistic and engineering implications.

Can ChatGPT Learn My Life From a Week of First-Person Video?

Keegan Harris

Task: Study what foundation models can learn about a wearer's personal life from first-person camera data.

Motivation: Prompted by recent advances in generative AI and wearable camera devices (such as smart glasses and AI-enabled pins), explore how well models can understand the wearer's personal life.

Details

Method: Wear a camera headset for 54 hours, generate summaries of various lengths (minute-long, hour-long, and day-long), and fine-tune GPT-4o and GPT-4o-mini on them.

Result: The models learn basic information about the wearer (such as age and gender); GPT-4o also infers place of residence, occupation, handedness, and pet ownership, but both models hallucinate, for example inventing names for people in the video.

Conclusion: Foundation models can learn personal information from first-person data, but hallucination must be addressed to improve accuracy.

Abstract: Motivated by recent improvements in generative AI and wearable camera devices (e.g. smart glasses and AI-enabled pins), I investigate the ability of foundation models to learn about the wearer's personal life through first-person camera data. To test this, I wore a camera headset for 54 hours over the course of a week, generated summaries of various lengths (e.g. minute-long, hour-long, and day-long summaries), and fine-tuned both GPT-4o and GPT-4o-mini on the resulting summary hierarchy. By querying the fine-tuned models, we are able to learn what the models learned about me. The results are mixed: Both models learned basic information about me (e.g. approximate age, gender). Moreover, GPT-4o correctly deduced that I live in Pittsburgh, am a PhD student at CMU, am right-handed, and have a pet cat. However, both models also suffered from hallucination and would make up names for the individuals present in the video footage of my life.

Clinical ModernBERT: An efficient and long context encoder for biomedical text

Simon A. Lee,Anthony Wu,Jeffrey N. Chiang

Task: Develop Clinical ModernBERT, a Transformer-based encoder pretrained specifically for the biomedical and clinical domains.

Motivation: Combine biomedical literature, clinical notes, and medical terminology to improve semantic representations for long-context tasks.

Details

Method: Build on the ModernBERT architecture, which features rotary positional embeddings (RoPE), Flash Attention, and a context length extended to 8,192 tokens, and pretrain on PubMed abstracts, MIMIC IV clinical data, and medical codes with their textual descriptions.

Result: The model performs well on clinical NLP benchmarks and produces semantically rich representations.

Conclusion: Clinical ModernBERT performs strongly on biomedical and clinical tasks, particularly long-context ones.

Abstract: We introduce Clinical ModernBERT, a transformer-based encoder pretrained on large-scale biomedical literature, clinical notes, and medical ontologies, incorporating PubMed abstracts, MIMIC IV clinical data, and medical codes with their textual descriptions. Building on ModernBERT, the current state-of-the-art natural language text encoder (featuring architectural upgrades such as rotary positional embeddings (RoPE), Flash Attention, and extended context length up to 8,192 tokens), our model adapts these innovations specifically for biomedical and clinical domains. Clinical ModernBERT excels at producing semantically rich representations tailored for long-context tasks. We validate this both by analyzing its pretrained weights and through empirical evaluation on a comprehensive suite of clinical NLP benchmarks.
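Rotary positional embeddings, one of the architectural features named above, can be sketched in a few lines of plain Python. This illustrates the general RoPE technique rather than ModernBERT's code: the interleaved pairing of dimensions, the base of 10000, and the function names are assumptions (real implementations differ in how they pair dimensions and choose the base).

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even embedding dimension")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Attention scores depend only on the relative offset between positions.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
near = dot(rope(q, 3), rope(k, 1))    # positions 3 and 1, offset 2
far = dot(rope(q, 12), rope(k, 10))   # positions 12 and 10, offset 2
print(abs(near - far) < 1e-9)
```

Because each rotation preserves vector norms and the query-key dot product depends only on the position difference, the same relative pattern is scored identically anywhere in the sequence, which is one reason RoPE is a common choice for long-context encoders.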

Control Map Distribution using Map Query Bank for Online Map Generation

Ziming Liu,Leichen Wang,Ge Yang,Xinrun Li,Xingtao Hu,Hao Sun,Guangyu Gao

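Task: Propose a map query bank (MQBank) method that generates a suitable initial map query distribution for each scenario, improving visual online map generation (OMG).

Motivation: Pre-built HD maps are costly, and the performance of existing query-based BEV Transformer models is limited by the quality and number of initial map queries.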

Details

Method: Decomposes the HD map distribution into point representations (MQBank) and introduces low-cost standard definition map (SD map) data as prior knowledge to build the initial query distribution. Result: Sets a new record on the OpenLaneV2 benchmark with 40.5% and 45.7% mAP on vehicle lane and pedestrian area. Conclusion: Combining MQBank with the SD map prior improves online map generation and offers a low-cost solution for practical use. Abstract: Reliable autonomous driving systems require high-definition (HD) maps that contain detailed map information for planning and navigation. However, pre-building HD maps is costly. Visual-based Online Map Generation (OMG) has become an alternative low-cost solution for building a local HD map. The query-based BEV Transformer has been a base model for this task. This model learns HD map predictions from an initial map query distribution, which is obtained by offline optimization on the training set. Besides the quality of the BEV feature, the performance of this model also relies heavily on the capacity of the initial map query distribution. However, this distribution is limited because of the limited number of queries. To make map predictions optimal on each test sample, it is essential to generate a suitable initial distribution for each specific scenario. This paper proposes to decompose the whole HD map distribution into a set of point representations, namely the map query bank (MQBank). To build specific initial map query distributions for different scenarios, low-cost standard definition map (SD map) data is introduced as a kind of prior knowledge. Moreover, each layer of the map decoder network learns instance-level map query features, which lose the detailed information of each point. However, the BEV feature map is a point-level dense feature. It is important to keep point-level information in map queries when interacting with the BEV feature map. This can also be solved with the map query bank method. Final experiments show a new insight on the SD map prior and a new record on the OpenLaneV2 benchmark with 40.5% and 45.7% mAP on vehicle lane and pedestrian area.

Structured Extraction of Process Structure Properties Relationships in Materials Science

Amit K Verma,Zhisong Zhang,Junwon Seo,Robin Kuo,Runbo Jiang,Emma Strubell,Anthony D Rollett

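Task: Propose a new annotation schema for extracting generic process-structure-properties relationships from scientific literature.

Motivation: Large language models (LLMs) show promise for materials discovery, but they need adaptation to handle materials-specific queries.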

Details

Method: Uses a conditional random field (CRF) model and a fine-tuned LLM (GPT-4o) for entity extraction, and evaluates both on datasets from two distinct domains. Result: The fine-tuned LLM outperforms the BERT-CRF model on Domain I, but once data from Domain II is added, the two perform comparably. Conclusion: The proposed annotation schema is effective, and the two modeling approaches have complementary strengths. Abstract: With the advent of large language models (LLMs), the vast unstructured text within millions of academic papers is increasingly accessible for materials discovery, although significant challenges remain. While LLMs offer promising few- and zero-shot learning capabilities, particularly valuable in the materials domain where expert annotations are scarce, general-purpose LLMs often fail to address key materials-specific queries without further adaptation. To bridge this gap, fine-tuning LLMs on human-labeled data is essential for effective structured knowledge extraction. In this study, we introduce a novel annotation schema designed to extract generic process-structure-properties relationships from scientific literature. We demonstrate the utility of this approach using a dataset of 128 abstracts, with annotations drawn from two distinct domains: high-temperature materials (Domain I) and uncertainty quantification in simulating materials microstructure (Domain II). Initially, we developed a conditional random field (CRF) model based on MatBERT, a domain-specific BERT variant, and evaluated its performance on Domain I. Subsequently, we compared this model with a fine-tuned LLM (GPT-4o from OpenAI) under identical conditions. Our results indicate that fine-tuning LLMs can significantly improve entity extraction performance over the BERT-CRF baseline on Domain I. However, when additional examples from Domain II were incorporated, the performance of the BERT-CRF model became comparable to that of the GPT-4o model. These findings underscore the potential of our schema for structured knowledge extraction and highlight the complementary strengths of both modeling approaches.

3D Scene Understanding Through Local Random Access Sequence Modeling

Wanhee Lee,Klemen Kotar,Rahul Mysore Venkatesh,Jared Watrous,Honglin Chen,Khai Loong Aw,Daniel L. K. Yamins

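Task: Propose Local Random Access Sequence (LRAS) modeling, an autoregressive generative approach that addresses object and scene consistency in 3D scene understanding from single images.

Motivation: Diffusion-based modeling approaches struggle to maintain object and scene consistency in complex real-world scenes, so a more effective approach is needed.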

Details

Method: LRAS modeling uses local patch quantization and randomly ordered sequence generation, with optical flow as an intermediate representation for 3D scene editing. Result: Experiments show that LRAS achieves state-of-the-art novel view synthesis and 3D object manipulation, and extends to self-supervised depth estimation through a simple modification. Conclusion: LRAS provides a unified and effective framework for building the next generation of 3D vision models. Abstract: 3D scene understanding from single images is a pivotal problem in computer vision with numerous downstream applications in graphics, augmented reality, and robotics. While diffusion-based modeling approaches have shown promise, they often struggle to maintain object and scene consistency, especially in complex real-world scenarios. To address these limitations, we propose an autoregressive generative approach called Local Random Access Sequence (LRAS) modeling, which uses local patch quantization and randomly ordered sequence generation. By utilizing optical flow as an intermediate representation for 3D scene editing, our experiments demonstrate that LRAS achieves state-of-the-art novel view synthesis and 3D object manipulation capabilities. Furthermore, we show that our framework naturally extends to self-supervised depth estimation through a simple modification of the sequence design. By achieving strong performance on multiple 3D scene understanding tasks, LRAS provides a unified and effective framework for building the next generation of 3D vision models.

Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

Siddharth Srikanth,Varun Bhatt,Boshen Zhang,Werner Hager,Charles Michael Lewis,Katia P. Sycara,Aaquib Tabrez,Stefanos Nikolaidis

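Task: Combine Quality Diversity (QD) optimization with agents powered by large language models (LLMs) to generate diverse team behavior.

Motivation: Understanding the diversity of human teaming behavior would improve human-agent teaming and AI-assisted decision-making, but large-scale user studies face logistical, ethical, and practical constraints.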

Details

Method: Combines QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior. Result: Experiments show that the approach replicates trends from human teaming data and captures behaviors that are hard to observe without collecting large amounts of data. Conclusion: The combination of QD and LLM-powered agents is an effective tool for studying teaming and communication strategies in multi-agent collaboration. Abstract: Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. However, obtaining a large set of diverse behaviors requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment (n=54 participants), that humans exhibit diverse coordination and communication behavior in this domain. We then show that our approach can effectively replicate trends from human teaming data and also capture behaviors that are not easily observed without collecting large amounts of data. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.
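The abstract above does not say which QD algorithm is used, so the following is only a minimal sketch of one common QD loop (MAP-Elites), under stated assumptions. A "prompt" is stood in for by two hypothetical style knobs, and `evaluate` is a toy placeholder for running an LLM-powered team in the environment; none of these names come from the paper.

```python
import random

def clamp(x):
    """Keep a knob value inside [0, 1]."""
    return max(0.0, min(1.0, x))

def init(rng):
    # Hypothetical prompt parameters: (verbosity, assertiveness).
    return (rng.random(), rng.random())

def mutate(parent, rng):
    # Small Gaussian perturbation of each knob.
    return tuple(clamp(v + rng.gauss(0.0, 0.2)) for v in parent)

def evaluate(candidate):
    """Toy stand-in for a team rollout: returns (quality, behaviour cell)."""
    verbosity, assertiveness = candidate
    quality = 1.0 - abs(verbosity - 0.5)          # assumed toy objective
    cell = (min(3, int(verbosity * 4)), min(3, int(assertiveness * 4)))
    return quality, cell

def map_elites(iterations, rng, warmup=10):
    """Archive maps each behaviour cell to its best (quality, solution)."""
    archive = {}
    for step in range(iterations):
        if step < warmup or not archive:
            candidate = init(rng)                  # random initial solutions
        else:
            parent = rng.choice(list(archive.values()))[1]
            candidate = mutate(parent, rng)        # vary an existing elite
        quality, cell = evaluate(candidate)
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, candidate)   # keep the best per cell
    return archive

archive = map_elites(iterations=200, rng=random.Random(0))
```

The archive keeps one elite per behavior cell rather than a single best solution, which is what makes the search return a diverse set of candidates.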

WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments

Jianhao Zheng,Zihan Zhu,Valentin Bieri,Marc Pollefeys,Songyou Peng,Iro Armeni

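Task: Propose WildGS-SLAM, a robust and efficient monocular RGB SLAM system for dynamic environments.

Motivation: Traditional SLAM systems assume static scenes and cannot effectively handle moving objects in dynamic environments.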

Details

Method: Integrates depth and uncertainty information and introduces an uncertainty map, predicted by a shallow multi-layer perceptron and DINOv2 features, that guides dynamic object removal and improves dense bundle adjustment and Gaussian map optimization. Result: Evaluated on multiple datasets, the system demonstrates artifact-free view synthesis and outperforms existing methods. Conclusion: WildGS-SLAM performs well in dynamic environments and outperforms the state of the art. Abstract: We present WildGS-SLAM, a robust and efficient monocular RGB SLAM system designed to handle dynamic environments by leveraging uncertainty-aware geometric mapping. Unlike traditional SLAM systems, which assume static scenes, our approach integrates depth and uncertainty information to enhance tracking, mapping, and rendering performance in the presence of moving objects. We introduce an uncertainty map, predicted by a shallow multi-layer perceptron and DINOv2 features, to guide dynamic object removal during both tracking and mapping. This uncertainty map enhances dense bundle adjustment and Gaussian map optimization, improving reconstruction accuracy. Our system is evaluated on multiple datasets and demonstrates artifact-free view synthesis. Results showcase WildGS-SLAM's superior performance in dynamic environments compared to state-of-the-art methods.

Rethinking Reflection in Pre-Training

Essential AI: Darsh J Shah,Peter Rushton,Somanshu Singla,Mohit Parmar,Kurt Smith,Yash Vanjani,Ashish Vaswani,Adarsh Chaluvaraju,Andrew Hojel,Andrew Ma,Anil Thomas,Anthony Polloreno,Ashish Tanwer,Burhan Drak Sibai,Divya S Mansingka,Divya Shivaprasad,Ishaan Shah,Karl Stratos,Khoi Nguyen,Michael Callahan,Michael Pust,Mrinal Iyer,Philip Monk,Platon Mazarakis,Ritvik Kapila,Saurabh Srivastava,Tim Romanski

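Task: Study the early development of language models' self-reflection ability during pre-training.

Motivation: Investigate whether language models already acquire the ability to self-correct during pre-training, rather than only during reinforcement learning.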

Details

Method: Introduces deliberate errors into chains-of-thought and tests whether the model can recognize and correct them to reach the correct answer. Result: Self-correction appears early in pre-training and improves steadily over time; for example, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction. Conclusion: Language models' self-reflection ability begins to develop during pre-training and strengthens as pre-training progresses. Abstract: A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier, during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.

Leveraging Gait Patterns as Biomarkers: An attention-guided Deep Multiple Instance Learning Network for Scoliosis Classification

Haiqing Li,Yuzhi Guo,Feng Jiang,Qifeng Zhou,Hehuan Ma,Junzhou Huang

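Task: Propose an attention-guided deep multi-instance learning method (Gait-MIL) for detecting scoliosis from gait patterns.

Motivation: Scoliosis is difficult to detect early; traditional methods rely on clinical expertise, and X-ray imaging poses radiation risks, which limits large-scale screening.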

Details

Method: Uses attention-guided deep multi-instance learning to extract discriminative features from gait patterns. Result: Evaluated on a large-scale gait-based dataset, the method significantly improves detection accuracy, especially on the hard-to-detect Neutral cases. Conclusion: Gait-MIL is robust under class imbalance and is a promising tool for large-scale scoliosis screening. Abstract: Scoliosis is a spinal curvature disorder that is difficult to detect early and can compress the chest cavity, impacting respiratory function and cardiac health. Especially for adolescents, delayed detection and treatment result in worsening compression. Traditional scoliosis detection methods heavily rely on clinical expertise, and X-ray imaging poses radiation risks, limiting large-scale early screening. We propose an Attention-Guided Deep Multi-Instance Learning method (Gait-MIL) to effectively capture discriminative features from gait patterns, which is inspired by ScoNet-MT's pioneering use of gait patterns for scoliosis detection. We evaluate our method on the first large-scale dataset based on gait patterns for scoliosis classification. The results demonstrate that our study improves the performance of using gait as a biomarker for scoliosis detection and significantly enhances detection accuracy for the particularly challenging Neutral cases, where subtle indicators are often overlooked. Our Gait-MIL also performs robustly in imbalanced scenarios, making it a promising tool for large-scale scoliosis screening.

myNER: Contextualized Burmese Named Entity Recognition with Bidirectional LSTM and fastText Embeddings via Joint Training with POS Tagging

Kaung Lwin Thant,Kwankamol Nongpong,Ye Kyaw Thu,Thura Aung,Khaing Hsu Wai,Thazin Myint Oo

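Task: Build and evaluate myNER, a named entity recognition (NER) corpus for Burmese, and test how different models perform on it.

Motivation: Address the neglect of low-resource languages such as Burmese in NER research, which stems from the lack of publicly available annotated datasets.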

Details

Method: Introduces the myNER corpus with a 7-tag annotation scheme and added POS tags, and evaluates CRF, BiLSTM-CRF, and their combinations with fastText embeddings. Result: CRF with fastText embeddings performs best (0.9818 accuracy, 0.9811 weighted F1), followed by BiLSTM-CRF with fine-tuned fastText embeddings (0.9791 accuracy, 0.9776 weighted F1). Conclusion: Contextualized word embeddings and joint training with POS tagging significantly improve model performance, providing effective tools and methods for NER in low-resource languages. Abstract: Named Entity Recognition (NER) involves identifying and categorizing named entities within textual data. Despite its significance, NER research has often overlooked low-resource languages like Myanmar (Burmese), primarily due to the lack of publicly available annotated datasets. To address this, we introduce myNER, a novel word-level NER corpus featuring a 7-tag annotation scheme, enriched with Part-of-Speech (POS) tagging to provide additional syntactic information. Alongside the corpus, we conduct a comprehensive evaluation of NER models, including Conditional Random Fields (CRF), Bidirectional LSTM (BiLSTM)-CRF, and their combinations with fastText embeddings in different settings. Our experiments reveal the effectiveness of contextualized word embeddings and the impact of joint training with POS tagging, demonstrating significant performance improvements across models. The traditional CRF joint-task model with fastText embeddings as a feature achieved the best result, with 0.9818 accuracy, 0.9811 weighted F1 score, and 0.7429 macro F1 score. For BiLSTM-CRF, fine-tuned fastText embeddings give the best result, with 0.9791 accuracy, 0.9776 weighted F1 score, and 0.7395 macro F1 score.
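The gap between the weighted F1 (about 0.98) and macro F1 (about 0.74) reported above is typical of token-level tagging with a dominant class. The sketch below illustrates how the two averages diverge; the labels and counts are made-up toy data, not figures from myNER.

```python
from collections import Counter

def per_class_f1(gold, pred):
    """Return {label: F1} over every label seen in gold or pred."""
    scores = {}
    for label in set(gold) | set(pred):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1: rare classes count as much as common ones."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(gold, pred):
    """Mean of per-class F1 weighted by each class's share of gold tokens."""
    scores = per_class_f1(gold, pred)
    support = Counter(gold)
    return sum(scores[label] * support[label] for label in scores) / len(gold)

# Toy data (assumed): 96 non-entity tokens, 4 person tokens.
# The tagger gets every "O" right but finds only 1 of the 4 "PER" tokens.
gold = ["O"] * 96 + ["PER"] * 4
pred = ["O"] * 96 + ["PER"] + ["O"] * 3

scores = per_class_f1(gold, pred)
macro = macro_f1(gold, pred)
weighted = weighted_f1(gold, pred)
```

In this toy case the rare class has F1 of 0.4, which pulls the macro average down sharply while barely moving the weighted average, so a low macro F1 next to a high weighted F1 suggests weak performance on infrequent entity tags.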

Improving Brain Disorder Diagnosis with Advanced Brain Function Representation and Kolmogorov-Arnold Networks

Tyler Ward,Abdullah-Al-Zubaer Imran

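Task: Propose a transformer-based classification network (AFBR-KAN) to improve the diagnosis of autism spectrum disorder (ASD).

Motivation: Traditional quantification of functional connectivity (FC) relies on a pre-defined brain atlas, which introduces selection bias and a lack of specificity.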

Details

Method: AFBR-KAN replaces traditional multi-layer perceptron (MLP) components with Kolmogorov-Arnold Network (KAN) blocks to improve brain function representation. Result: Experiments show that AFBR-KAN improves ASD diagnosis under various configurations of the model architecture. Conclusion: AFBR-KAN offers a more effective approach to ASD diagnosis. Abstract: Quantifying functional connectivity (FC), a vital metric for the diagnosis of various brain disorders, traditionally relies on the use of a pre-defined brain atlas. However, using such atlases can lead to issues regarding selection bias and lack of regard for specificity. Addressing this, we propose a novel transformer-based classification network (AFBR-KAN) with effective brain function representation to aid in diagnosing autism spectrum disorder (ASD). AFBR-KAN leverages Kolmogorov-Arnold Network (KAN) blocks in place of traditional multi-layer perceptron (MLP) components. Thorough experimentation reveals the effectiveness of AFBR-KAN in improving the diagnosis of ASD under various configurations of the model architecture. Our code is available at https://github.com/tbwa233/ABFR-KAN

SyLeR: A Framework for Explicit Syllogistic Legal Reasoning in Large Language Models

Kepu Zhang,Weijie Yu,Zhongxiang Sun,Jun Xu

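Task: Propose SyLeR, a framework that enables large language models to perform explicit syllogistic legal reasoning.

Motivation: Existing large language models lack explicit syllogistic reasoning when answering legal questions, producing implicit answers that lack explainability and trustworthiness.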

Details

Method: Uses a tree-structured hierarchical retrieval mechanism to combine legal statutes and precedent cases, then improves the model's reasoning through two-stage fine-tuning (supervised fine-tuning warm-up followed by reinforcement learning). Result: SyLeR significantly improves response accuracy and produces explicit, explainable, and trustworthy legal reasoning. Conclusion: SyLeR addresses the shortcomings of existing models in legal reasoning and improves the transparency and reliability of the reasoning. Abstract: Syllogistic reasoning is a fundamental aspect of legal decision-making, enabling logical conclusions by connecting general legal principles with specific case facts. Although existing large language models (LLMs) can generate responses to legal questions, they fail to perform explicit syllogistic reasoning, often producing implicit and unstructured answers that lack explainability and trustworthiness. To address this limitation, we propose SyLeR, a novel framework that empowers LLMs to engage in explicit syllogistic legal reasoning. SyLeR integrates a tree-structured hierarchical retrieval mechanism to effectively combine relevant legal statutes and precedent cases, forming comprehensive major premises. This is followed by a two-stage fine-tuning process: supervised fine-tuning warm-up establishes a foundational understanding of syllogistic reasoning, while reinforcement learning with a structure-aware reward mechanism refines the ability of the model to generate diverse, logically sound, and well-structured reasoning paths. We conducted extensive experiments across various dimensions, including in-domain and cross-domain user groups (legal laypersons and practitioners), multiple languages (Chinese and French), and different LLM backbones (legal-specific and open-domain LLMs). The results show that SyLeR significantly improves response accuracy and consistently delivers explicit, explainable, and trustworthy legal reasoning.

ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition

Sanjoy Kundu,Shanmukha Vellamchetti,Sathyanarayanan N. Aakur

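Task: Propose ProbRes, a framework for open-world egocentric activity recognition that efficiently infers unseen activities.

Motivation: Open-world egocentric activity recognition is challenging because models must infer unseen activities from a partially observed search space.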

Details

Method: A Probabilistic Residual search framework based on jump-diffusion that combines commonsense priors with vision-language models and uses a stochastic search mechanism to locate high-likelihood activity labels. Result: Achieves state-of-the-art performance on multiple datasets and establishes a clear taxonomy for open-world recognition. Conclusion: Structured search strategies are essential for efficient, scalable open-world activity recognition. Abstract: Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to efficiently locate high-likelihood activity labels while minimizing exhaustive enumeration. We systematically evaluate ProbRes across multiple openness levels (L0 - L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding. Our results highlight the importance of structured search strategies, paving the way for scalable and efficient open-world activity recognition.

FISH-Tuning: Enhancing PEFT Methods with Fisher Information

Kang Xue,Ming Dong,Xinhui Tu,Tingting He

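Task: Propose FISH-Tuning, a new method that combines FISH Mask with addition-based and reparameterization-based PEFT methods (such as LoRA and Adapters) to improve performance.

Motivation: The integration of FISH Mask with other PEFT methods (such as LoRA and Adapters) remains underexplored; combining them aims to reduce computational cost and improve performance.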

Details

Method: Uses Fisher information to select critical parameters, integrating FISH Mask into addition-based and reparameterization-based PEFT methods. Result: FISH-Tuning outperforms vanilla PEFT methods across various datasets and pre-trained models, with no additional memory overhead or inference latency. Conclusion: FISH-Tuning is an efficient, high-performing PEFT method applicable to a range of settings. Abstract: The rapid growth in the parameter size of Large Language Models (LLMs) has led to the development of Parameter-Efficient Fine-Tuning (PEFT) methods to alleviate the computational costs of fine-tuning. Among these, Fisher Induced Sparse uncHanging (FISH) Mask is a selection-based PEFT technique that identifies a subset of pre-trained parameters for fine-tuning based on approximate Fisher information. However, the integration of FISH Mask with other PEFT methods, such as LoRA and Adapters, remains underexplored. In this paper, we propose FISH-Tuning, a novel approach that incorporates FISH Mask into addition-based and reparameterization-based PEFT methods, including LoRA, Adapters, and their variants. By leveraging Fisher information to select critical parameters within these methods, FISH-Tuning achieves superior performance without additional memory overhead or inference latency. Experimental results across various datasets and pre-trained models demonstrate that FISH-Tuning consistently outperforms the vanilla PEFT methods with the same proportion of trainable parameters.
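As a rough illustration of selecting parameters by approximate Fisher information, the sketch below scores each parameter by the mean of its squared per-sample gradients (a diagonal empirical Fisher approximation, assumed here), keeps the top fraction, and updates only those. The flattened parameter vector stands in for the weights of a LoRA or Adapter module; how FISH-Tuning actually applies the mask inside those modules is not described in the abstract.

```python
def empirical_fisher(per_sample_grads):
    """Diagonal empirical Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def fish_mask(fisher, keep_ratio):
    """Binary mask that keeps the top keep_ratio fraction of parameters."""
    k = max(1, int(len(fisher) * keep_ratio))
    top = sorted(range(len(fisher)), key=lambda i: fisher[i], reverse=True)[:k]
    mask = [0] * len(fisher)
    for i in top:
        mask[i] = 1
    return mask

def masked_sgd_step(params, grads, mask, lr=0.1):
    """Plain SGD step in which masked-out parameters stay frozen."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]

# Toy gradients (assumed) for 4 adapter parameters over 3 samples.
grads = [
    [0.1, -2.0, 0.0, 0.5],
    [0.2, 1.5, 0.1, -0.4],
    [-0.1, -1.0, 0.0, 0.6],
]
fisher = empirical_fisher(grads)
mask = fish_mask(fisher, keep_ratio=0.5)
params = masked_sgd_step([1.0, 1.0, 1.0, 1.0], grads[-1], mask)
```

Because the mask is computed once and only gates gradient updates, the trainable-parameter budget is fixed by `keep_ratio` (a hypothetical name) rather than by the adapter's size.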

TGraphX: Tensor-Aware Graph Neural Network for Multi-Dimensional Feature Learning

Arash Sajjadi,Mark Eramian

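Task: Propose TGraphX, a new deep learning paradigm that unifies convolutional neural networks (CNNs) with graph neural networks (GNNs) to enhance visual reasoning tasks.

Motivation: Traditional CNNs excel at extracting spatial features from images but cannot model inter-object relationships, while conventional GNNs rely on flattened node features and lose spatial detail.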

Details

Method: Uses CNNs to generate multi-dimensional node features that preserve spatial semantics, performs message passing with 1×1 convolutions, and refines the fused messages with a deep CNN aggregator. Result: Significant improvements in object detection refinement and ensemble reasoning. Conclusion: TGraphX bridges the gap between spatial feature extraction and relational reasoning. Abstract: TGraphX presents a novel paradigm in deep learning by unifying convolutional neural networks (CNNs) with graph neural networks (GNNs) to enhance visual reasoning tasks. Traditional CNNs excel at extracting rich spatial features from images but lack the inherent capability to model inter-object relationships. Conversely, conventional GNNs typically rely on flattened node features, thereby discarding vital spatial details. TGraphX overcomes these limitations by employing CNNs to generate multi-dimensional node features (e.g., 3×128×128 tensors) that preserve local spatial semantics. These spatially aware nodes participate in a graph where message passing is performed using 1×1 convolutions, which fuse adjacent features while maintaining their structure. Furthermore, a deep CNN aggregator with residual connections is used to robustly refine the fused messages, ensuring stable gradient flow and end-to-end trainability. Our approach not only bridges the gap between spatial feature extraction and relational reasoning but also demonstrates significant improvements in object detection refinement and ensemble reasoning.
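To make the idea of message passing over tensor-valued nodes concrete, here is a small pure-Python sketch under simplifying assumptions. Each node holds a C×H×W feature map, a message is a 1×1 convolution over the channel-wise concatenation of source and destination features, and incoming messages are averaged and added back as a residual. The averaging step replaces the paper's deep CNN aggregator, and all function names are illustrative.

```python
def conv1x1(x, weight):
    """x: [C_in][H][W], weight: [C_out][C_in] -> [C_out][H][W].
    Mixes channels at each pixel, so the H x W layout is preserved."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [[sum(weight[o][c] * x[c][i][j] for c in range(c_in))
          for j in range(width)] for i in range(height)]
        for o in range(len(weight))
    ]

def add(a, b):
    """Element-wise sum of two [C][H][W] tensors."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

def scale(a, s):
    """Multiply every element of a [C][H][W] tensor by s."""
    return [[[v * s for v in row] for row in ch] for ch in a]

def message_passing(nodes, edges, weight):
    """nodes: {name: tensor}, edges: [(src, dst)], weight: [C][2C]."""
    incoming = {name: [] for name in nodes}
    for src, dst in edges:
        pair = nodes[src] + nodes[dst]        # concatenate along channels
        incoming[dst].append(conv1x1(pair, weight))
    updated = {}
    for name, feat in nodes.items():
        msgs = incoming[name]
        if not msgs:
            updated[name] = feat              # no neighbours: unchanged
            continue
        total = msgs[0]
        for m in msgs[1:]:
            total = add(total, m)
        updated[name] = add(feat, scale(total, 1.0 / len(msgs)))  # residual
    return updated

# Toy graph (assumed): two 1x2x2 nodes and one edge a -> b.
nodes = {"a": [[[1.0, 2.0], [3.0, 4.0]]], "b": [[[0.0, 0.0], [0.0, 0.0]]]}
out = message_passing(nodes, [("a", "b")], weight=[[0.5, 0.5]])
```

Node `b` ends up with a spatial map of the same shape as before, which is the property the abstract emphasizes: features are fused across the graph without flattening them.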

VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation

Yuhao Wang,Heyang Liu,Ziyang Cheng,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang

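Task: Propose VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech large language models (LLMs) for real-time voice interaction.

Motivation: Speech LLMs are increasingly important in speech processing, but existing methods fall short in generation speed and quality.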

Details

Method: Replaces conventional next-token prediction (NTP) with multi-token prediction (MTP) and develops a scalable, model-agnostic training framework. Result: VocalNet outperforms mainstream Omni LLMs with significantly less training data and surpasses existing open-source speech LLMs by a substantial margin. Conclusion: With MTP, VocalNet improves generation speed and quality at the same time, and its planned open-source release supports speech LLM research and the community. Abstract: Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We propose VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework for real-time voice interaction. Departing from the conventional next-token prediction (NTP), we introduce multi-token prediction (MTP), a novel approach optimized for speech LLMs that simultaneously improves generation speed and quality. Experiments show that VocalNet outperforms mainstream Omni LLMs despite using significantly less training data, while also surpassing existing open-source speech LLMs by a substantial margin. To support reproducibility and community advancement, we will open-source all model weights, inference code, training data, and framework implementations upon publication.

VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Dahun Kim,AJ Piergiovanni,Ganesh Mallya,Anelia Angelova

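Task: Propose VideoComp, a benchmark and learning framework for evaluating and improving video-text compositionality understanding.

Motivation: Existing benchmarks focus on static image-text compositionality or single-event videos and do not study fine-grained temporal alignment in continuous multi-event videos.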

Details

Method: Uses video-text datasets with temporally localized event captions (e.g. ActivityNet-Captions, YouCook2) to construct two compositional benchmarks, and proposes a hierarchical pairwise preference loss and a pretraining strategy. Result: Evaluates video-text foundational models and large multimodal models, revealing their strengths and weaknesses in compositionality. Conclusion: Provides a comprehensive framework for evaluating and improving fine-grained, temporally coherent video-text alignment. Abstract: We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark targets alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g. ActivityNet-Captions, YouCook2), we construct two compositional benchmarks, ActivityNet-Comp and YouCook2-Comp. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To improve model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences. We evaluate video-text foundational models and large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in compositionality. Overall, our work provides a comprehensive framework for evaluating and enhancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
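The abstract gives no formula for the hierarchical pairwise preference loss, so the following is only one plausible reading, written as a sketch. It assumes a list of video-text similarity scores ordered from the correct caption to the most disrupted one, and applies a logistic preference penalty to every pair so that each less-disrupted caption should score higher than each more-disrupted one.

```python
import math

def hierarchical_pairwise_loss(scores):
    """scores[0]: similarity for the correct caption; scores[k]: similarity
    for a caption at disruption level k (higher k = more disrupted).
    Returns the mean of -log(sigmoid(s_i - s_j)) over all pairs i < j."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to form a pair")
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            # log(1 + exp(-(s_i - s_j))) is small when s_i exceeds s_j.
            total += math.log1p(math.exp(-(scores[i] - scores[j])))
            pairs += 1
    return total / pairs

# Toy similarity scores (assumed): well ordered vs. reversed.
ordered_loss = hierarchical_pairwise_loss([3.0, 2.0, 1.0, 0.0])
reversed_loss = hierarchical_pairwise_loss([0.0, 1.0, 2.0, 3.0])
tie_loss = hierarchical_pairwise_loss([1.0, 1.0])
```

A ranking that respects the disruption hierarchy yields a lower loss than a reversed one, and tied scores give exactly log 2, which is a convenient sanity check when wiring such a loss into training code.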

Collaboration and Controversy Among Experts: Rumor Early Detection by Tuning a Comment Generator

Bing Wang,Bingrui Zhao,Ximing Li,Changchun Li,Wanfu Gao,Shengsheng Wang

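Task: Study Rumor Early Detection (RED) and propose a new framework, CAMERED, that improves detection by generating more human-like comments.

Motivation: Existing RED methods perform poorly when only a few early comments are available, whereas experiments show that models perform best when the number of training and test comments is consistent and extensive; generating more comments is meant to close this gap.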

Details

Method: Tunes a comment generator by simulating expert collaboration and controversy; the CAMERED framework comprises a generative language model with a mixture-of-experts structure, a novel routing network, a synthesized knowledgeable dataset, and an adversarial learning strategy. Result: CAMERED outperforms existing RED baseline models and generation methods in experiments. Conclusion: By generating more human-like comments, CAMERED improves rumor early detection, demonstrating the effectiveness of the approach. Abstract: Over the past decade, social media platforms have been key in spreading rumors, leading to significant negative impacts. To counter this, the community has developed various Rumor Detection (RD) algorithms to automatically identify them using user comments as evidence. However, these RD methods often fail in the early stages of rumor propagation when only limited user comments are available, leading the community to focus on a more challenging topic named Rumor Early Detection (RED). Typically, existing RED methods learn from limited semantics in early comments. However, our preliminary experiment reveals that the RED models always perform best when the number of training and test comments is consistent and extensive. This inspires us to address the RED issue by generating more human-like comments to support this hypothesis. To implement this idea, we tune a comment generator by simulating expert collaboration and controversy and propose a new RED framework named CAMERED. Specifically, we integrate a mixture-of-experts structure into a generative language model and present a novel routing network for expert collaboration. Additionally, we synthesize a knowledgeable dataset and design an adversarial learning strategy to align the style of generated comments with real-world comments. We further integrate generated and original comments with a mutual controversy fusion module. Experimental results show that CAMERED outperforms state-of-the-art RED baseline models and generation methods, demonstrating its effectiveness.

Edge Approximation Text Detector

Chuang Yang,Xu Han,Tao Han,Han Han,Bingxuan Zhao,Qi Wang

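Task: Propose EdgeText, a method that fits text contours compactly and simplifies the detection pipeline.

Motivation: Existing methods for representing irregular text shapes suffer from coarse contours or complex pipelines, which EdgeText aims to address.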

Details

Method: Approximates text edges with parameterized curve fitting functions, and adds a bilateral enhanced perception (BEP) module and a proportional integral loss (PI-loss) to improve the model. Result: EdgeText fits text contours compactly while reducing contour rebuilding steps, improving detection efficiency. Conclusion: Through curve fitting and the added modules, EdgeText addresses the shortcomings of existing methods and simplifies the text detection pipeline. Abstract: Pursuing efficient text shape representations helps scene text detection models focus on compact foreground regions and optimize the contour reconstruction steps to simplify the whole detection pipeline. Current approaches either represent irregular shapes via a box-to-polygon strategy or decompose a contour into pieces that are fitted gradually, so these models always suffer from either coarse contours or complex pipelines. Considering the above issues, we introduce EdgeText to fit text contours compactly while alleviating excessive contour rebuilding processes. Concretely, it is observed that the two long edges of texts can be regarded as smooth curves. This allows us to build contours via continuous and smooth edges that cover text regions tightly instead of fitting piecewise, which helps avoid the two limitations in current models. Inspired by this observation, EdgeText formulates the text representation as an edge approximation problem via parameterized curve fitting functions. In the inference stage, our model starts by locating text centers and then creates curve functions that approximate the text edges based on these points. Meanwhile, truncation points are determined based on the location features. Finally, curve segments are extracted from the curve functions, using the pixel coordinate information provided by the truncation points, to reconstruct text contours. Furthermore, considering the deep dependency of EdgeText on text edges, a bilateral enhanced perception (BEP) module is designed. It encourages our model to pay attention to the recognition of edge features. Additionally, to accelerate the learning of the curve function parameters, we introduce a proportional integral loss (PI-loss) to force the proposed model to focus on the curve distribution and avoid being disturbed by text scales.

A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLMs: Experiments with OpenAI Models

Aviv Brokman,Xuguang Ai,Yuhang Jiang,Shashank Gupta,Ramakanth Kavuluru

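Task: Explore the zero-shot performance of OpenAI large language models on biomedical relation extraction.

Motivation: Zero-shot methods can reduce the need for dataset annotation and domain expertise, but it is unclear how well these models perform on biomedical relation extraction.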

Details

Method: Uses OpenAI GPT-4-turbo and the reasoning model o1 for end-to-end relation extraction on seven datasets, generating structured output in two ways: by defining an explicit schema for the relation structure, or by inferring the structure from the prompt language. Result: Zero-shot performance is close to that of fine-tuned methods, but the models perform poorly on instances containing many relations and on the boundaries of textual mentions. Conclusion: Large language models show promising zero-shot capabilities in biomedical relation extraction, reducing data annotation and modeling needs, but their limitations must be addressed to improve reliability. Abstract: Objective: Zero-shot methodology promises to cut down on costs of dataset annotation and domain expertise needed to make use of NLP. Generative large language models trained to align with human goals have achieved high zero-shot performance across a wide variety of tasks. As of yet, it is unclear how well these models perform on biomedical relation extraction (RE). To address this knowledge gap, we explore patterns in the performance of OpenAI LLMs across a diverse sampling of RE tasks. Methods: We use OpenAI GPT-4-turbo and their reasoning model o1 to conduct end-to-end RE experiments on seven datasets. We use the JSON generation capabilities of GPT models to generate structured output in two ways: (1) by defining an explicit schema describing the structure of relations, and (2) using a setting that infers the structure from the prompt language. Results: Our work is the first to study and compare the performance of GPT-4 and o1 on the end-to-end zero-shot biomedical RE task across a broad array of datasets. We found the zero-shot performance to be close to that of fine-tuned methods. The limitations of this approach are that it performs poorly on instances containing many relations and errs on the boundaries of textual mentions. Conclusion: Recent large language models exhibit promising zero-shot capabilities in complex biomedical RE tasks, offering competitive performance with reduced dataset curation and NLP modeling needs at the cost of increased computing, potentially increasing medical community accessibility. Addressing the limitations we identify could further boost reliability. The code, data, and prompts for all our experiments are publicly available: https://github.com/bionlproc/ZeroShotRE
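Whichever way the structured output is requested, the returned JSON still has to be checked before scoring. The sketch below shows one way to validate a model's JSON string against an explicit relation schema; the field names (`head`, `relation`, `tail`), the relation labels, and the sample response are hypothetical and are not taken from the paper's prompts or code.

```python
import json

# Hypothetical schema: each relation is an object with three string fields.
RELATION_FIELDS = {"head": str, "relation": str, "tail": str}

def parse_relations(raw, allowed_relations):
    """Parse a JSON string and keep only well-formed, in-vocabulary triples."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []                                  # unparseable output
    if not isinstance(data, dict):
        return []
    triples = []
    for item in data.get("relations", []):
        if not isinstance(item, dict):
            continue
        if any(not isinstance(item.get(k), t) for k, t in RELATION_FIELDS.items()):
            continue                               # missing or mistyped field
        if item["relation"] not in allowed_relations:
            continue                               # label outside the schema
        triples.append((item["head"], item["relation"], item["tail"]))
    return triples

# Made-up model response: one valid triple, one unknown label, one missing tail.
raw_response = json.dumps({"relations": [
    {"head": "aspirin", "relation": "treats", "tail": "headache"},
    {"head": "aspirin", "relation": "likes", "tail": "ibuprofen"},
    {"head": "aspirin", "relation": "causes"},
]})
triples = parse_relations(raw_response, allowed_relations={"treats", "causes"})
```

Filtering at this stage separates formatting failures from genuine extraction errors, which matters when comparing an explicit-schema setting with one where the structure is inferred from the prompt.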

DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion

Maksim Siniukov,Di Chang,Minh Tran,Hongkun Gong,Ashutosh Chaubey,Mohammad Soleymani

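Task: Generate natural and nuanced listener motions to support extended interactions.

Motivation: Existing methods rely on low-dimensional motion codes followed by photorealistic rendering, which limits visual fidelity and expressive richness.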

Details

Method: Proposes DiTaiListener, built on a video diffusion model with multimodal conditions, which generates and then edits listener motions in two stages. Result: Achieves the best performance in photorealism and motion representation, and user studies show it clearly outperforms competitors. Conclusion: DiTaiListener performs well at generating natural listener motions and addresses the limitations of existing methods. Abstract: Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen. It then refines the transitional frames via DiTaiListener-Edit for a seamless transition. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) for the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process speakers' auditory and visual cues. CTM-Adapter integrates speakers' input in a causal manner into the video generation process to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. The model fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short video segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD metric on VICO) spaces. User studies confirm the superior performance of DiTaiListener, with the model being the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.

Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary

Michael J Bommarito,Daniel Martin Katz,Jillian Bommarito

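Task: Develop two high-precision, high-throughput sentence boundary detection libraries (NUPunkt and CharBoundary) for processing legal text.

Motivation: Address the challenges that specialized citations, abbreviations, and complex sentence structures in legal documents pose for general-purpose sentence boundary detection tools.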

Details

Method: Validates performance through experimental evaluation on five legal datasets; NUPunkt is implemented in pure Python, while CharBoundary is based on scikit-learn and the ONNX runtime. Result: NUPunkt reaches 91.1% precision at 10 million characters per second; CharBoundary achieves the best F1 score (0.782). Conclusion: The two libraries substantially improve the precision and efficiency of legal text processing, suit large-scale applications, and need no specialized hardware. Abstract: We present NUPunkt and CharBoundary, two sentence boundary detection libraries optimized for high-precision, high-throughput processing of legal text in large-scale applications such as due diligence, e-discovery, and legal research. These libraries address the critical challenges posed by legal documents containing specialized citations, abbreviations, and complex sentence structures that confound general-purpose sentence boundary detectors. Our experimental evaluation on five diverse legal datasets comprising over 25,000 documents and 197,000 annotated sentence boundaries demonstrates that NUPunkt achieves 91.1% precision while processing 10 million characters per second with modest memory requirements (432 MB). CharBoundary models offer balanced and adjustable precision-recall tradeoffs, with the large model achieving the highest F1 score (0.782) among all tested methods. Notably, NUPunkt provides a 29-32% precision improvement over general-purpose tools while maintaining exceptional throughput, processing multi-million document collections in minutes rather than hours. Both libraries run efficiently on standard CPU hardware without requiring specialized accelerators. NUPunkt is implemented in pure Python with zero external dependencies, while CharBoundary relies only on scikit-learn and optional ONNX runtime integration for optimized performance. Both libraries are available under the MIT license, can be installed via PyPI, and can be interactively tested at https://sentences.aleainstitute.ai/. These libraries address critical precision issues in retrieval-augmented generation systems by preserving coherent legal concepts across sentences, where each percentage improvement in precision yields exponentially greater reductions in context fragmentation, creating cascading benefits throughout retrieval pipelines and significantly enhancing downstream reasoning quality.
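
The precision and F1 figures quoted above are boundary-level metrics. As a minimal sketch of how such scores can be computed, assuming boundaries are represented as character offsets and must match exactly (the abstract does not state the evaluation protocol, so both are assumptions, and `boundary_scores` is a hypothetical helper that is not part of either library):

```python
def boundary_scores(predicted, gold):
    """Precision, recall, and F1 over sets of sentence-boundary character offsets."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy offsets: two of three predictions match the four annotated boundaries.
gold = [42, 97, 150, 210]
predicted = [42, 60, 150]
precision, recall, f1 = boundary_scores(predicted, gold)
```

A high-precision detector in this sense rarely splits where no boundary exists, for example inside a citation or after an abbreviation, which is the failure mode the abstract associates with context fragmentation in retrieval.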

Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAV Target Detection

Houzhang Fang,Xiaolin Wang,Zengyang Li,Lu Wang,Qingshan Li,Yi Chang,Luxin Yan

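Task: Propose a union framework, UniCD, that addresses infrared nonuniformity correction (NUC) and UAV target detection simultaneously.

Motivation: Existing methods usually treat infrared nonuniformity correction as a preprocessing step for detection, which leads to suboptimal performance; a union framework that balances correction and detection is needed.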

Details

Method: Models NUC as a parameter estimation problem with a small number of parameters, introduces an auxiliary loss with target mask supervision, and proposes a detection-guided self-supervised loss to reduce feature discrepancies between the two tasks. Result: Robustness and real-time processing capability of UniCD are validated on the newly constructed IRBFD dataset. Conclusion: UniCD is an efficient union framework that jointly optimizes infrared nonuniformity correction and UAV target detection. Abstract: Infrared unmanned aerial vehicle (UAV) images captured using thermal detectors are often affected by temperature-dependent low-frequency nonuniformity, which significantly reduces the contrast of the images. Detecting UAV targets under nonuniform conditions is crucial in UAV surveillance applications. Existing methods typically treat infrared nonuniformity correction (NUC) as a preprocessing step for detection, which leads to suboptimal performance. Balancing the two tasks while enhancing detection-beneficial information remains challenging. In this paper, we present a detection-friendly union framework, termed UniCD, that simultaneously addresses both infrared NUC and UAV target detection tasks in an end-to-end manner. We first model NUC as a parameter estimation problem with a small number of parameters, jointly driven by priors and data, to generate detection-conducive images. Then, we incorporate a new auxiliary loss with target mask supervision into the backbone of the infrared UAV target detection network to strengthen target features while suppressing the background. To better balance correction and detection, we introduce a detection-guided self-supervised loss to reduce feature discrepancies between the two tasks, thereby enhancing detection robustness to varying nonuniformity levels. Additionally, we construct a new benchmark composed of 50,000 infrared images in various nonuniformity types, multi-scale UAV targets and rich backgrounds with target annotations, called IRBFD. Extensive experiments on IRBFD demonstrate that our UniCD is a robust union framework for NUC and UAV target detection while achieving real-time processing capabilities. The dataset is available at https://github.com/IVPLaboratory/UniCD.

Cognitive Debiasing Large Language Models for Decision-Making

Yougang Lyu,Shijie Ren,Yue Feng,Zihan Wang,Zhumin Chen,Zhaochun Ren,Maarten de Rijke

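Task: Propose a method called self-debiasing that reduces cognitive biases of large language models (LLMs) in decision-making applications by iteratively refining prompts.

Motivation: Existing cognitive bias mitigation strategies assume that an input prompt contains only one type of cognitive bias and cannot cope with realistic settings in which several biases may be present.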

Details

Method: Uses a three-step iterative procedure (bias determination, bias analysis, and cognitive debiasing) to progressively remove potential cognitive biases from prompts. Result: Experiments show that self-debiasing achieves higher average accuracy than existing methods under no-bias, single-bias, and multi-bias settings. Conclusion: Self-debiasing effectively improves the reliability of LLMs in decision-making tasks. Abstract: Large language models (LLMs) have shown potential in supporting decision-making applications, particularly as personal conversational assistants in the financial, healthcare, and legal domains. While prompt engineering strategies have enhanced the capabilities of LLMs in decision-making, cognitive biases inherent to LLMs present significant challenges. Cognitive biases are systematic patterns of deviation from norms or rationality in decision-making that can lead to the production of inaccurate outputs. Existing cognitive bias mitigation strategies assume that input prompts contain (exactly) one type of cognitive bias and therefore fail to perform well in realistic settings where there may be any number of biases. To fill this gap, we propose a cognitive debiasing approach, called self-debiasing, that enhances the reliability of LLMs by iteratively refining prompts. Our method follows three sequential steps -- bias determination, bias analysis, and cognitive debiasing -- to iteratively mitigate potential cognitive biases in prompts. Experimental results on finance, healthcare, and legal decision-making tasks, using both closed-source and open-source LLMs, demonstrate that the proposed self-debiasing method outperforms both advanced prompt engineering methods and existing cognitive debiasing techniques in average accuracy under no-bias, single-bias, and multi-bias settings.
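
The three sequential steps named in the abstract can be sketched as a simple loop. This is only an illustration under stated assumptions: the instruction wording, the `max_rounds` limit, and the `llm` callable are hypothetical, and `toy_llm` is a rule-based stand-in for a real model that treats one fixed phrase as the only bias.

```python
from typing import Callable


def self_debias(prompt: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Iteratively refine a prompt: determine bias, analyze it, then rewrite."""
    for _ in range(max_rounds):
        # Step 1: bias determination. Stop as soon as no bias is reported.
        verdict = llm(
            "Does the following prompt contain a cognitive bias? Answer yes or no."
            f"\n\n{prompt}"
        )
        if verdict.strip().lower().startswith("no"):
            break
        # Step 2: bias analysis.
        analysis = llm(f"Identify the cognitive biases in the following prompt.\n\n{prompt}")
        # Step 3: cognitive debiasing, i.e. rewriting the prompt.
        prompt = llm(
            "Rewrite the prompt to remove these biases while keeping the task."
            f"\n\nBiases: {analysis}\n\nPrompt: {prompt}"
        )
    return prompt


def toy_llm(request: str) -> str:
    """Hypothetical stand-in for a model; flags one fixed phrase as biased."""
    biased_phrase = "Everyone agrees this stock will rise. "
    body = request.split("\n\n", 1)[1]
    if request.startswith("Does"):
        return "yes" if biased_phrase.strip() in body else "no"
    if request.startswith("Identify"):
        return "bandwagon framing"
    return body.split("Prompt: ", 1)[1].replace(biased_phrase, "")


cleaned = self_debias("Everyone agrees this stock will rise. Should I buy it?", toy_llm)
```

Because the determination step is re-run after every rewrite, the loop can handle a prompt with several biases, or none at all, without assuming the count in advance.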

Window Token Concatenation for Efficient Visual Large Language Models

Yifan Li,Wentao Bao,Botao Ye,Zhen Tan,Tianlong Chen,Huan Liu,Yu Kong

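Task: Propose a new method, Window Token Concatenation (WiCo), to reduce the number of visual tokens in Visual Large Language Models (VLLMs).

Motivation: Directly concatenating adjacent visual tokens can obscure fine details, so the tokens need to be adjusted adaptively to preserve fine-grained information.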

Details

Method: Concatenates spatially adjacent visual tokens with a sliding window and fine-tunes the last few layers of the vision encoder to adjust the tokens adaptively; further proposes WiCo+, which decomposes the visual tokens in later layers of the LLM. Result: Outperforms existing token reduction methods on both coarse- and fine-grained visual understanding tasks. Conclusion: WiCo and WiCo+ improve fine-grained visual understanding while keeping inference efficient. Abstract: To effectively reduce the visual tokens in Visual Large Language Models (VLLMs), we propose a novel approach called Window Token Concatenation (WiCo). Specifically, we employ a sliding window to concatenate spatially adjacent visual tokens. However, directly concatenating these tokens may group diverse tokens into one, and thus obscure some fine details. To address this challenge, we propose fine-tuning the last few layers of the vision encoder to adaptively adjust the visual tokens, encouraging tokens within the same window to exhibit similar features. To further enhance the performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in later layers of the LLM. Such a design enjoys the merits of the large perception field of the LLM for fine-grained visual understanding while keeping a small number of visual tokens for efficient inference. We perform extensive experiments on both coarse- and fine-grained visual understanding tasks based on LLaVA-1.5 and Shikra, showing better performance compared with existing token reduction projectors. The code is available at https://github.com/JackYFL/WiCo.
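
The window concatenation step can be illustrated with plain lists. This sketch assumes non-overlapping square windows (stride equal to the window size) and a row-major token grid; the abstract does not give these details, and `window_concat` is a hypothetical helper, not the released implementation.

```python
def window_concat(tokens, height, width, window):
    """Concatenate each window x window block of a row-major token grid into one token."""
    assert len(tokens) == height * width
    assert height % window == 0 and width % window == 0
    merged = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            group = []
            for dy in range(window):
                for dx in range(window):
                    group.extend(tokens[(top + dy) * width + (left + dx)])
            merged.append(group)
    return merged


# A 4x4 grid of 2-dimensional tokens becomes 4 tokens of dimension 8.
grid = [[float(i), float(i) + 0.5] for i in range(16)]
merged = window_concat(grid, height=4, width=4, window=2)
```

With a window of size w, the token count drops by a factor of w squared while each merged token grows by the same factor, so a projection back to the LLM's hidden size would normally follow. If the tokens inside a window differ a lot, that projection has to compress dissimilar content, which is the motivation the abstract gives for fine-tuning the last encoder layers.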

My Life in Artificial Intelligence: People, anecdotes, and some lessons learnt

Kees van Deemter

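Task: Describe the author's 40 years of research and teaching experience in artificial intelligence and natural language processing.

Motivation: Share personal experiences and insights as a reference for younger colleagues, particularly now that artificial intelligence is becoming a mainstream field.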

Details

Method: Presents the author's work in different countries and institutions through personal recollections and anecdotes, set against the history of artificial intelligence. Result: Offers a distinctive perspective on the development of artificial intelligence and on personal career choices. Conclusion: The author hopes that his experiences will encourage younger colleagues and inform their own career decisions. Abstract: In this very personal workography, I relate my 40-year experiences as a researcher and educator in and around Artificial Intelligence (AI), more specifically Natural Language Processing. I describe how curiosity, and the circumstances of the day, led me to work in both industry and academia, and in various countries, including The Netherlands (Amsterdam, Eindhoven, and Utrecht), the USA (Stanford), England (Brighton), Scotland (Aberdeen), and China (Beijing and Harbin). People and anecdotes play a large role in my story; the history of AI forms its backdrop. I focus on things that might be of interest to (even) younger colleagues, given the choices they face in their own work and life at a time when AI is finally emerging from the shadows.

Artificial intelligence application in lymphoma diagnosis: from Convolutional Neural Network to Vision Transformer

Daniel Rivera,Jacob Huddin,Alexander Banerjee,Rongzhen Zhang,Brenda Mai,Hanadi El Achi,Jacob Armstrong,Amer Wahed,Andy Nguyen

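Task: Compare the performance of a vision transformer and a convolutional neural network in distinguishing anaplastic large cell lymphoma from classical Hodgkin lymphoma.

Motivation: Vision transformers perform well on large-scale datasets, but their use in pathology image diagnosis has not been fully explored.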

Details

Method: Trains a vision transformer and a convolutional neural network on the same dataset (whole slide images from 20 cases, with 60 image patches extracted per case) and compares their classification performance. Result: Both models reach 100% accuracy on the test set. Conclusion: On a small dataset the vision transformer performs on par with the convolutional neural network, but the latter has a more mature architecture and suits settings without large-scale pretraining. Abstract: Recently, vision transformers were shown to be capable of outperforming convolutional neural networks when pretrained on sufficiently large datasets. Vision transformer models show good accuracy on large scale datasets, with features of multi-modal training. Due to their promising feature detection, we aim to explore vision transformer models for diagnosis of anaplastic large cell lymphoma versus classical Hodgkin lymphoma using pathology whole slide images of HE slides. We compared the classification performance of the vision transformer to our previously designed convolutional neural network on the same dataset. The dataset includes whole slide images of HE slides for 20 cases, including 10 cases in each diagnostic category. From each whole slide image, 60 image patches having size of 100 by 100 pixels and at magnification of 20 were obtained to yield 1200 image patches, from which 90 percent were used for training, 9 percent for validation, and 10 percent for testing. The test results from the convolutional neural network model had previously shown an excellent diagnostic accuracy of 100 percent. The test results from the vision transformer model also showed a comparable accuracy at 100 percent. To the best of the authors' knowledge, this is the first direct comparison of predictive performance between a vision transformer model and a convolutional neural network model using the same dataset of lymphoma. Overall, convolutional neural network has a more mature architecture than vision transformer and is usually the best choice when large scale pretraining is not an available option. Nevertheless, our current study shows comparable and excellent accuracy of vision transformer compared to that of convolutional neural network even with a relatively small dataset of anaplastic large cell lymphoma and classical Hodgkin lymphoma.

Reasoning on Multiple Needles In A Haystack

Yidong Wang

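Task: Address the problem of models answering directly from internal knowledge in long-context question answering, and reveal the cause of the performance decline.

Motivation: Existing approaches neither prevent models from answering directly from memory nor explain why performance drops as context length increases.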

Details

Method: Filters out direct-answer questions, decomposes the thinking process into retrieval and reasoning stages, and introduces a reflection mechanism for multi-round extension. Result: Shows that the performance decline is driven mainly by a shorter thinking process, and mitigates it by training a model. Conclusion: The proposed retrieval-reflection mechanism improves GPT-4o's performance on AIME2024 in mathematical reasoning scenarios. Abstract: The Needle In A Haystack (NIAH) task has been widely used to evaluate the long-context question-answering capabilities of Large Language Models (LLMs). However, its reliance on simple retrieval limits its effectiveness. To address this limitation, recent studies have introduced the Multiple Needles In A Haystack Reasoning (MNIAH-R) task, which incorporates supporting documents (Multiple needles) of multi-hop reasoning tasks into a distracting context (Haystack). Despite this advancement, existing approaches still fail to address the issue of models providing direct answers from internal knowledge, and they do not explain or mitigate the decline in accuracy as context length increases. In this paper, we tackle the memory-based answering problem by filtering out direct-answer questions, and we reveal that performance degradation is primarily driven by the reduction in the length of the thinking process as the input length increases. Building on this insight, we decompose the thinking process into retrieval and reasoning stages and introduce a reflection mechanism for multi-round extension. We also train a model using the generated iterative thinking process, which helps mitigate the performance degradation. Furthermore, we demonstrate the application of this retrieval-reflection capability in mathematical reasoning scenarios, improving GPT-4o's performance on AIME2024.

Simultaneous Motion And Noise Estimation with Event Cameras

Shintaro Shiba,Yoshimitsu Aoki,Guillermo Gallego

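Task: Propose a method that estimates motion and noise simultaneously for event camera denoising.

Motivation: Existing event camera denoising methods handle tasks such as motion estimation separately from denoising, yet motion is an intrinsic part of event data, so the two should be handled together.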

Details

Method: Achieves flexible joint estimation of motion and noise by allowing the 1-step motion estimation of the widely used Contrast Maximization framework to be replaced with other motion estimators, such as deep neural networks. Result: Reaches state-of-the-art results on the E-MLB denoising benchmark and competitive results on the DND21 benchmark, and is also effective on motion estimation and intensity reconstruction tasks. Conclusion: The method strengthens the theory of event-data denoising and has practical value, with code to be released upon acceptance. Abstract: Event cameras are emerging vision sensors, whose noise is challenging to characterize. Existing denoising methods for event cameras consider other tasks such as motion estimation separately (i.e., sequentially after denoising). However, motion is an intrinsic part of event data, since scene edges cannot be sensed without motion. This work proposes, to the best of our knowledge, the first method that simultaneously estimates motion in its various forms (e.g., ego-motion, optical flow) and noise. The method is flexible, as it allows replacing the 1-step motion estimation of the widely-used Contrast Maximization framework with any other motion estimator, such as deep neural networks. The experiments show that the proposed method achieves state-of-the-art results on the E-MLB denoising benchmark and competitive results on the DND21 benchmark, while showing its efficacy on motion estimation and intensity reconstruction tasks. We believe that the proposed approach contributes to strengthening the theory of event-data denoising, as well as impacting practical denoising use-cases, as we release the code upon acceptance. Project page: https://github.com/tub-rip/ESMD
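
For readers unfamiliar with Contrast Maximization, the following sketch shows the core objective in one spatial dimension: events are warped back along a candidate velocity, accumulated into an image of warped events, and the candidate with the highest image variance is kept. It illustrates only this standard objective, not the paper's joint motion-and-noise estimator; the synthetic events, candidate grid, and function names are assumptions.

```python
def iwe_variance(events, velocity, width):
    """Variance of the 1-D image of warped events for a candidate velocity.

    events: list of (x, t) pairs, x in pixels and t in seconds.
    """
    image = [0] * width
    for x, t in events:
        warped = round(x - velocity * t)  # move the event back to its position at t = 0
        if 0 <= warped < width:
            image[warped] += 1
    mean = sum(image) / width
    return sum((count - mean) ** 2 for count in image) / width


# Synthetic edge starting at pixel 5 and moving at 20 px/s, sampled every 10 ms.
events = [(5 + 20 * (k / 100), k / 100) for k in range(50)]

candidates = [0.0, 10.0, 20.0, 30.0]
best_velocity = max(candidates, key=lambda v: iwe_variance(events, v, width=32))
```

With the correct velocity all events collapse onto one pixel, so the warped image is sharp and its variance is maximal. Events that do not align under the best motion are natural noise candidates, which gives some intuition for why estimating motion and noise together is attractive.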

STEP: Staged Parameter-Efficient Pre-training for Large Language Models

Kazuki Yano,Takumi Ito,Jun Suzuki

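Task: Propose a method called STEP that reduces the memory required to pre-train large language models.

Motivation: When pre-training large language models, the parameter count drives enormous memory requirements, so an efficient solution is needed.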

Details

Method: Combines parameter-efficient tuning techniques with a model growth strategy to form STaged parameter-Efficient Pre-training (STEP). Result: STEP reduces maximum memory requirements by up to 53.9% while maintaining performance, and performs comparably to vanilla pre-trained models on downstream tasks. Conclusion: STEP is an efficient pre-training method with comparable performance and substantially lower memory requirements. Abstract: Pre-training large language models (LLMs) faces significant memory challenges due to the large size of model parameters. We introduce STaged parameter-Efficient Pre-training (STEP), which integrates parameter-efficient tuning techniques with model growth. We conduct experiments on pre-training LLMs of various sizes and demonstrate that STEP achieves up to a 53.9% reduction in maximum memory requirements compared to vanilla pre-training while maintaining equivalent performance. Furthermore, we show that the model trained with STEP performs comparably to vanilla pre-trained models on downstream tasks after instruction tuning.

UCS: A Universal Model for Curvilinear Structure Segmentation

Dianshuo Li,Li Chen,Yunxiang Cao,Kai Zhu,Jun Cheng

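Task: Propose a Universal Curvilinear structure Segmentation (UCS) model that strengthens the generalization of the Segment Anything Model (SAM) on curvilinear structure segmentation tasks.

Motivation: Existing methods perform well within specific domains but generalize poorly, while SAM generalizes well but is not optimized for curvilinear structures.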

Details

Method: UCS improves the SAM encoder with a Sparse Adapter and a Prompt Generation module, and uses a mask decoder built on a dual-compression module. Result: On multi-domain datasets, UCS shows state-of-the-art generalization and open-set segmentation performance. Conclusion: UCS sets a new benchmark for universal curvilinear structure segmentation. Abstract: Curvilinear structure segmentation (CSS) is vital in various domains, including medical imaging, landscape analysis, industrial surface inspection, and plant analysis. While existing methods achieve high performance within specific domains, their generalizability is limited. On the other hand, large-scale models such as Segment Anything Model (SAM) exhibit strong generalization but are not optimized for curvilinear structures. Existing adaptations of SAM primarily focus on general object segmentation and lack specialized design for CSS tasks. To bridge this gap, we propose the Universal Curvilinear structure Segmentation (UCS) model, which adapts SAM to CSS tasks while enhancing its generalization. UCS features a novel encoder architecture integrating a pretrained SAM encoder with two innovations: a Sparse Adapter, strategically inserted to inherit the pre-trained SAM encoder's generalization capability while minimizing the number of fine-tuning parameters, and a Prompt Generation module, which leverages Fast Fourier Transform with a high-pass filter to generate curve-specific prompts. Furthermore, UCS incorporates a mask decoder that eliminates reliance on manual interaction through a dual-compression module: a Hierarchical Feature Compression module, which aggregates the outputs of the sampled encoder to enhance detail preservation, and a Guidance Feature Compression module, which extracts and compresses image-driven guidance features. Evaluated on a comprehensive multi-domain dataset, including an in-house dataset covering eight natural curvilinear structures, UCS demonstrates state-of-the-art generalization and open-set segmentation performance across medical, engineering, natural, and plant imagery, establishing a new benchmark for universal CSS.

Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources

Zihao Li,Shaoxiong Ji,Hengyu Luo,Jörg Tiedemann

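Task: Systematically evaluate how 36 continual pretraining (CPT) configurations affect multilingual model performance.

Motivation: Address the uneven performance of large language models (LLMs) across languages, especially the gap between high-resource and low-resource languages.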

Details

Method: Systematically evaluates more than 30 languages using three multilingual base models with monolingual, bilingual, and code-augmented data strategies. Result: Bilingual CPT improves multilingual classification but causes language mixing during generation; code data improves classification but slightly degrades generation quality; the effect of language categories on cross-lingual transfer differs from prior work. Conclusion: Multilingual representation learning is complex, and systematic study of language classification is needed to guide future multilingual CPT strategies. Abstract: Large Language Models (LLMs) exhibit significant disparities in performance across languages, primarily benefiting high-resource languages while marginalizing underrepresented ones. Continual Pretraining (CPT) has emerged as a promising approach to address this imbalance, although the relative effectiveness of monolingual, bilingual, and code-augmented data strategies remains unclear. This study systematically evaluates 36 CPT configurations involving three multilingual base models, across 30+ languages categorized as altruistic, selfish, and stagnant, spanning various resource levels. Our findings reveal three major insights: (1) Bilingual CPT improves multilingual classification but often causes language mixing issues during generation. (2) Including programming code data during CPT consistently enhances multilingual classification accuracy, particularly benefiting low-resource languages, but introduces a trade-off by slightly degrading generation quality. (3) Contrary to prior work, we observe substantial deviations from language classifications according to their impact on cross-lingual transfer: Languages classified as altruistic often negatively affect related languages, selfish languages show conditional and configuration-dependent behavior, and stagnant languages demonstrate surprising adaptability under certain CPT conditions. These nuanced interactions emphasize the complexity of multilingual representation learning, underscoring the importance of systematic studies on generalizable language classification to inform future multilingual CPT strategies.

A Survey of Pathology Foundation Model: Progress and Future Directions

Conghao Xiong,Hao Chen,Joseph J. Y. Sung

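Task: Propose a hierarchical taxonomy for systematically analyzing pathology foundation models (PFMs) and their evaluation tasks.

Motivation: Pathology foundation models currently lack a systematic analysis framework, which limits their application and development in computational pathology.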

Details

Method: Analyzes PFMs through a hierarchical taxonomy of model scope, model pretraining, and model design, and groups their evaluation tasks into slide-level, patch-level, multimodal, and biological tasks. Result: Provides a comprehensive PFM analysis framework and benchmarking criteria, and identifies key challenges in PFM development and utilization. Conclusion: The survey points to future directions for pathology foundation models and provides supporting resources. Abstract: Computational pathology, analyzing whole slide images for automated cancer diagnosis, relies on the multiple instance learning framework where performance heavily depends on the feature extractor and aggregator. Recent Pathology Foundation Models (PFMs), pretrained on large-scale histopathology data, have significantly enhanced capabilities of extractors and aggregators but lack systematic analysis frameworks. This survey presents a hierarchical taxonomy organizing PFMs through a top-down philosophy that can be utilized to analyze FMs in any domain: model scope, model pretraining, and model design. Additionally, we systematically categorize PFM evaluation tasks into slide-level, patch-level, multimodal, and biological tasks, providing comprehensive benchmarking criteria. Our analysis identifies critical challenges in both PFM development (pathology-specific methodology, end-to-end pretraining, data-model scalability) and utilization (effective adaptation, model maintenance), paving the way for future directions in this promising field. Resources referenced in this survey are available at https://github.com/BearCleverProud/AwesomeWSI.

GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

Hengyu Luo,Zihao Li,Joseph Attieh,Sawal Devkota,Ona de Gibert,Shaoxiong Ji,Peiqin Lin,Bhavani Sai Praneeth Varma Mantina,Ananda Sreenidhi,Raúl Vázquez,Mengjie Wang,Samea Yusofi,Jörg Tiedemann

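Task: Introduce GlotEval, a lightweight framework for massively multilingual evaluation.

Motivation: Existing evaluation frameworks focus too heavily on English and high-resource languages and overlook model performance in multilingual and low-resource settings.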

Details

Method: Designs the GlotEval framework, which supports seven key tasks across dozens to hundreds of languages and emphasizes consistent multilingual benchmarking and language-specific prompt templates. Result: GlotEval enables precise diagnosis of model strengths and weaknesses in different linguistic contexts, and a multilingual translation case study demonstrates its applicability. Conclusion: GlotEval fills a gap in multilingual evaluation and offers a practical tool for academia and industry. Abstract: Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary language. Evaluation of these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks are disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this gap, we introduce GlotEval, a lightweight framework designed for massively multilingual evaluation. Supporting seven key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, and intrinsic evaluation), spanning dozens to hundreds of languages, GlotEval highlights consistent multilingual benchmarking, language-specific prompt templates, and non-English-centric machine translation. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval's applicability for multilingual and language-specific evaluations.

Can You Count to Nine? A Human Evaluation Benchmark for Counting Limits in Modern Text-to-Video Models

Xuyang Guo,Zekai Huang,Jiayan Huo,Yingyu Liang,Zhenmei Shi,Zhao Song,Jiahao Zhang

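Task: Evaluate the counting ability of state-of-the-art text-to-video models as of 2025.

Motivation: Although generative models have made substantial progress on text-to-video tasks, they still struggle to follow simple numerical constraints.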

Details

Method: Proposes the T2VCountBench benchmark, which measures the number of generated objects through rigorous human evaluation and covers a range of open-source and commercial models. Result: All existing models perform poorly on basic numerical tasks, almost always failing to generate videos with an object count of 9 or fewer. Conclusion: The study exposes an important challenge in current text-to-video generation and gives direction for improving adherence to numerical constraints. Abstract: Generative models have driven significant progress in a variety of AI tasks, including text-to-video generation, where models like Video LDM and Stable Video Diffusion can produce realistic, movie-level videos from textual instructions. Despite these advances, current text-to-video models still face fundamental challenges in reliably following human commands, particularly in adhering to simple numerical constraints. In this work, we present T2VCountBench, a specialized benchmark aiming at evaluating the counting capability of SOTA text-to-video models as of 2025. Our benchmark employs rigorous human evaluations to measure the number of generated objects and covers a diverse range of generators, including both open-source and commercial models. Extensive experiments reveal that all existing models struggle with basic numerical tasks, almost always failing to generate videos with an object count of 9 or fewer. Furthermore, our comprehensive ablation studies explore how factors like video style, temporal dynamics, and multilingual inputs may influence counting performance. We also explore prompt refinement techniques and demonstrate that decomposing the task into smaller subtasks does not easily alleviate these limitations. Our findings highlight important challenges in current text-to-video generation and provide insights for future research aimed at improving adherence to basic numerical constraints.

Adaptive Elicitation of Latent Information Using Natural Language

Jimmy Wang,Thomas Zollo,Richard Zemel,Hongseok Namkoong

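Task: Propose an adaptive elicitation framework that optimizes information-gathering strategies by actively reducing uncertainty about a latent entity.

Motivation: Natural language is a powerful medium for eliciting information, but existing large language models and fine-tuning algorithms lack mechanisms for gathering information strategically and cannot effectively reduce uncertainty about a latent entity.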

Details

Method: Adopts a predictive view of uncertainty, using a meta-learned language model to simulate future observations for scalable uncertainty quantification over complex natural language, and uses autoregressive forward simulation to select the most informative queries. Result: In experiments on the 20 questions game, dynamic opinion polling, and adaptive student assessment, the method outperforms baselines at identifying critical unknowns and improving downstream predictions. Conclusion: The framework shows the promise of strategic information gathering in natural language settings, reducing uncertainty and improving elicitation strategies. Abstract: Eliciting information to reduce uncertainty about a latent entity is a critical task in many application domains, e.g., assessing individual student learning outcomes, diagnosing underlying diseases, or learning user preferences. Though natural language is a powerful medium for this purpose, large language models (LLMs) and existing fine-tuning algorithms lack mechanisms for strategically gathering information to refine their own understanding of the latent entity. To harness the generalization power and world knowledge of LLMs in developing effective information-gathering strategies, we propose an adaptive elicitation framework that actively reduces uncertainty on the latent entity. Since probabilistic modeling of an abstract latent entity is difficult, our framework adopts a predictive view of uncertainty, using a meta-learned language model to simulate future observations and enable scalable uncertainty quantification over complex natural language. Through autoregressive forward simulation, our model quantifies how new questions reduce epistemic uncertainty, enabling the development of sophisticated information-gathering strategies to choose the most informative next queries. In experiments on the 20 questions game, dynamic opinion polling, and adaptive student assessment, our method consistently outperforms baselines in identifying critical unknowns and improving downstream predictions, illustrating the promise of strategic information gathering in natural language settings.
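
The idea of choosing the most informative next query can be illustrated with a classical expected-information-gain calculation on a toy 20 questions game. This is a deliberate simplification: it enumerates an explicit hypothesis set, whereas the abstract describes a predictive approach that avoids modeling the latent entity directly. The hypotheses, questions, and function names are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(prior, answers):
    """Expected entropy reduction from asking one yes/no question.

    prior: dict mapping hypothesis -> probability.
    answers: dict mapping hypothesis -> bool, the answer under that hypothesis.
    """
    gain = entropy(prior.values())
    for outcome in (True, False):
        mass = sum(p for h, p in prior.items() if answers[h] == outcome)
        if mass > 0:
            posterior = [p / mass for h, p in prior.items() if answers[h] == outcome]
            gain -= mass * entropy(posterior)
    return gain


prior = {"cat": 0.25, "dog": 0.25, "eagle": 0.25, "shark": 0.25}
questions = {
    "Is it a mammal?": {"cat": True, "dog": True, "eagle": False, "shark": False},
    "Is it a shark?": {"cat": False, "dog": False, "eagle": False, "shark": True},
}
best_question = max(questions, key=lambda q: expected_information_gain(prior, questions[q]))
```

The question that splits the remaining probability mass evenly yields a full bit of information, while the narrow question yields less on average, so a greedy strategy asks the broad question first.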

UniRVQA: A Unified Framework for Retrieval-Augmented Vision Question Answering via Self-Reflective Joint Training

Jiaqi Deng,Kaize Shi,Zonghan Wu,Huan Huo,Dingxian Wang,Guandong Xu

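Task: Propose a unified retrieval-augmented visual question answering framework (UniRVQA) to address the performance problems caused by separating retrieval and generation in existing methods.

Motivation: Existing KB-VQA systems usually use separate retriever and generator frameworks, which limits parametric knowledge sharing, and general multimodal pre-trained models perform poorly on fine-grained retrieval.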

Details

Method: UniRVQA adapts multimodal pre-trained models within a unified framework and introduces a reflective-answering mechanism and late interaction to strengthen fine-grained understanding and knowledge-boundary assessment. Result: UniRVQA improves answering accuracy by 4.7% and raises the VQA performance of base MLLMs by 7.5% on average. Conclusion: Through its unified framework and reflective mechanism, UniRVQA improves performance on knowledge-intensive visual question answering. Abstract: Knowledge-based Vision Question Answering (KB-VQA) systems address complex visual-grounded questions requiring external knowledge, such as web-sourced encyclopedia articles. Existing methods often use sequential and separate frameworks for the retriever and the generator with limited parametric knowledge sharing. However, since both retrieval and generation tasks require accurate understanding of contextual and external information, such separation can potentially lead to suboptimal system performance. Another key challenge is the integration of multimodal information. General-purpose multimodal pre-trained models, while adept at multimodal representation learning, struggle with fine-grained retrieval required for knowledge-intensive visual questions. Recent specialized pre-trained models mitigate the issue, but are computationally expensive. To bridge the gap, we propose a Unified Retrieval-Augmented VQA framework (UniRVQA). UniRVQA adapts general multimodal pre-trained models for fine-grained knowledge-intensive tasks within a unified framework, enabling cross-task parametric knowledge sharing and the extension of existing multimodal representation learning capability. We further introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Additionally, we integrate late interaction into the retrieval-augmented generation joint training process to enhance fine-grained understanding of queries and documents. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs' VQA performance.
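
Late interaction usually refers to scoring a query against a document at the token level: each query token is matched to its most similar document token, and the maxima are summed. The sketch below shows that common formulation with toy embeddings; whether UniRVQA uses exactly this scoring is not stated in the abstract, so it should be read as an assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
documents = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # one matching token per query token
    "doc_b": [[0.5, 0.5]],              # a single blended token
}
scores = {name: late_interaction_score(query, vecs) for name, vecs in documents.items()}
best_doc = max(scores, key=scores.get)
```

Because matching happens per token rather than on one pooled vector, a document that contains a precise match for each query token scores higher than one that only resembles the query on average, which is the fine-grained behavior the abstract is after.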

Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability

Vishnu Kabir Chhabra,Mohammad Mahdi Khalili

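Task: Study the safety of compressed models through a mechanistic analysis of refusal behavior, and propose a lightweight method to improve their safety.

Motivation: Compressing large language models can degrade safety, and recent findings in mechanistic interpretability offer a new lens for analyzing model safety.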

Details

Method: Analyzes the mechanisms of refusal from an interpretability-driven perspective and proposes a lightweight, computationally efficient method. Result: Presents a method that improves the safety of compressed models without compromising performance or utility. Conclusion: Mechanistic interpretability analysis can effectively improve the safety of compressed models while preserving their performance. Abstract: The rapid growth of large language models has spurred significant interest in model compression as a means to enhance their accessibility and practicality. While extensive research has explored model compression through the lens of safety, findings suggest that safety-aligned models often lose elements of trustworthiness post-compression. Simultaneously, the field of mechanistic interpretability has gained traction, with notable discoveries, such as the identification of a single direction in the residual stream mediating refusal behaviors across diverse model architectures. In this work, we investigate the safety of compressed models by examining the mechanisms of refusal, adopting a novel interpretability-driven perspective to evaluate model safety. Furthermore, leveraging insights from our interpretability analysis, we propose a lightweight, computationally efficient method to enhance the safety of compressed models without compromising their performance or utility.
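
The "single direction in the residual stream" mentioned in the abstract is often estimated as a difference of mean activations between two prompt sets. The sketch below shows that estimate and how a direction can be projected out of an activation; it uses tiny hand-made vectors and is not the method proposed in this paper, whose details the abstract does not give. All names and values are assumptions.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def projection(activation, direction):
    """Scalar component of an activation along a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))


def ablate(activation, direction):
    """Remove the component of an activation along a unit direction."""
    coeff = projection(activation, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]


# Hypothetical 2-dimensional residual-stream activations.
harmful = [[2.0, 1.0], [2.0, -1.0]]
harmless = [[0.0, 1.0], [0.0, -1.0]]
direction = refusal_direction(harmful, harmless)
ablated = ablate([3.0, 4.0], direction)
```

One way such a direction could be used diagnostically is to compare `projection` values for the same prompts before and after compression; a shrinking projection on harmful prompts would suggest the refusal signal has weakened.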

DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning

Xiao-Hui Li,Fei Yin,Cheng-Lin Liu

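Task: Propose DocSAM, a Transformer-based unified framework for a variety of document image segmentation tasks.

Motivation: Address the limited generalization and wasted resources that result from existing methods handling these tasks separately.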

Details

Method: Uses Sentence-BERT to map category names to semantic queries, which interact with image features through attention to predict instance and semantic segmentation masks. Result: DocSAM outperforms existing methods in accuracy, efficiency, and adaptability. Conclusion: By training jointly on heterogeneous datasets, DocSAM improves robustness and generalization while reducing computational and storage costs. Abstract: Document image segmentation is crucial for document analysis and recognition but remains challenging due to the diversity of document formats and segmentation tasks. Existing methods often address these tasks separately, resulting in limited generalization and resource wastage. This paper introduces DocSAM, a transformer-based unified framework designed for various document image segmentation tasks, such as document layout analysis, multi-granularity text segmentation, and table structure recognition, by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM employs Sentence-BERT to map category names from each dataset into semantic queries that match the dimensionality of instance queries. These two sets of queries interact through an attention mechanism and are cross-attended with image features to predict instance and semantic segmentation masks. Instance categories are predicted by computing the dot product between instance and semantic queries, followed by softmax normalization of scores. Consequently, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computational and storage resources. Comprehensive evaluations show that DocSAM surpasses existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation across various applications. Codes are available at https://github.com/xhli-git/DocSAM.
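
The classification step described in the abstract (dot product between instance and semantic queries, then softmax) can be sketched in a few lines. This is a minimal illustration and not the DocSAM implementation: the function names, the 2-dimensional toy vectors, and the three example categories are assumptions made here for readability.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_instances(instance_queries, semantic_queries):
    """For each instance query, score every semantic (category) query by
    dot product and normalize the scores with softmax."""
    probs = []
    for q in instance_queries:
        scores = [sum(a * b for a, b in zip(q, s)) for s in semantic_queries]
        probs.append(softmax(scores))
    return probs

# Hypothetical toy values: two instance queries, three category queries
# (think "text", "table", "figure"), all in the same 2-d embedding space.
instance_queries = [[1.0, 0.0], [0.0, 2.0]]
semantic_queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

probs = classify_instances(instance_queries, semantic_queries)
predicted = [max(range(len(p)), key=p.__getitem__) for p in probs]
```

Because the category set enters only through the semantic queries, changing datasets means changing that list rather than replacing a fixed-size classification head, which is plausibly what allows joint training on heterogeneous label sets.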

A Perplexity and Menger Curvature-Based Approach for Similarity Evaluation of Large Language Models

Yuantao Zhang,Zhankui Yang

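Task: Propose a new metric for quantifying the similarity of large language models (LLMs), to address model copying and ownership issues.

Motivation: The rise of LLMs has raised ethical concerns about data usage and model ownership, for example falsely claiming to have developed a new model after making slight modifications to an existing one.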

Details

Method: Quantifies model similarity using perplexity curves and differences in Menger curvature. Result: Experiments show the method outperforms baselines, generalizes across models and domains, and effectively detects model replication. Conclusion: The method helps protect the originality and integrity of LLMs; the code is open source. Abstract: The rise of Large Language Models (LLMs) has brought about concerns regarding copyright infringement and unethical practices in data and model usage. For instance, slight modifications to existing LLMs may be used to falsely claim the development of new models, leading to issues of model copying and violations of ownership rights. This paper addresses these challenges by introducing a novel metric for quantifying LLM similarity, which leverages perplexity curves and differences in Menger curvature. Comprehensive experiments validate the performance of our methodology, demonstrating its superiority over baseline methods and its ability to generalize across diverse models and domains. Furthermore, we highlight the capability of our approach in detecting model replication through simulations, emphasizing its potential to preserve the originality and integrity of LLMs. Code is available at https://github.com/zyttt-coder/LLM_similarity.
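
For readers unfamiliar with Menger curvature: for three points it is the reciprocal of the radius of the circle through them, computed as four times the triangle area divided by the product of the side lengths. The sketch below computes it along a sampled curve. The final aggregation (`mean_curvature_gap`, a mean absolute difference) and the unit spacing on the x-axis are assumptions for illustration only; the abstract does not specify how the paper combines curvature differences.

```python
import math

def menger_curvature(p, q, r):
    """Menger curvature of three 2-d points: 4 * area / (|pq| * |qr| * |pr|)."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    cross = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)  # = 2 * signed area
    sides = math.dist(p, q) * math.dist(q, r) * math.dist(p, r)
    if sides == 0:
        return 0.0  # degenerate triple (repeated point)
    return 2.0 * abs(cross) / sides

def curvature_profile(ys):
    """Curvature at each interior sample of a curve y[0..n-1], with x = index
    (unit spacing is an assumption of this sketch)."""
    pts = list(enumerate(ys))
    return [menger_curvature(pts[i - 1], pts[i], pts[i + 1])
            for i in range(1, len(pts) - 1)]

def mean_curvature_gap(ys_a, ys_b):
    """Hypothetical similarity score: mean absolute curvature difference
    between two equally sampled perplexity curves (lower = more similar)."""
    if len(ys_a) != len(ys_b) or len(ys_a) < 3:
        raise ValueError("curves must have equal length of at least 3")
    ka, kb = curvature_profile(ys_a), curvature_profile(ys_b)
    return sum(abs(a - b) for a, b in zip(ka, kb)) / len(ka)
```

Three points on the unit circle give curvature 1 and collinear points give 0, which is a quick sanity check on the formula.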

TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection

Chunzhao Xie,Tongxuan Liu,Lei Jiang,Yuting Zeng,Jinrong Guo,Yunheng Shen,Weizhe Huang,Jing Li,Xiaohua Xu

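Task: Study the correlation between attention decay and hallucination in vision-language models, and propose TARAC, a training-free method that dynamically accumulates attention in real time to mitigate hallucinations.

Motivation: Hallucinations limit the practical use of large vision-language models, and existing methods do not fully address hallucinations caused by attention decay.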

Details

Method: Proposes Temporal Attention Real-time Accumulative Connection (TARAC), which dynamically accumulates and updates the model's attention on image tokens. Result: TARAC substantially reduces hallucinations; on the CHAIR benchmark it lowers $C_S$ by 25.2 and $C_I$ by 8.7 compared to VCD. Conclusion: By strengthening attention on image tokens, TARAC effectively mitigates hallucinations caused by attention decay and shows potential for practical use. Abstract: Large Vision-Language Models have demonstrated remarkable performance across various tasks; however, the challenge of hallucinations constrains their practical applications. The hallucination problem arises from multiple factors, including the inherent hallucinations in language models, the limitations of visual encoders in perception, and biases introduced by multimodal data. Extensive research has explored ways to mitigate hallucinations. For instance, OPERA prevents the model from overly focusing on "anchor tokens", thereby reducing hallucinations, whereas VCD mitigates hallucinations by employing a contrastive decoding approach. In this paper, we investigate the correlation between the decay of attention to image tokens and the occurrence of hallucinations. Based on this finding, we propose Temporal Attention Real-time Accumulative Connection (TARAC), a novel training-free method that dynamically accumulates and updates LVLMs' attention on image tokens during generation. By enhancing the model's attention to image tokens, TARAC mitigates hallucinations caused by the decay of attention on image tokens. We validate the effectiveness of TARAC across multiple models and datasets, demonstrating that our approach substantially mitigates hallucinations. In particular, TARAC reduces $C_S$ by 25.2 and $C_I$ by 8.7 compared to VCD on the CHAIR benchmark.
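
The $C_S$ and $C_I$ figures are the sentence-level and instance-level CHAIR scores. As commonly defined, $C_I$ is the fraction of mentioned objects that are not in the image and $C_S$ is the fraction of captions containing at least one such object; results are usually reported as percentages. The sketch below assumes object mentions have already been extracted into sets, which the real benchmark does with its own object vocabulary and matching rules; the function name and toy data are illustrative only.

```python
def chair_scores(captions, ground_truth):
    """captions: list of sets of objects mentioned in each generated caption.
    ground_truth: list of sets of objects actually present in each image.
    Returns (C_S, C_I) as fractions in [0, 1]."""
    mentioned = hallucinated = bad_captions = 0
    for objs, truth in zip(captions, ground_truth):
        wrong = objs - truth            # mentioned but not in the image
        mentioned += len(objs)
        hallucinated += len(wrong)
        bad_captions += bool(wrong)     # caption has >= 1 hallucinated object
    c_i = hallucinated / mentioned if mentioned else 0.0
    c_s = bad_captions / len(captions) if captions else 0.0
    return c_s, c_i

# Toy example: the first caption hallucinates a "ball".
c_s, c_i = chair_scores(
    captions=[{"dog", "ball"}, {"cat"}],
    ground_truth=[{"dog"}, {"cat"}],
)
```

Lower is better for both scores, so the reported reductions of 25.2 and 8.7 relative to VCD are improvements.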

Sensitivity Meets Sparsity: The Impact of Extremely Sparse Parameter Patterns on Theory-of-Mind of Large Language Models

Yuheng Wu,Wentao Guo,Zirui Liu,Heng Ji,Zhaozhuo Xu,Denghui Zhang

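Task: Study the mechanistic emergence of Theory-of-Mind (ToM) capabilities in large language models (LLMs).

Motivation: Explore the role of extremely sparse parameter patterns in ToM capability, to deepen understanding of LLMs' social reasoning and connect AI interpretability with cognitive science.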

Details

Method: Proposes a new method to identify ToM-sensitive parameters and analyzes their interaction with core architectural components of LLMs. Result: Perturbing a tiny fraction (0.001%) of ToM-sensitive parameters significantly degrades ToM performance and impairs contextual localization and language understanding; these parameters are closely tied to positional encoding modules such as RoPE. Conclusion: The study sheds light on how LLMs acquire social reasoning abilities, with implications for model alignment, bias mitigation, and AI systems designed for human interaction. Abstract: This paper investigates the emergence of Theory-of-Mind (ToM) capabilities in large language models (LLMs) from a mechanistic perspective, focusing on the role of extremely sparse parameter patterns. We introduce a novel method to identify ToM-sensitive parameters and reveal that perturbing as little as 0.001% of these parameters significantly degrades ToM performance while also impairing contextual localization and language understanding. To understand this effect, we analyze their interaction with core architectural components of LLMs. Our findings demonstrate that these sensitive parameters are closely linked to the positional encoding module, particularly in models using Rotary Position Embedding (RoPE), where perturbations disrupt dominant-frequency activations critical for contextual processing. Furthermore, we show that perturbing ToM-sensitive parameters affects the LLM's attention mechanism by modulating the angle between queries and keys under positional encoding. These insights provide a deeper understanding of how LLMs acquire social reasoning abilities, bridging AI interpretability with cognitive science. Our results have implications for enhancing model alignment, mitigating biases, and improving AI systems designed for human interaction.
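
The remark about the angle between queries and keys is easier to follow with the RoPE mechanism in view. The sketch below shows standard RoPE on a single 2-d feature pair: the query at position m and the key at position n are each rotated in proportion to their position, so their dot product depends only on the offset m - n. It illustrates the mechanism only, not the paper's perturbation analysis; `theta` and the toy vectors are assumed values.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-d vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, m, n, theta=0.1):
    """Attention logit for one 2-d feature pair under RoPE:
    rotate q by m*theta and k by n*theta, then take the dot product."""
    qm = rotate(q, m * theta)
    kn = rotate(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (1.0, 0.0), (1.0, 0.0)
near = rope_score(q, k, m=5, n=3)      # offset 2
shifted = rope_score(q, k, m=12, n=10) # same offset, different absolute positions
far = rope_score(q, k, m=20, n=3)      # offset 17
```

Since the score equals |q||k| times the cosine of the original query-key angle plus (m - n) * theta, any weight perturbation that changes the direction of q or k in this plane shifts that angle and therefore which relative positions the head attends to.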

EMF: Event Meta Formers for Event-based Real-time Traffic Object Detection

Muhammad Ahmed Ullah Khan,Abdul Hannan Khan,Andreas Dengel

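Task: Propose an efficient, high-performing object detection model for event cameras as an alternative to computationally expensive Transformer-based methods.

Motivation: Event cameras offer high temporal resolution and low storage and bandwidth requirements, but their performance has not yet surpassed traditional cameras in critical applications such as autonomous driving; existing Transformer-based methods are computationally costly and fail to fully exploit the properties of event cameras.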

Details

Method: Designs a novel event-based object detection backbone with an Event Progression Extractor module tailored to event data, built on an efficient convolution-based Metaformer architecture. Result: On the Prophesee Gen1 dataset the model improves by 1.6 mAP while cutting inference time by 14%, making it the fastest DNN-based architecture in the domain, with better generalization and data scaling. Conclusion: The proposed EMF model balances efficiency and performance in event-camera object detection, offering a practical solution for real applications. Abstract: Event cameras have higher temporal resolution, and require less storage and bandwidth compared to traditional RGB cameras. However, due to relatively lagging performance of event-based approaches, event cameras have not yet replaced traditional cameras in performance-critical applications like autonomous driving. Recent approaches in event-based object detection try to bridge this gap by employing computationally expensive transformer-based solutions. However, due to their resource-intensive components, these solutions fail to exploit the sparsity and higher temporal resolution of event cameras efficiently. Moreover, these solutions are adopted from the vision domain, lacking specificity to the event cameras. In this work, we explore efficient and performant alternatives to recurrent vision transformer models and propose a novel event-based object detection backbone. The proposed backbone employs a novel Event Progression Extractor module, tailored specifically for event data, and uses the Metaformer concept with convolution-based efficient components. We evaluate the resultant model on well-established traffic object detection benchmarks and conduct cross-dataset evaluation to test its ability to generalize. The proposed model outperforms the state-of-the-art on the Prophesee Gen1 dataset by 1.6 mAP while reducing inference time by 14%. Our proposed EMF becomes the fastest DNN-based architecture in the domain by outperforming the most efficient event-based object detectors. Moreover, the proposed model shows better ability to generalize to unseen data and scales better with the abundance of data.

Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models

Mingyang Wang,Heike Adel,Lukas Lange,Yihong Liu,Ercong Nie,Jannik Strötgen,Hinrich Schütze

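Task: Investigate the causes of cross-lingual inconsistency in multilingual language models (MLMs) and propose a remedy.

Motivation: MLMs give inconsistent responses to semantically equivalent prompts in different languages, but the causes remain unclear.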

Details

Method: Uses mechanistic interpretability methods to analyze cross-lingual inconsistency in MLMs, and proposes a linear shortcut method that bypasses computation in the final layers. Result: MLMs encode knowledge in a language-independent concept space through most layers and transition to language-specific spaces only in the final layers; failures during this language transition cause incorrect predictions in the target language. The linear shortcut improves both prediction accuracy and cross-lingual consistency. Conclusion: The work reveals the internal mechanisms of MLMs and offers a lightweight, effective strategy for more consistent factual outputs. Abstract: Multilingual language models (MLMs) store factual knowledge across languages but often struggle to provide consistent responses to semantically equivalent prompts in different languages. While previous studies point out this cross-lingual inconsistency issue, the underlying causes remain unexplored. In this work, we use mechanistic interpretability methods to investigate cross-lingual inconsistencies in MLMs. We find that MLMs encode knowledge in a language-independent concept space through most layers, and only transition to language-specific spaces in the final layers. Failures during the language transition often result in incorrect predictions in the target language, even when the answers are correct in other languages. To mitigate this inconsistency issue, we propose a linear shortcut method that bypasses computations in the final layers, enhancing both prediction accuracy and cross-lingual consistency. Our findings shed light on the internal mechanisms of MLMs and provide a lightweight, effective strategy for producing more consistent factual outputs.

Multi-identity Human Image Animation with Structural Video Diffusion

Zhenzhi Wang,Yixuan Li,Yanhong Zeng,Yuwei Guo,Dahua Lin,Tianfan Xue,Bo Dai

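Task: Generate high-quality, controllable videos of multiple interacting people from a single image.

Motivation: Existing methods work well for single-person scenes but struggle with the complexity of multi-person interaction, especially associating appearance with pose and modeling 3D dynamics.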

Details

Method: Proposes the Structural Video Diffusion framework, comprising identity-specific embeddings and a structural learning mechanism that uses depth and surface-normal cues to model human-object interactions. Result: Experiments show the method excels at generating lifelike, coherent multi-person interaction videos. Conclusion: Structural Video Diffusion advances human-centric video generation. Abstract: Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose condition and model the distribution of 3D-aware dynamics. To address these limitations, we present Structural Video Diffusion, a novel framework designed for generating realistic multi-human videos. Our approach introduces two core innovations: identity-specific embeddings to maintain consistent appearances across individuals and a structural learning mechanism that incorporates depth and surface-normal cues to model human-object interactions. Additionally, we expand an existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios, providing a robust foundation for training. Experimental results demonstrate that Structural Video Diffusion achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions, advancing the state of human-centric video generation.

negativas: a prototype for searching and classifying sentential negation in speech data

Túlio Sousa de Gois,Paloma Batista Cardoso

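Task: Develop a tool (negativas) to automatically identify three negation structures (NEG1, NEG2, NEG3) in Brazilian Portuguese.

Motivation: Negation structures in Brazilian Portuguese occur with uneven frequency; the rarity of NEG2 and NEG3 in particular makes them hard to study, and existing interpretations are subjective and hard to generalize.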

Details

Method: Builds the tool with natural language processing techniques from an analysis of an annotated dataset (22 interviews), then evaluates its accuracy. Result: The tool identified 3,338 negation instances with a 93% classification success rate, though NEG2 and NEG3 remain difficult to identify. Conclusion: negativas performs well on the high-frequency NEG1 but needs improvement on the low-frequency structures. Abstract: Negation is a universal feature of natural languages. In Brazilian Portuguese, the most commonly used negation particle is não, which can scope over nouns or verbs. When it scopes over a verb, não can occur in three positions: pre-verbal (NEG1), double negation (NEG2), or post-verbal (NEG3), e.g., não gosto, não gosto não, gosto não ("I do not like it"). From a variationist perspective, these structures are different forms of expressing negation. Pragmatically, they serve distinct communicative functions, such as politeness and modal evaluation. Despite their grammatical acceptability, these forms differ in frequency. NEG1 dominates across Brazilian regions, while NEG2 and NEG3 appear more rarely, suggesting their use is contextually restricted. This low frequency challenges research, often resulting in subjective, non-generalizable interpretations of verbal negation with não. To address this, we developed negativas, a tool for automatically identifying NEG1, NEG2, and NEG3 in transcribed data. The tool's development involved four stages: i) analyzing a dataset of 22 interviews from the Falares Sergipanos database, annotated by three linguists, ii) creating a code using natural language processing (NLP) techniques, iii) running the tool, iv) evaluating accuracy. Inter-annotator consistency, measured using Fleiss' Kappa, was moderate (0.57). The tool identified 3,338 instances of não, classifying 2,085 as NEG1, NEG2, or NEG3, achieving a 93% success rate. However, negativas has limitations. NEG1 accounted for 91.5% of identified structures, while NEG2 and NEG3 represented 7.2% and 1.2%, respectively. The tool struggled with NEG2, sometimes misclassifying instances as overlapping structures (NEG1/NEG2/NEG3).
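
The three structures differ only in where the negation particle sits relative to the verb, which a toy positional rule can make concrete. The sketch below is not the negativas implementation, which the abstract does not describe: it assumes the clause is already tokenized and that the verb index is supplied by hand, where a real tool would need a part-of-speech tagger and clause segmentation.

```python
def classify_negation(tokens, verb_index):
    """Classify verbal negation by the position of the particle relative
    to the verb.
    NEG1: particle before the verb only     (não gosto)
    NEG2: particle before and after         (não gosto não)
    NEG3: particle after the verb only      (gosto não)
    Returns None when the particle is absent from the clause."""
    particle = "não"
    before = particle in (t.lower() for t in tokens[:verb_index])
    after = particle in (t.lower() for t in tokens[verb_index + 1:])
    if before and after:
        return "NEG2"
    if before:
        return "NEG1"
    if after:
        return "NEG3"
    return None

# The three examples from the abstract, with the verb index given by hand.
examples = {
    "NEG1": classify_negation(["não", "gosto"], verb_index=1),
    "NEG2": classify_negation(["não", "gosto", "não"], verb_index=1),
    "NEG3": classify_negation(["gosto", "não"], verb_index=0),
}
```

A rule this simple breaks down when a clause boundary is misplaced, since a trailing particle from one clause can be read as a leading particle of the next, which may be related to the NEG2 confusions the abstract reports.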

Scaling Federated Learning Solutions with Kubernetes for Synthesizing Histopathology Images

Andrei-Alexandru Preda,Iulian-Marius Tăiatu,Dumitru-Clementin Cercel

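Task: Combine vision Transformers with generative adversarial networks to generate colorectal cancer histopathology images, and validate the approach with federated learning in a multi-node Kubernetes environment.

Motivation: Address data scarcity and privacy concerns for histopathology images while improving classification accuracy.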

Details

Method: Generates images with vision Transformers and generative adversarial networks, and trains the model in a distributed setting via federated learning. Result: The generated images are of high quality, classification accuracy improves, and the approach performs well in the federated learning setting. Conclusion: The method effectively addresses data scarcity and privacy concerns while improving model performance. Abstract: In the field of deep learning, large architectures often obtain the best performance for many tasks, but also require massive datasets. In the histological domain, tissue images are expensive to obtain and constitute sensitive medical information, raising concerns about data scarcity and privacy. Vision Transformers are state-of-the-art computer vision models that have proven helpful in many tasks, including image classification. In this work, we combine vision Transformers with generative adversarial networks to generate histopathological images related to colorectal cancer and test their quality by augmenting a training dataset, leading to improved classification accuracy. Then, we replicate this performance using the federated learning technique and a realistic Kubernetes setup with multiple nodes, simulating a scenario where the training dataset is split among several hospitals unable to share their information directly due to privacy concerns.

Could AI Trace and Explain the Origins of AI-Generated Images and Text?

Hongchao Fang,Can Qin,Ran Xu,Feng Liu,Yixin Liu,Lichao Sun,Dongwon Lee,Lifu Huang,Wenpeng Yin

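Task: Systematically compare and detect the sources and attributes of AI-generated content (images and text), filling a gap in existing research.

Motivation: The spread of AI-generated content in the real world raises ethical and societal concerns, yet there is no systematic study across dimensions such as fully vs. partially AI-generated content and general vs. malicious use cases, nor of whether AI systems can explain the origin of forged content.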

Details

Method: Introduces the AI-FAKER dataset, with over 280,000 samples spanning multiple LLMs and LMMs and covering both general and malicious use cases. Result: Experiments find that (i) AI authorship detection depends not only on the generated content but also on the model's original training intent; and (ii) GPT-4o gives consistent but less specific explanations for content generated by OpenAI's own models. Conclusion: AI-FAKER fills a research gap, reveals key factors in detecting AI-generated content, and suggests directions for future research. Abstract: AI-generated content is becoming increasingly prevalent in the real world, leading to serious ethical and societal concerns. For instance, adversaries might exploit large multimodal models (LMMs) to create images that violate ethical or legal standards, while paper reviewers may misuse large language models (LLMs) to generate reviews without genuine intellectual effort. While prior work has explored detecting AI-generated images and texts, and occasionally tracing their source models, there is a lack of a systematic and fine-grained comparative study. Important dimensions (such as AI-generated images vs. text, fully vs. partially AI-generated images, and general vs. malicious use cases) remain underexplored. Furthermore, whether AI systems like GPT-4o can explain why certain forged content is attributed to specific generative models is still an open question, with no existing benchmark addressing this. To fill this gap, we introduce AI-FAKER, a comprehensive multimodal dataset with over 280,000 samples spanning multiple LLMs and LMMs, covering both general and malicious use cases for AI-generated images and texts. Our experiments reveal two key findings: (i) AI authorship detection depends not only on the generated output but also on the model's original training intent; and (ii) GPT-4o provides highly consistent but less specific explanations when analyzing content produced by OpenAI's own models, such as DALL-E and GPT-4o itself.

CoMBO: Conflict Mitigation via Branched Optimization for Class Incremental Segmentation

Kai Fang,Anqi Zhang,Guangyu Gao,Jianbo Jiao,Chi Harold Liu,Yunchao Wei

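Task: Resolve the conflict between catastrophic forgetting and learning new classes in Class Incremental Segmentation (CIS).

Motivation: In CIS, the conflict between catastrophic forgetting and new-class learning makes performance on old and new classes hard to balance, calling for a method that resolves it.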

Details

Method: Proposes CoMBO, comprising a Query Conflict Reduction module, a Half-Learning Half-Distillation (HDHL) strategy, and Importance-Based Knowledge Distillation (IKD). Result: CoMBO shows superior performance in class-incremental panoptic and semantic segmentation experiments. Conclusion: Through branched optimization and conflict mitigation, CoMBO balances the learning of old and new classes, addressing the core problem of CIS. Abstract: Effective Class Incremental Segmentation (CIS) requires simultaneously mitigating catastrophic forgetting and ensuring sufficient plasticity to integrate new classes. The inherent conflict above often leads to a back-and-forth, which turns the objective into finding the balance between the performance of previous (old) and incremental (new) classes. To address this conflict, we introduce a novel approach, Conflict Mitigation via Branched Optimization (CoMBO). Within this approach, we present the Query Conflict Reduction module, designed to explicitly refine queries for new classes through lightweight, class-specific adapters. This module provides an additional branch for the acquisition of new classes while preserving the original queries for distillation. Moreover, we develop two strategies to further mitigate the conflict following the branched structure, i.e., the Half-Learning Half-Distillation (HDHL) over classification probabilities, and the Importance-Based Knowledge Distillation (IKD) over query features. HDHL selectively engages in learning for classification probabilities of queries that match the ground truth of new classes, while aligning unmatched ones to the corresponding old probabilities, thus ensuring retention of old knowledge while absorbing new classes via learning negative samples. Meanwhile, IKD assesses the importance of queries based on their matching degree to old classes, prioritizing the distillation of important features and allowing less critical features to evolve. Extensive experiments in Class Incremental Panoptic and Semantic Segmentation settings have demonstrated the superior performance of CoMBO. Project page: https://guangyu-ryan.github.io/CoMBO.
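
The HDHL idea (learn on queries matched to new-class ground truth, distill on the rest) can be sketched as a per-query switch between two loss terms. The abstract does not give the exact losses, so the choices below are assumptions: cross-entropy for matched queries, KL divergence from the old model's probabilities for unmatched ones, and a plain average over queries. The function names are hypothetical.

```python
import math

EPS = 1e-12  # guards log(0); an implementation detail of this sketch

def cross_entropy(probs, target):
    """Negative log-probability of the target class."""
    return -math.log(max(probs[target], EPS))

def kl_divergence(p_old, p_new):
    """KL(p_old || p_new): penalizes the new model for drifting from the old."""
    return sum(po * math.log(po / max(pn, EPS))
               for po, pn in zip(p_old, p_new) if po > 0)

def hdhl_loss(new_probs, old_probs, matched_targets):
    """new_probs / old_probs: per-query class distributions from the current
    and previous model. matched_targets[i] is the new-class index when query
    i is matched to ground truth, or None when it is unmatched."""
    total = 0.0
    for pn, po, target in zip(new_probs, old_probs, matched_targets):
        if target is not None:
            total += cross_entropy(pn, target)   # "half learning"
        else:
            total += kl_divergence(po, pn)       # "half distillation"
    return total / len(new_probs)

# Query 0 is matched to class 0; query 1 is unmatched and already agrees
# with the old model, so it contributes no distillation penalty.
loss = hdhl_loss(
    new_probs=[[0.5, 0.5], [0.2, 0.8]],
    old_probs=[[0.9, 0.1], [0.2, 0.8]],
    matched_targets=[0, None],
)
```

The point of the switch is that each query receives exactly one of the two signals, so the learning and distillation objectives never pull on the same query's probabilities at once.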

Cross-Asset Risk Management: Integrating LLMs for Real-Time Monitoring of Equity, Fixed Income, and Currency Markets

Jie Yang,Yiqiu Tang,Yongjie Li,Lihua Zhang,Haoran Zhang

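Task: Propose an LLM-based cross-asset risk management framework for real-time monitoring of equity, fixed income, and currency markets.

Motivation: Leverage the capabilities of LLMs to integrate diverse data sources, making risk management more dynamic and decision-making more efficient.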

Details

Method: Uses LLMs to analyze financial texts, news, and market reports, combined with real-time data integration and advanced analytics, to assess risk dynamically. Result: Validated through backtesting and real-time simulation, the framework predicts market shifts more accurately than conventional methods. Conclusion: Through real-time data integration and advanced use of LLMs, the framework strengthens financial institutions' risk management and promotes financial stability. Abstract: Large language models (LLMs) have emerged as powerful tools in the field of finance, particularly for risk management across different asset classes. In this work, we introduce a Cross-Asset Risk Management framework that utilizes LLMs to facilitate real-time monitoring of equity, fixed income, and currency markets. This innovative approach enables dynamic risk assessment by aggregating diverse data sources, ultimately enhancing decision-making processes. Our model effectively synthesizes and analyzes market signals to identify potential risks and opportunities while providing a holistic view of asset classes. By employing advanced analytics, we leverage LLMs to interpret financial texts, news articles, and market reports, ensuring that risks are contextualized within broader market narratives. Extensive backtesting and real-time simulations validate the framework, showing increased accuracy in predicting market shifts compared to conventional methods. The focus on real-time data integration enhances responsiveness, allowing financial institutions to manage risks adeptly under varying market conditions and promoting financial stability through the advanced application of LLMs in risk analysis.

JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration

Yunlong Lin,Zixu Lin,Haoyu Chen,Panwang Pan,Chenxin Li,Sixiang Chen,Yeying Jin,Wenbo Li,Xinghao Ding

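Task: Develop JarvisIR, a visual perception system that handles image degradation under complex weather conditions.

Motivation: Existing visual perception systems perform poorly under complex weather, depending on specific degradation priors or suffering from significant domain gaps; a more robust and autonomous solution is needed.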

Details

Method: Proposes JarvisIR, which uses a VLM as a controller to manage multiple expert restoration models, and strengthens robustness and generalization through a two-stage framework (supervised fine-tuning and human feedback alignment). Result: JarvisIR improves the average of perception metrics on CleanBench-Real by 50% and shows superior decision-making and restoration capabilities. Conclusion: By combining a VLM with expert models, JarvisIR markedly improves visual perception under complex weather, offering an effective solution for real-world use. Abstract: Vision-centric perception systems struggle with unpredictable and coupled weather degradations in the wild. Current solutions are often limited, as they either depend on specific degradation priors or suffer from significant domain gaps. To enable robust and autonomous operation in real-world conditions, we propose JarvisIR, a VLM-powered agent that leverages the VLM as a controller to manage multiple expert restoration models. To further enhance system robustness, reduce hallucinations, and improve generalizability in real-world adverse weather, JarvisIR employs a novel two-stage framework consisting of supervised fine-tuning and human feedback alignment. Specifically, to address the lack of paired data in real-world scenarios, the human feedback alignment enables the VLM to be fine-tuned effectively on large-scale real-world data in an unsupervised manner. To support the training and evaluation of JarvisIR, we introduce CleanBench, a comprehensive dataset consisting of high-quality and large-scale instruction-response pairs, including 150K synthetic entries and 80K real entries. Extensive experiments demonstrate that JarvisIR exhibits superior decision-making and restoration capabilities. Compared with existing methods, it achieves a 50% improvement in the average of all perception metrics on CleanBench-Real. Project page: https://cvpr2025-jarvisir.github.io/.

Dynamic Hedging Strategies in Derivatives Markets with LLM-Driven Sentiment and News Analytics

Jie Yang,Yiqiu Tang,Yongjie Li,Lihua Zhang,Haoran Zhang

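Task: Propose a new framework that uses large language models (LLMs) for sentiment and news analysis to inform hedging decisions.

Motivation: Dynamic hedging strategies are essential to risk management in derivatives markets, where market sentiment and volatility strongly affect performance.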

Details

Method: Analyzes textual data such as news articles, social media, and financial reports to capture key indicators of market sentiment and adjust hedging strategies in real time. Result: Backtests on historical data show the dynamic hedging strategies achieve better risk-adjusted returns than conventional static approaches. Conclusion: Bringing LLM-driven sentiment analysis into hedging practice significantly improves decision-making in derivatives trading and effectively reduces risk. Abstract: Dynamic hedging strategies are essential for effective risk management in derivatives markets, where volatility and market sentiment can greatly impact performance. This paper introduces a novel framework that leverages large language models (LLMs) for sentiment analysis and news analytics to inform hedging decisions. By analyzing textual data from diverse sources like news articles, social media, and financial reports, our approach captures critical sentiment indicators that reflect current market conditions. The framework allows for real-time adjustments to hedging strategies, adapting positions based on continuous sentiment signals. Backtesting results on historical derivatives data reveal that our dynamic hedging strategies achieve superior risk-adjusted returns compared to conventional static approaches. The incorporation of LLM-driven sentiment analysis into hedging practices presents a significant advancement in decision-making processes within derivatives trading. This research showcases how sentiment-informed dynamic hedging can enhance portfolio management and effectively mitigate associated risks.

SDEIT: Semantic-Driven Electrical Impedance Tomography

Dong Liu,Yuanchao Wu,Bowen Tong,Jiansong Deng

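Task: Propose SDEIT, a new framework that combines Stable Diffusion 3.5 with EIT, using natural language prompts as semantic priors to guide reconstruction.

Motivation: Address the challenge of designing effective regularization and integrating prior information in EIT, particularly given the complexity and variability of anatomical structures.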

Details

Method: Couples an implicit neural representation (INR) network with a plug-and-play optimization scheme, using SD-generated images as generative priors. Result: On simulated and experimental data, SDEIT outperforms state-of-the-art techniques with higher accuracy and robustness. Conclusion: SDEIT opens a new pathway for integrating multimodal priors into ill-posed inverse problems such as EIT. Abstract: Regularization methods using prior knowledge are essential in solving ill-posed inverse problems such as Electrical Impedance Tomography (EIT). However, designing effective regularization and integrating prior information into EIT remains challenging due to the complexity and variability of anatomical structures. In this work, we introduce SDEIT, a novel semantic-driven framework that integrates Stable Diffusion 3.5 into EIT, marking the first use of large-scale text-to-image generation models in EIT. SDEIT employs natural language prompts as semantic priors to guide the reconstruction process. By coupling an implicit neural representation (INR) network with a plug-and-play optimization scheme that leverages SD-generated images as generative priors, SDEIT improves structural consistency and recovers fine details. Importantly, this method does not rely on paired training datasets, increasing its adaptability to varied EIT scenarios. Extensive experiments on both simulated and experimental data demonstrate that SDEIT outperforms state-of-the-art techniques, offering superior accuracy and robustness. This work opens a new pathway for integrating multimodal priors into ill-posed inverse problems like EIT.

CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization

Weiwei Sun,Shengyu Feng,Shanda Li,Yiming Yang

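Task: Introduce and evaluate CO-Bench, a benchmark suite of 36 real-world combinatorial optimization problems for systematically studying the potential of LLM-based agents in combinatorial optimization.

Motivation: There is no comprehensive benchmark for systematically studying LLM-based agents on structured, constraint-intensive problems; this work fills that gap.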

Details

Method: Develops the CO-Bench suite of 36 real-world combinatorial optimization problems and evaluates multiple agent frameworks against human-designed algorithms. Result: Reveals the strengths and limitations of current approaches and identifies promising directions for future research. Conclusion: CO-Bench provides an important tool for systematically studying LLM-based agents in combinatorial optimization and supports further progress in the field. Abstract: Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap underscores the need for a deeper understanding of their potential in tackling structured, constraint-intensive problems, a pursuit currently limited by the absence of comprehensive benchmarks for systematic investigation. To address this, we introduce CO-Bench, a benchmark suite featuring 36 real-world CO problems drawn from a broad range of domains and complexity levels. CO-Bench includes structured problem formulations and curated data to support rigorous investigation of LLM agents. We evaluate multiple agent frameworks against established human-designed algorithms, revealing key strengths and limitations of current approaches and identifying promising directions for future research. CO-Bench is publicly available at https://github.com/sunnweiwei/CO-Bench.

Interpretable Single-View 3D Gaussian Splatting using Unsupervised Hierarchical Disentangled Representation Learning

Yuyang Zhang,Baao Xie,Hu Zhu,Qi Wang,Huanting Guo,Xin Jin,Wenjun Zeng

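Task: Propose an interpretable single-view 3D Gaussian Splatting framework (3DisGS) that discovers coarse- and fine-grained 3D semantics via hierarchical disentangled representation learning (DRL).

Motivation: Existing 3D Gaussian Splatting methods struggle to capture underlying 3D semantics, which limits model controllability and interpretability.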

Details

Method: Uses a dual-branch architecture (a point cloud initialization branch and a triplane-Gaussian generation branch) for coarse-grained disentanglement, and DRL-based encoder-adapters to further discover fine-grained semantic representations. Result: The model achieves 3D disentanglement while preserving high-quality, rapid reconstruction. Conclusion: This is the first work to achieve unsupervised interpretable 3D Gaussian Splatting. Abstract: Gaussian Splatting (GS) has recently marked a significant advancement in 3D reconstruction, delivering both rapid rendering and high-quality results. However, existing 3DGS methods pose challenges in understanding underlying 3D semantics, which hinders model controllability and interpretability. To address this, we propose an interpretable single-view 3DGS framework, termed 3DisGS, to discover both coarse- and fine-grained 3D semantics via hierarchical disentangled representation learning (DRL). Specifically, the model employs a dual-branch architecture, consisting of a point cloud initialization branch and a triplane-Gaussian generation branch, to achieve coarse-grained disentanglement by separating 3D geometry and visual appearance features. Subsequently, fine-grained semantic representations within each modality are further discovered through DRL-based encoder-adapters. To our knowledge, this is the first work to achieve unsupervised interpretable 3DGS. Evaluations indicate that our model achieves 3D disentanglement while preserving high-quality and rapid reconstruction.

Balancing Complexity and Informativeness in LLM-Based Clustering: Finding the Goldilocks Zone

Justin Miller,Tristram Alexander

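Task: Study the balance between informativeness and interpretability in clustering short text data.

Motivation: Traditional evaluation metrics overlook the trade-off between informativeness and interpretability; inspired by linguistic principles of communicative efficiency, the work explores the optimal number of clusters.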

Details

Method: Uses large language models (LLMs) to generate cluster names and evaluates them through semantic density, information theory, and clustering accuracy. Result: Gaussian Mixture Model (GMM) clustering on LLM-generated embeddings increases semantic density, but interpretability declines as the number of clusters grows. Conclusion: Identifies a "Goldilocks zone" of 16-22 clusters that remain distinct yet interpretable, offering guidance for theory and practice. Abstract: The challenge of clustering short text data lies in balancing informativeness with interpretability. Traditional evaluation metrics often overlook this trade-off. Inspired by linguistic principles of communicative efficiency, this paper investigates the optimal number of clusters by quantifying the trade-off between informativeness and cognitive simplicity. We use large language models (LLMs) to generate cluster names and evaluate their effectiveness through semantic density, information theory, and clustering accuracy. Our results show that Gaussian Mixture Model (GMM) clustering on embeddings generated by an LLM increases semantic density compared to random assignment, effectively grouping similar bios. However, as the number of clusters increases, interpretability declines, as measured by a generative LLM's ability to correctly assign bios based on cluster names. A logistic regression analysis confirms that classification accuracy depends on the semantic similarity between bios and their assigned cluster names, as well as their distinction from alternatives. These findings reveal a "Goldilocks zone" where clusters remain distinct yet interpretable. We identify an optimal range of 16-22 clusters, paralleling linguistic efficiency in lexical categorization. These insights inform both theoretical models and practical applications, guiding future research toward optimising cluster interpretability and usefulness.

GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

Jieming Cui,Tengyu Liu,Ziyu Meng,Jiale Yu,Ran Song,Wei Zhang,Yixin Zhu,Siyuan Huang

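Task: Propose GROVE, a generalized reward framework for open-vocabulary physical skill learning.

Motivation: Current reinforcement learning approaches are limited: manually designed rewards do not scale across tasks, and demonstration-based methods struggle to generalize beyond their training distribution.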

Details

Method: Combine Large Language Models (LLMs) and Vision Language Models (VLMs) to generate and refine the reward system through an iterative design process, and develop Pose2CLIP to bridge the domain gap between simulation and natural images. Result: GROVE improves motion naturalness by 22.2% and task completion by 25.7%, while training 8.4x faster than previous methods. Conclusion: GROVE lays a foundation for scalable physical skill learning in simulated environments. Abstract: Learning open-vocabulary physical skills for simulated agents presents a significant challenge in artificial intelligence. Current reinforcement learning approaches face critical limitations: manually designed rewards lack scalability across diverse tasks, while demonstration-based methods struggle to generalize beyond their training distribution. We introduce GROVE, a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations. Our key insight is that Large Language Models (LLMs) and Vision Language Models (VLMs) provide complementary guidance -- LLMs generate precise physical constraints capturing task requirements, while VLMs evaluate motion semantics and naturalness. Through an iterative design process, VLM-based feedback continuously refines LLM-generated constraints, creating a self-improving reward system. To bridge the domain gap between simulation and natural images, we develop Pose2CLIP, a lightweight mapper that efficiently projects agent poses directly into semantic feature space without computationally expensive rendering. Extensive experiments across diverse embodiments and learning paradigms demonstrate GROVE's effectiveness, achieving 22.2% higher motion naturalness and 25.7% better task completion scores while training 8.4x faster than previous methods. These results establish a new foundation for scalable physical skill acquisition in simulated environments.

Constructing the Truth: Text Mining and Linguistic Networks in Public Hearings of Case 03 of the Special Jurisdiction for Peace (JEP)

Juan Sosa,Alejandro Urrego,Cesar Prieto,Emma J. Camargo-Díaz

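Task: Explore, systematize, and visualize the narrative patterns of victims and appearing parties in Case 03 of Colombia's Special Jurisdiction for Peace (JEP), using natural language analysis and semantic co-occurrence models.

Motivation: Reveal the dynamics of victimization, responsibility, and acknowledgment in the "false positives" episodes of the Colombian armed conflict, and provide replicable tools for transitional justice cases.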

Details

Method: Construct skipgram networks and analyze their modularity, identifying thematic clusters that reveal regional and procedural status differences. Result: The identified thematic clusters expose dynamics of victimization, responsibility, and acknowledgment, providing tools for the collective construction of judicial and extrajudicial truth. Conclusion: The computational approach offers an innovative tool for transitional justice cases and supports the pillars of truth, justice, reparation, and non-repetition. Abstract: Case 03 of the Special Jurisdiction for Peace (JEP), focused on the so-called false positives in Colombia, represents one of the most harrowing episodes of the Colombian armed conflict. This article proposes an innovative methodology based on natural language analysis and semantic co-occurrence models to explore, systematize, and visualize narrative patterns present in the public hearings of victims and appearing parties. By constructing skipgram networks and analyzing their modularity, the study identifies thematic clusters that reveal regional and procedural status differences, providing empirical evidence on dynamics of victimization, responsibility, and acknowledgment in this case. This computational approach contributes to the collective construction of both judicial and extrajudicial truth, offering replicable tools for other transitional justice cases. The work is grounded in the pillars of truth, justice, reparation, and non-repetition, proposing a critical and in-depth reading of contested memories.
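As a rough illustration of the first step, a skipgram network can be built by counting word pairs that co-occur within a small window and treating the counts as edge weights. The window size, the tokenization, and the example sentence below are assumptions; the abstract does not specify them, and the modularity analysis that follows is not shown.

```python
from collections import Counter


def skipgram_edges(tokens, max_skip=2):
    """Count unordered word pairs with at most `max_skip` words between them.

    Returns a Counter mapping (word_a, word_b) tuples, sorted alphabetically,
    to co-occurrence counts, i.e. the weighted edges of the network.
    """
    edges = Counter()
    for i, left in enumerate(tokens):
        # j runs over the next (max_skip + 1) positions after i.
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            right = tokens[j]
            if left != right:  # ignore self-loops
                edges[tuple(sorted((left, right)))] += 1
    return edges


# Hypothetical sentence, whitespace-tokenized for simplicity.
tokens = "the victims asked the court for truth and the court listened".split()
edges = skipgram_edges(tokens, max_skip=1)
strongest = edges.most_common(3)
```

Thematic clusters would then come from running a community detection method on this weighted graph, for which a graph library would normally be used.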

The Effects of Grouped Structural Global Pruning of Vision Transformers on Domain Generalisation

Hamza Riaz,Alan F. Smeaton

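Task: Propose a grouped structural pruning method for pre-trained vision transformers to improve model efficiency under limited computational resources.

Motivation: As AI models such as large language models and vision transformers grow in size, deploying them on devices with limited computational resources becomes a challenge, particularly for domain generalisation tasks.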

Details

Method: Use dependency graph analysis to identify and remove redundant groups of neurons, weights, filters, or attention heads within transformers, pruning with a range of selection metrics. Result: On the PACS and Office-Home DG benchmarks, pruned models show significant gains in inference speed and fine-tuning time with minimal loss in accuracy and domain generalisation performance. For example, pruning ViT, BeiT, and DeiT by 50% gives speed-ups of 2.5x, 1.81x, and 2.15x with accuracy drops of only 2.94%, 1.42%, and 1.72%. Conclusion: The method strikes an effective balance between model efficiency and domain generalisation performance, offering a viable route to deployment on resource-constrained devices. Abstract: With the growing sizes of AI models like large language models (LLMs) and vision transformers, deploying them on devices with limited computational resources is a significant challenge, particularly when addressing domain generalisation (DG) tasks. This paper introduces a novel grouped structural pruning method for pre-trained vision transformers (ViT, BeiT, and DeiT), evaluated on the PACS and Office-Home DG benchmarks. Our method uses dependency graph analysis to identify and remove redundant groups of neurons, weights, filters, or attention heads within transformers, using a range of selection metrics. Grouped structural pruning is applied at pruning ratios of 50%, 75% and 95% and the models are then fine-tuned on selected distributions from DG benchmarks to evaluate their overall performance in DG tasks. Results show significant improvements in inference speed and fine-tuning time with minimal trade-offs in accuracy and DG task performance. For instance, on the PACS benchmark, pruning ViT, BeiT, and DeiT models by 50% using the Hessian metric resulted in accuracy drops of only 2.94%, 1.42%, and 1.72%, respectively, while achieving speed boosts of 2.5x, 1.81x, and 2.15x. These findings demonstrate the effectiveness of our approach in balancing model efficiency with domain generalisation performance.
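The idea of global, group-level selection can be sketched as follows. This is not the paper's pipeline: it swaps in a plain L2-norm importance score where the abstract mentions a Hessian metric among others, it skips the dependency graph analysis entirely, and the group names and weights are hypothetical.

```python
import math


def global_group_prune(groups, ratio):
    """Pick the lowest-scoring `ratio` fraction of groups across all layers.

    `groups` maps a group name (e.g. one attention head) to its flat weights.
    The L2 norm used here is a stand-in importance score, chosen only for
    illustration.
    """
    scored = sorted(
        groups.items(),
        key=lambda item: math.sqrt(sum(w * w for w in item[1])),
    )
    n_remove = int(len(scored) * ratio)
    return [name for name, _ in scored[:n_remove]]


# Hypothetical groups from two layers; ranking is global, not per layer.
groups = {
    "layer0.head0": [0.9, -0.8, 0.7],
    "layer0.head1": [0.05, 0.02, -0.01],
    "layer1.head0": [0.3, 0.2, -0.4],
    "layer1.head1": [0.01, -0.03, 0.02],
}
removed = global_group_prune(groups, ratio=0.5)
```

Because the ranking is global, a layer whose groups all score low could lose more than its share, which is one reason structural pruning methods track dependencies between coupled parameters before removing anything.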

IMPersona: Evaluating Individual Level LM Impersonation

Quan Shi,Carlos Jimenez,Stephen Dong,Brian Seo,Caden Yao,Adam Kelch,Karthik Narasimhan

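Task: Evaluate how well language models can impersonate a specific individual's writing style and personal knowledge.

Motivation: As language models approach human-like capabilities in conversational text generation, their ability to simulate specific individuals needs to be assessed in order to examine potential applications and risks.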

Details

Method: Run experiments on open-source models (such as Llama-3.1-8B-Instruct) using supervised fine-tuning and a hierarchical memory retrieval system (the IMPersona framework). Result: In blind conversation experiments, the fine-tuned models were mistaken for human in 44.44% of interactions, well above the prompting-based approach (25.00%). Conclusion: The study exposes the potential applications and risks of personalized language models, proposes detection methods and defense strategies, and calls for attention to privacy, security, and ethics. Abstract: As language models achieve increasingly human-like capabilities in conversational text generation, a critical question emerges: to what extent can these systems simulate the characteristics of specific individuals? To evaluate this, we introduce IMPersona, a framework for evaluating LMs at impersonating specific individuals' writing style and personal knowledge. Using supervised fine-tuning and a hierarchical memory-inspired retrieval system, we demonstrate that even modestly sized open-source models, such as Llama-3.1-8B-Instruct, can achieve impersonation abilities at concerning levels. In blind conversation experiments, participants (mis)identified our fine-tuned models with memory integration as human in 44.44% of interactions, compared to just 25.00% for the best prompting-based approach. We analyze these results to propose detection methods and defense strategies against such impersonation attempts. Our findings raise important questions about both the potential applications and risks of personalized language models, particularly regarding privacy, security, and the ethical deployment of such technologies in real-world contexts.

Evaluating Graphical Perception with Multimodal LLMs

Rami Huu Nguyen,Kenichi Maeda,Mahsa Geshvadi,Daniel Haehn

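Task: Evaluate multimodal large language models (MLLMs) on chart value regression tasks and compare them with human graphical perception.

Motivation: Although MLLMs have made remarkable progress in image analysis, their performance at regressing values in charts remains underexplored.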

Details

Method: Reproduce Cleveland and McGill's seminal 1984 experiment and compare fine-tuned models, pretrained models, and zero-shot prompting against human task performance. Result: MLLMs outperform humans in some cases and fall short in others. Conclusion: The study shows where MLLMs succeed and fail on data visualization tasks, providing a basis for better understanding their capabilities. Abstract: Multimodal Large Language Models (MLLMs) have progressed remarkably in analyzing and understanding images. Despite these advancements, accurately regressing values in charts remains an underexplored area for MLLMs. For visualization, how do MLLMs perform when applied to graphical perception tasks? Our paper investigates this question by reproducing Cleveland and McGill's seminal 1984 experiment and comparing it against human task performance. Our study primarily evaluates fine-tuned and pretrained models and zero-shot prompting to determine if they closely match human graphical perception. Our findings highlight that MLLMs outperform human task performance in some cases but not in others. We highlight the results of all experiments to foster an understanding of where MLLMs succeed and fail when applied to data visualization.

Hallucination Detection using Multi-View Attention Features

Yuya Ogasa,Yuki Arase

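Task: Detect token-level hallucinations in the outputs of large language models.

Motivation: Previous studies found that attention shows irregular patterns when hallucination occurs, so features are extracted from the attention matrix to support detection.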

Details

Method: Extract three kinds of features from the attention matrix (average attention received, diversity of attention received, and diversity of attended tokens) and feed them to a Transformer classifier for token-level classification. Result: The method outperforms baselines on tasks with long input contexts, such as data-to-text and summarization. Conclusion: The proposed method detects token-level hallucinations effectively and performs especially well on long-context tasks. Abstract: This study tackles token-level hallucination detection in outputs of large language models. Previous studies revealed that attention exhibits irregular patterns when hallucination occurs. Inspired by this, we extract features from the attention matrix that provide complementary views of (a) the average attention each token receives, which helps identify whether certain tokens are overly influential or ignored, (b) the diversity of attention each token receives, which reveals whether attention is biased toward specific subsets, and (c) the diversity of tokens a token attends to during generation, which indicates whether the model references a narrow or broad range of information. These features are input to a Transformer-based classifier to conduct token-level classification to identify hallucinated spans. Experimental results indicate that the proposed method outperforms strong baselines on hallucination detection with longer input contexts, i.e., data-to-text and summarization tasks.
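The three views (a), (b), and (c) can be sketched on a single attention matrix. The sketch assumes `attn[i][j]` is the attention that token `i` pays to token `j` and uses Shannon entropy as the diversity measure; the abstract does not state which measure the authors use, so both choices, and all names below, are assumptions.

```python
import math


def entropy(weights):
    """Shannon entropy of non-negative weights after normalizing them to sum to 1."""
    total = sum(weights)
    if total == 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def attention_views(attn):
    """Per-token features from a square attention matrix.

    attn[i][j] is assumed to be the attention token i pays to token j.
    """
    n = len(attn)
    columns = [[attn[i][j] for i in range(n)] for j in range(n)]
    return {
        # (a) average attention each token receives
        "avg_received": [sum(col) / n for col in columns],
        # (b) how evenly that received attention is spread over source tokens
        "received_diversity": [entropy(col) for col in columns],
        # (c) how evenly each token spreads its own attention
        "attended_diversity": [entropy(row) for row in attn],
    }


# Toy causal attention matrix for three tokens (rows sum to 1).
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
]
views = attention_views(attn)
```

In the paper these per-token features are the inputs to a Transformer-based classifier; the sketch stops at feature extraction for one head and one layer.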

Resilience of Vision Transformers for Domain Generalisation in the Presence of Out-of-Distribution Noisy Images

Hamza Riaz,Alan F. Smeaton

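Task: Evaluate the BEIT architecture on synthetic out-of-distribution (OOD) benchmarks to address the challenges vision transformers face in domain generalisation (DG).

Motivation: Modern AI models excel in controlled settings but perform poorly in real-world scenarios where data distributions shift unpredictably; domain generalisation (DG) is the central challenge.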

Details

Method: Generate OOD test cases by strategically masking image regions with grid patterns (25%, 50%, 75% occlusion), using Segment Anything and Grounding DINO for precise object localisation. Result: BEIT maintains 94% accuracy on PACS and 87% on Office-Home, clearly outperforming CNNs and other vision transformers (by margins of up to 37%). Conclusion: The paper proposes a scalable method for generating OOD benchmarks and shows that MIM and self-attention enhance domain generalisation by learning invariant features, offering a blueprint for AI systems that generalise reliably under uncertainty. Abstract: Modern AI models excel in controlled settings but often fail in real-world scenarios where data distributions shift unpredictably - a challenge known as domain generalisation (DG). This paper tackles this limitation by rigorously evaluating vision transformers, specifically the BEIT architecture, which is a model pre-trained with masked image modelling (MIM), against synthetic out-of-distribution (OOD) benchmarks designed to mimic real-world noise and occlusions. We introduce a novel framework to generate OOD test cases by strategically masking object regions in images using grid patterns (25%, 50%, 75% occlusion) and leveraging cutting-edge zero-shot segmentation via Segment Anything and Grounding DINO to ensure precise object localisation. Experiments across three benchmarks (PACS, Office-Home, DomainNet) demonstrate BEIT's known robustness, maintaining 94% accuracy on PACS and 87% on Office-Home despite significant occlusions and outperforming CNNs and other vision transformers by margins of up to 37%. Analysis of self-attention distances reveals that BEIT's dependence on global features correlates with its resilience. Furthermore, our synthetic benchmarks expose critical failure modes: performance degrades sharply when occlusions disrupt object shapes (e.g., a 68% drop for external grid masking vs. 22% for internal masking). This work provides two key advances: (1) a scalable method to generate OOD benchmarks using controllable noise, and (2) empirical evidence that MIM and self-attention mechanisms in vision transformers enhance DG by learning invariant features. These insights bridge the gap between lab-trained models and real-world deployment and offer a blueprint for building AI systems that generalise reliably under uncertainty.
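A grid-pattern occlusion at a chosen ratio can be sketched as below. This is a simplified stand-in: it masks evenly spaced blocks over the whole image, whereas the paper restricts masking to object regions found by Segment Anything and Grounding DINO. The block size, the fill value of 0, and the function name are assumptions.

```python
def grid_mask(image, cell, ratio):
    """Zero out roughly `ratio` of the `cell` x `cell` blocks of a 2-D image.

    Blocks are chosen at evenly spaced positions in scan order so that the
    occlusion forms a regular pattern. The input image is not modified.
    """
    height, width = len(image), len(image[0])
    blocks = [(r, c) for r in range(0, height, cell) for c in range(0, width, cell)]
    n_masked = round(len(blocks) * ratio)
    chosen = {blocks[(k * len(blocks)) // n_masked] for k in range(n_masked)}
    masked = [row[:] for row in image]
    for r0, c0 in chosen:
        for r in range(r0, min(r0 + cell, height)):
            for c in range(c0, min(c0 + cell, width)):
                masked[r][c] = 0
    return masked


# 8x8 image of ones, occluded at the 50% level with 2x2 blocks.
image = [[1] * 8 for _ in range(8)]
occluded = grid_mask(image, cell=2, ratio=0.5)
coverage = sum(v == 0 for row in occluded for v in row) / 64
```

Restricting `blocks` to those that overlap a segmentation mask would approximate internal masking, and restricting them to blocks outside it would approximate external masking.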

Generative Large Language Models Trained for Detecting Errors in Radiology Reports

Cong Sun,Kurt Teichman,Yiliang Zhou,Brian Critelli,David Nauheim,Graham Keir,Xindi Wang,Judy Zhong,Adam E Flanders,George Shih,Yifan Peng

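Task: Construct and evaluate a dataset for detecting errors in radiology reports, and compare the error-detection performance of different models.

Motivation: Improve the detection of errors in radiology reports by combining synthetic and real data, thereby raising report quality.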

Details

Method: Build a dataset of synthetic and real radiology reports; refine models (such as Llama-3, GPT-4, and BiomedBERT) with zero-shot prompting, few-shot prompting, or fine-tuning; and evaluate them with F1 scores, confidence intervals, and radiologist review. Result: Under zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model performed best, with F1 scores of 0.769 for negation errors, 0.772 for left/right errors, 0.750 for interval change errors, 0.828 for transcription errors, and 0.780 overall. Radiologists confirmed errors detected by the model. Conclusion: Fine-tuning generative LLMs on synthetic and real radiology reports substantially improves error detection. Abstract: In this retrospective study, a dataset was constructed with two parts. The first part included 1,656 synthetic chest radiology reports generated by GPT-4 using specified prompts, with 828 being error-free synthetic reports and 828 containing errors. The second part included 614 reports: 307 error-free reports between 2011 and 2016 from the MIMIC-CXR database and 307 corresponding synthetic reports with errors generated by GPT-4 on the basis of these MIMIC-CXR reports and specified prompts. All errors were categorized into four types: negation, left/right, interval change, and transcription errors. Then, several models, including Llama-3, GPT-4, and BiomedBERT, were refined using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Finally, the performance of these models was evaluated using the F1 score, 95% confidence interval (CI) and paired-sample t-tests on our constructed dataset, with the prediction results further assessed by radiologists. Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance with the following F1 scores: 0.769 for negation errors, 0.772 for left/right errors, 0.750 for interval change errors, 0.828 for transcription errors, and 0.780 overall. In the real-world evaluation phase, two radiologists reviewed 200 randomly selected reports output by the model. Of these, 99 were confirmed to contain errors detected by the models by both radiologists, and 163 were confirmed to contain model-detected errors by at least one radiologist. Generative LLMs, fine-tuned on synthetic and MIMIC-CXR radiology reports, greatly enhanced error detection in radiology reports.
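For readers less familiar with the metric, the F1 score reported per error type is the harmonic mean of precision and recall. A minimal sketch on hypothetical binary labels (report contains an error: yes or no) follows; the data are made up and have no relation to the study's results.

```python
def f1_score(gold, predicted):
    """F1 for binary labels, where True means 'report contains an error'."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # errors correctly flagged
    fp = sum(1 for g, p in pairs if not g and p)    # clean reports flagged
    fn = sum(1 for g, p in pairs if g and not p)    # errors missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and model predictions for six reports.
gold = [True, True, True, False, False, True]
predicted = [True, False, True, True, False, True]
score = f1_score(gold, predicted)
```

Here three errors are caught, one is missed, and one clean report is flagged, so precision and recall are both 0.75 and so is F1.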

Progressive Multi-Source Domain Adaptation for Personalized Facial Expression Recognition

Muhammad Osama Zeeshan,Marco Pedersoli,Alessandro Lameiras Koerich,Eric Grange

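Task: Propose a progressive multi-source unsupervised domain adaptation (MSDA) method for personalized facial expression recognition (FER) that addresses the large distribution shift between source and target domains and the resulting negative transfer.

Motivation: Because source and target domains differ substantially in distribution, conventional MSDA methods can cause negative transfer and higher computational cost, so a more effective way to select and integrate the most relevant source information is needed.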

Details

Method: Propose a progressive MSDA approach that gradually introduces information from source domains similar to the target, with a density-based memory mechanism to prevent catastrophic forgetting. Result: Experiments on the Biovid and UNBC-McMaster pain datasets confirm the method's effectiveness. Conclusion: The progressive MSDA approach reduces negative transfer and improves performance on the target domain while lowering computational cost. Abstract: Personalized facial expression recognition (FER) involves adapting a machine learning model using samples from labeled sources and unlabeled target domains. Given the challenges of recognizing subtle expressions with considerable interpersonal variability, state-of-the-art unsupervised domain adaptation (UDA) methods focus on the multi-source UDA (MSDA) setting, where each domain corresponds to a specific subject, and improve model accuracy and robustness. However, when adapting to a specific target, the diverse nature of multiple source domains translates to a large shift between source and target data. State-of-the-art MSDA methods for FER address this domain shift by considering all the sources to adapt to the target representations. Nevertheless, adapting to a target subject presents significant challenges due to large distributional differences between source and target domains, often resulting in negative transfer. In addition, integrating all sources simultaneously increases computational costs and causes misalignment with the target. To address these issues, we propose a progressive MSDA approach that gradually introduces information from subjects based on their similarity to the target subject. This ensures that only the sources most relevant to the target are selected, which helps avoid the negative transfer caused by dissimilar sources. We first exploit the closest sources to reduce the distribution shift with the target and then move towards the furthest while only considering the most relevant sources based on a predetermined threshold. Furthermore, to mitigate catastrophic forgetting caused by the incremental introduction of source subjects, we implement a density-based memory mechanism that preserves the most relevant historical source samples for adaptation. Our experiments show the effectiveness of our proposed method on pain datasets: Biovid and UNBC-McMaster.
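The closest-first ordering with a relevance threshold can be sketched as a simple schedule. The similarity scores, the threshold value, and the subject names are hypothetical, and the abstract does not say how similarity is computed; the density-based memory mechanism is not modeled here.

```python
def progressive_schedule(similarity, threshold):
    """Order source subjects from closest to furthest from the target.

    Subjects whose similarity falls below `threshold` are dropped, which
    mirrors the idea of keeping only the most relevant sources.
    """
    relevant = [(name, sim) for name, sim in similarity.items() if sim >= threshold]
    relevant.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in relevant]


# Hypothetical similarity of each source subject to the target subject.
similarity = {"subj_a": 0.91, "subj_b": 0.42, "subj_c": 0.77, "subj_d": 0.65}
schedule = progressive_schedule(similarity, threshold=0.6)

# Adaptation would proceed in stages, adding one source subject per stage.
stages = [schedule[: k + 1] for k in range(len(schedule))]
```

Each stage would run one round of adaptation on its accumulated sources, so the model sees the easiest shift first and the dissimilar subject `subj_b` never enters training.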

Compression Laws for Large Language Models

Ayan Sengupta,Siddhant Chaudhary,Tanmoy Chakraborty

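Task: Study compression laws for large language models (LLMs) and how compression affects downstream task performance.

Motivation: Explore how model compression affects the downstream performance of pre-trained language models, filling a gap in the current understanding of compression effects.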

Details

Method: Apply structured compression to eight models ranging from 0.5B to 14B parameters in over 1000 experiments, and analyze the relationship between compression ratio and performance. Result: Test cross-entropy loss grows quadratically with the compression ratio, while downstream task performance declines only linearly; recovery fine-tuning improves generation loss by up to 55%, and at high compression ratios (90%) inference is 60% faster. Conclusion: Model compression is especially beneficial for larger models, particularly when no smaller model is available in a resource-constrained setting, and the findings provide practical guidelines for real-world use. Abstract: We introduce compression laws for large language models (LLMs). While recent scaling laws have sought to understand how LLMs scale with respect to model size, pre-training data, and computational resources, we focus on understanding how model compression affects the performance of a pre-trained LLM on downstream tasks. We empirically examine the effects of structured model compression on LLMs through over $1000$ experiments across eight models with sizes ranging from $0.5B$ to $14B$ parameters. Our findings indicate that the test cross-entropy loss increases quadratically with the compression ratio, whereas performance on downstream tasks declines only linearly. Our study emphasizes the importance of recovery fine-tuning in enhancing generation loss, showing that the test loss of compressed LLMs can improve by up to 55% with recovery fine-tuning. At higher compression ratios (up to 90%), compressed LLMs demonstrate a speed increase of 60% during inference compared to their uncompressed counterparts, compensating for the performance degradation at this level. However, for smaller models ($\le 7B$), the computational gains are limited, peaking at just 35%. We conclude that model compression can be highly beneficial for larger models, especially when a smaller model within the same computational budget is not available. These insights provide practical guidelines for utilizing model compression techniques for adopting LLMs in real-life applications in resource-constrained settings.

ADA-Net: Attention-Guided Domain Adaptation Network with Contrastive Learning for Standing Dead Tree Segmentation Using Aerial Imagery

Mete Ahishali,Anis Ur Rahman,Einari Heinaro,Samuli Junttila

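Task: Propose a new domain-adaptation-based method for segmenting standing dead trees in multispectral aerial imagery.

Motivation: Climate change has caused large-scale tree mortality events that can go undetected because data are limited, while information on standing dead trees is essential for understanding forest ecosystem functioning and resilience.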

Details

Method: Propose the Attention-guided Domain Adaptation Network (ADA-Net) with contrastive learning, which translates source-domain images into the target domain and applies transfer learning with a pre-trained segmentation network. Result: ADA-Net outperforms existing approaches in domain adaptation, and the generated dataset (USA2Finland) shows characteristics similar to the target-domain images. Conclusion: ADA-Net offers an effective solution for large-scale forest monitoring, and the software implementation and dataset are publicly released. Abstract: Information on standing dead trees is important for understanding forest ecosystem functioning and resilience but has been lacking over large geographic regions. Climate change has caused large-scale tree mortality events that can remain undetected due to limited data. In this study, we propose a novel method for segmenting standing dead trees using aerial multispectral orthoimages. Because access to annotated datasets has been a significant problem in forest remote sensing due to the need for forest expertise, we introduce a method for domain transfer by leveraging domain adaptation to learn a transformation from a source domain X to target domain Y. In this Image-to-Image translation task, we aim to utilize available annotations in the target domain by pre-training a segmentation network. When images from a new study site without annotations are introduced (source domain X), these images are transformed into the target domain. Then, transfer learning is applied by running inference with the pre-trained network on domain-adapted images. In addition to investigating the feasibility of current domain adaptation approaches for this objective, we propose a novel approach called the Attention-guided Domain Adaptation Network (ADA-Net) with enhanced contrastive learning. Accordingly, the ADA-Net approach provides new state-of-the-art domain adaptation performance levels, outperforming existing approaches. We have evaluated the proposed approach using two datasets from Finland and the US. The USA images are converted to the Finland domain, and we show that the synthetic USA2Finland dataset exhibits similar characteristics to the Finland domain images. The software implementation is shared at https://github.com/meteahishali/ADA-Net. The data is publicly available at https://www.kaggle.com/datasets/meteahishali/aerial-imagery-for-standing-dead-tree-segmentation.

StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation

Shenyang Liu,Yang Gao,Shaoyan Zhai,Liqiang Wang

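Task: Explore a distinctive prompt recovery task focused on reconstructing prompts for style transfer and rephrasing, rather than typical question answering.

Motivation: As large language models (LLMs) become ubiquitous, prompt recovery (reconstructing prompts from LLM outputs) is growing in importance, but most users access models only through APIs without internal weights, relying on outputs and logits alone, which complicates recovery.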

Details

Method: Create a dataset with LLM assistance, using multiple techniques to ensure quality, and test zero-shot, few-shot, jailbreak, chain-of-thought, and fine-tuning methods, along with a novel canonical-prompt fallback for poorly performing cases. Result: One-shot and fine-tuning methods work best, but the results also expose flaws in traditional sentence similarity metrics for evaluating prompt recovery. Conclusion: Contributions include (1) a benchmark dataset, (2) comprehensive experiments on prompt recovery strategies, and (3) identification of limitations in current evaluation metrics, advancing general prompt recovery research where the input prompt structure is unrestricted. Abstract: Prompt Recovery, reconstructing prompts from the outputs of large language models (LLMs), has grown in importance as LLMs become ubiquitous. Most users access LLMs through APIs without internal model weights, relying only on outputs and logits, which complicates recovery. This paper explores a unique prompt recovery task focused on reconstructing prompts for style transfer and rephrasing, rather than typical question-answering. We introduce a dataset created with LLM assistance, ensuring quality through multiple techniques, and test methods like zero-shot, few-shot, jailbreak, chain-of-thought, fine-tuning, and a novel canonical-prompt fallback for poor-performing cases. Our results show that one-shot and fine-tuning yield the best outcomes but highlight flaws in traditional sentence similarity metrics for evaluating prompt recovery. Contributions include (1) a benchmark dataset, (2) comprehensive experiments on prompt recovery strategies, and (3) identification of limitations in current evaluation metrics, all of which advance general prompt recovery research, where the structure of the input prompt is unrestricted.

3R-GS: Best Practice in Optimizing Camera Poses Along with 3DGS

Zhisheng Huang,Peng Wang,Jingdong Zhang,Yuan Liu,Xin Li,Wenping Wang

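Task: Propose 3R-GS, a framework that jointly optimizes 3D Gaussians and camera parameters to improve neural rendering quality and camera pose estimation accuracy.

Motivation: Address the reliance of existing 3D Gaussian Splatting methods on SfM camera poses, in particular their limited robustness and precision in challenging scenes such as textureless ones.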

Details

Method: Jointly optimize 3D Gaussians and camera parameters using the large reconstruction prior MASt3R-SfM, with optimization strategies that overcome limitations in initialization and global optimization. Result: Experiments show that 3R-GS delivers high-quality novel view synthesis and precise camera pose estimation while remaining computationally efficient. Conclusion: Through its joint optimization strategy, 3R-GS markedly improves the robustness of neural rendering and the precision of camera pose estimation. Abstract: 3D Gaussian Splatting (3DGS) has revolutionized neural rendering with its efficiency and quality, but like many novel view synthesis methods, it heavily depends on accurate camera poses from Structure-from-Motion (SfM) systems. Although recent SfM pipelines have made impressive progress, questions remain about how to further improve both their robust performance in challenging conditions (e.g., textureless scenes) and the precision of camera parameter estimation simultaneously. We present 3R-GS, a 3D Gaussian Splatting framework that bridges this gap by jointly optimizing 3D Gaussians and camera parameters from large reconstruction priors MASt3R-SfM. We note that naively performing joint 3D Gaussian and camera optimization faces two challenges: the sensitivity to the quality of SfM initialization, and its limited capacity for global optimization, leading to suboptimal reconstruction results. Our 3R-GS overcomes these issues by incorporating optimized practices, enabling robust scene reconstruction even with imperfect camera registration. Extensive experiments demonstrate that 3R-GS delivers high-quality novel view synthesis and precise camera pose estimation while remaining computationally efficient. Project page: https://zsh523.github.io/3R-GS/

PolyGuard: A Multilingual Safety Moderation Tool for 17 Languages

Priyanshu Kumar,Devansh Jain,Akhila Yerukola,Liwei Jiang,Himanshu Beniwal,Thomas Hartvigsen,Maarten Sap

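Task: Develop POLYGUARD, a multilingual safety model for safeguarding the generations of large language models (LLMs).

Motivation: Current multilingual safety moderation focuses on a small set of languages (such as English and Chinese) and uses a limited definition of safety, leaving significant gaps in moderation capability.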

Details

Method: Release the POLYGUARD model together with its datasets: POLYGUARDMIX (1.91M samples across 17 languages) and POLYGUARDPROMPTS (an evaluation benchmark of 29K samples). Result: POLYGUARD performs strongly across multiple safety and toxicity benchmarks, outperforming the best existing open-weight and commercial safety classifiers by 5.5%. Conclusion: POLYGUARD's contributions advance efforts toward safer multilingual LLMs for users worldwide. Abstract: Truly multilingual safety moderation efforts for Large Language Models (LLMs) have been hindered by a narrow focus on a small set of languages (e.g., English, Chinese) as well as a limited scope of safety definition, resulting in significant gaps in moderation capabilities. To bridge these gaps, we release POLYGUARD, a new state-of-the-art multilingual safety model for safeguarding LLM generations, and the corresponding training and evaluation datasets. POLYGUARD is trained on POLYGUARDMIX, the largest multilingual safety training corpus to date containing 1.91M samples across 17 languages (e.g., Chinese, Czech, English, Hindi). We also introduce POLYGUARDPROMPTS, a high quality multilingual benchmark with 29K samples for the evaluation of safety guardrails. Created by combining naturally occurring multilingual human-LLM interactions and human-verified machine translations of an English-only safety dataset (WildGuardMix; Han et al., 2024), our datasets contain prompt-output pairs with labels of prompt harmfulness, response harmfulness, and response refusal. Through extensive evaluations across multiple safety and toxicity benchmarks, we demonstrate that POLYGUARD outperforms existing state-of-the-art open-weight and commercial safety classifiers by 5.5%. Our contributions advance efforts toward safer multilingual LLMs for all global users.

MedM-VL: What Makes a Good Medical LVLM?

Yiming Shi,Shaoshuai Yang,Xun Zhu,Haoyu Wang,Miao Li,Ji Wu

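Task: Study architectural designs for medical Large Vision-Language Models (LVLMs) based on the LLaVA framework, supporting multimodal tasks on 2D and 3D medical images.

Motivation: Traditional shallow, task-specific models struggle to meet the complexity and scalability required in clinical practice, and the progress of large language models (LLMs) has driven the development of medical LVLMs.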

Details

Method: Design and build two medical LVLMs on the LLaVA framework, targeting 2D and 3D modalities respectively and supporting both general-purpose medical tasks and domain-specific fine-tuning. Result: A modular, extensible codebase, MedM-VL, is developed and two LVLM variants are released: MedM-VL-2D and MedM-VL-CT-Chest. Conclusion: Medical LVLMs offer a unified solution for multimodal vision-language tasks and can serve as effective foundation models for further research and application. Abstract: Medical image analysis is a fundamental component. As deep learning progresses, the focus has shifted from single-task applications, such as classification and segmentation, to more complex multimodal tasks, including medical visual question answering and report generation. Traditional shallow and task-specific models are increasingly limited in addressing the complexity and scalability required in clinical practice. The emergence of large language models (LLMs) has driven the development of medical Large Vision-Language Models (LVLMs), offering a unified solution for diverse vision-language tasks. In this study, we investigate various architectural designs for medical LVLMs based on the widely adopted LLaVA framework, which follows an encoder-connector-LLM paradigm. We construct two distinct models targeting 2D and 3D modalities, respectively. These models are designed to support both general-purpose medical tasks and domain-specific fine-tuning, thereby serving as effective foundation models. To facilitate reproducibility and further research, we develop a modular and extensible codebase, MedM-VL, and release two LVLM variants: MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications. The code and models are available at: https://github.com/MSIIP/MedM-VL

Pre-trained Language Models and Few-shot Learning for Medical Entity Extraction

Xiaokai Wang,Guiran Liu,Binrong Zhu,Jacky He,Hongye Zheng,Hanlu Zhang

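Task: Propose a Transformer-based medical entity extraction method to improve information extraction from medical literature.

Motivation: Given the specialized and complex nature of medical text, the study compares different pre-trained language models on medical entity extraction and explores Few-shot Learning in low-resource scenarios.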

Details

Method: Compare pre-trained language models (BERT, BioBERT, PubMedBERT, ClinicalBERT) and entity extraction methods (CRF, Span-based, Seq2Seq), and study the effect of Few-shot Learning. Result: PubMedBERT performs best on medical entity extraction (F1-score = 88.8%), the Span-based approach is best at identifying entity boundaries (F1-score = 88.6%), and Few-shot Learning still reaches an F1-score of 79.1% under low-resource conditions. Conclusion: Combining pre-trained language models with Few-shot Learning improves the accuracy of medical entity extraction; future work can integrate knowledge graphs and active learning to further improve generalization and stability. Abstract: This study proposes a medical entity extraction method based on Transformer to enhance the information extraction capability of medical literature. Considering the professionalism and complexity of medical texts, we compare the performance of different pre-trained language models (BERT, BioBERT, PubMedBERT, ClinicalBERT) in medical entity extraction tasks. Experimental results show that PubMedBERT achieves the best performance (F1-score = 88.8%), indicating that a language model pre-trained on biomedical literature is more effective in the medical domain. In addition, we analyze the impact of different entity extraction methods (CRF, Span-based, Seq2Seq) and find that the Span-based approach performs best in medical entity extraction tasks (F1-score = 88.6%). It demonstrates superior accuracy in identifying entity boundaries. In low-resource scenarios, we further explore the application of Few-shot Learning in medical entity extraction. Experimental results show that even with only 10-shot training samples, the model achieves an F1-score of 79.1%, verifying the effectiveness of Few-shot Learning under limited data conditions. This study confirms that the combination of pre-trained language models and Few-shot Learning can enhance the accuracy of medical entity extraction. Future research can integrate knowledge graphs and active learning strategies to improve the model's generalization and stability, providing a more effective solution for medical NLP research. Keywords: Natural Language Processing, medical named entity recognition, pre-trained language model, Few-shot Learning, information extraction, deep learning

NCL-CIR: Noise-aware Contrastive Learning for Composed Image Retrieval

Peng Gao,Yujian Lee,Zailong Chen,Hui zhang,Xubo Liu,Yiyang Hu,Guquang Jing

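Task: Propose a noise-aware contrastive learning method (NCL-CIR) to handle mismatches between query pairs and target images in Composed Image Retrieval (CIR).

Motivation: Existing CIR methods assume perfect alignment between query pairs and target images, whereas in practice inaccurate text, low-quality images, or annotation errors often cause mismatches, producing noise pairs that hurt model performance.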

Details

Method: Propose NCL-CIR, comprising a Weight Compensation Block (WCB) and a Noise-pair Filter Block (NFB), which uses a Gaussian Mixture Model (GMM) to predict noise pairs and a soft-label Noise Contrastive Estimation loss. Result: Experiments show that NCL-CIR performs strongly on benchmark datasets and effectively reduces the influence of mismatched samples. Conclusion: The noise-aware mechanism of NCL-CIR markedly improves CIR performance and addresses the noise pair problem. Abstract: Composed Image Retrieval (CIR) seeks to find a target image using a multi-modal query, which combines an image with modification text to pinpoint the target. While recent CIR methods have shown promise, they mainly focus on exploring relationships between the query pairs (image and text) through data augmentation or model design. These methods often assume perfect alignment between queries and target images, an idealized scenario rarely encountered in practice. In reality, pairs are often partially or completely mismatched due to issues like inaccurate modification texts, low-quality target images, and annotation errors. Ignoring these mismatches leads to numerous False Positive Pairs (FPPs), denoted as noise pairs, in the dataset, causing the model to overfit and ultimately reducing its performance. To address this problem, we propose the Noise-aware Contrastive Learning for CIR (NCL-CIR), comprising two key components: the Weight Compensation Block (WCB) and the Noise-pair Filter Block (NFB). The WCB, coupled with diverse weight maps, can ensure more stable token representations of multi-modal queries and target images. Meanwhile, the NFB, in conjunction with the Gaussian Mixture Model (GMM), predicts noise pairs by evaluating loss distributions and generates soft labels correspondingly, allowing for the design of the soft-label based Noise Contrastive Estimation (NCE) loss function. Consequently, the overall architecture helps to mitigate the influence of mismatched and partially matched samples, with experimental results demonstrating that NCL-CIR achieves exceptional performance on the benchmark datasets.
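The idea of predicting noise pairs from the loss distribution can be sketched with a two-component, one-dimensional Gaussian mixture fitted by EM. Treating the low-mean component as "clean" and using its posterior probability as the soft label is a common practice in noisy-label learning and is assumed here; the abstract does not give the NFB's exact formulation, and the loss values and function names below are made up.

```python
import math


def component_densities(x, means, variances, weights):
    """Weighted Gaussian density of x under each mixture component."""
    return [
        w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
        for m, v, w in zip(means, variances, weights)
    ]


def fit_two_gaussians(losses, iters=50):
    """Fit a 1-D two-component Gaussian mixture to per-pair losses with EM."""
    means = [min(losses), max(losses)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each loss value.
        resp = []
        for x in losses:
            dens = component_densities(x, means, variances, weights)
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate means, variances, and mixing weights.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component
            weights[k] = nk / len(losses)
    return means, variances, weights


def clean_probability(x, means, variances, weights):
    """Soft label: posterior probability that loss x comes from the low-loss component."""
    dens = component_densities(x, means, variances, weights)
    low = 0 if means[0] <= means[1] else 1
    return dens[low] / sum(dens)


# Hypothetical per-pair training losses: four well-matched pairs, three noisy ones.
losses = [0.10, 0.12, 0.15, 0.20, 2.0, 2.2, 2.5]
params = fit_two_gaussians(losses)
soft_labels = [clean_probability(x, *params) for x in losses]
```

Pairs with soft labels near 0 would be down-weighted by a soft-label loss, so that likely mismatches contribute less to the gradient.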

On the Spatial Structure of Mixture-of-Experts in Transformers

Daniel Bershatsky,Ivan Oseledets

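Task: Investigate whether MoE routers rely primarily on semantic features for expert selection.

Motivation: Challenge the common assumption that MoE routers rely primarily on semantic features, and explore the role of positional token information in routing decisions.

Details

Method: Tests the hypothesis through extensive empirical analysis and proposes a phenomenological explanation. Result: Positional token information is found to play a crucial role in routing decisions. Conclusion: The findings have practical implications for MoE-based architectures. Abstract: A common assumption is that MoE routers primarily leverage semantic features for expert selection. However, our study challenges this notion by demonstrating that positional token information also plays a crucial role in routing decisions. Through extensive empirical analysis, we provide evidence supporting this hypothesis, develop a phenomenological explanation of the observed behavior, and discuss practical implications for MoE-based architectures.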

AnomalyHybrid: A Domain-agnostic Generative Framework for General Anomaly Detection

Ying Zhao

Task: Propose AnomalyHybrid, a domain-agnostic anomaly generation framework that produces authentic and diverse anomalies by combining reference and target images.

Motivation: Existing methods mostly focus on industrial anomaly generation and generalize poorly to other application domains, so a general-purpose anomaly generation method is needed.

Details

Method: A GAN-based framework with two decoders that integrate the appearance of the reference image into the depth and edge structures of the target image, respectively. Result: Achieves high generation quality on multiple datasets and surpasses existing GAN-based methods on downstream tasks (classification, detection, segmentation). Conclusion: AnomalyHybrid is an efficient, general anomaly generation method that can be trained without annotations and suits a variety of application scenarios. Abstract: Anomaly generation is an effective way to mitigate data scarcity for anomaly detection tasks. Most existing works shine at industrial anomaly generation with multiple specialists or large generative models, rarely generalizing to anomalies in other applications. In this paper, we present AnomalyHybrid, a domain-agnostic framework designed to generate authentic and diverse anomalies simply by combining the reference and target images. AnomalyHybrid is a Generative Adversarial Network (GAN)-based framework with two decoders that integrate the appearance of the reference image into the depth and edge structures of the target image, respectively. With the help of the depth decoder, AnomalyHybrid achieves authentic generation, especially for anomalies with changing depth values, such as protrusions and dents. Moreover, it relaxes the fine-grained structural control of the edge decoder and brings more diversity. Without using annotations, AnomalyHybrid is easily trained with sets of color, depth and edge maps of the same images under different augmentations. Extensive experiments carried out on the HeliconiusButterfly, MVTecAD and MVTec3D datasets demonstrate that AnomalyHybrid surpasses the GAN-based state-of-the-art on anomaly generation and its downstream anomaly classification, detection and segmentation tasks. On the MVTecAD dataset, AnomalyHybrid achieves 2.06/0.32 IS/LPIPS for anomaly generation, 52.6 Acc for anomaly classification with ResNet34, and 97.3/72.9 AP for image/pixel-level anomaly detection with a simple UNet.

An overview of model uncertainty and variability in LLM-based sentiment analysis. Challenges, mitigation strategies and the role of explainability

David Herrera-Poyatos,Carlos Peláez-González,Cristina Zuheros,Andrés Herrera-Poyatos,Virilo Tejedor,Francisco Herrera,Rosana Montes

Task: Systematically explore the Model Variability Problem (MVP) of large language models (LLMs) in sentiment analysis.

Motivation: LLMs exhibit significant uncertainty and variability in sentiment analysis, which poses critical challenges to achieving reliable and consistent results.

Details

Method: Analyzes the core causes of MVP, provides illustrative examples and a case study, and discusses key challenges and mitigation strategies, in particular the role of the temperature parameter and the importance of explainability. Result: Provides a structured perspective on the stability, reproducibility, and trustworthiness of sentiment analysis models. Conclusion: The study helps develop more reliable, explainable, and robust sentiment analysis models and facilitates their deployment in high-stakes domains. Abstract: Large Language Models (LLMs) have significantly advanced sentiment analysis, yet their inherent uncertainty and variability pose critical challenges to achieving reliable and consistent outcomes. This paper systematically explores the Model Variability Problem (MVP) in LLM-based sentiment analysis, characterized by inconsistent sentiment classification, polarization, and uncertainty arising from stochastic inference mechanisms, prompt sensitivity, and biases in training data. We analyze the core causes of MVP, presenting illustrative examples and a case study to highlight its impact. In addition, we investigate key challenges and mitigation strategies, paying particular attention to the role of temperature as a driver of output randomness and emphasizing the crucial role of explainability in improving transparency and user trust. By providing a structured perspective on stability, reproducibility, and trustworthiness, this study helps develop more reliable, explainable, and robust sentiment analysis models, facilitating their deployment in high-stakes domains such as finance, healthcare, and policymaking, among others.
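
To make the role of temperature as a driver of output randomness concrete, the standard temperature-scaled softmax can be sketched as follows. The logits and the three-way sentiment label set in the usage note are hypothetical values chosen for illustration and are not taken from the paper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature.

    Lower temperatures sharpen the distribution (less sampling randomness);
    higher temperatures flatten it (more randomness).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For hypothetical logits `[2.0, 1.0, 0.5]` over the labels positive, neutral, and negative, the probability assigned to "positive" shrinks as the temperature rises, so repeated sampling yields a different label more often. This is one mechanism behind the run-to-run inconsistency the overview discusses.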

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

Shihao Wang,Zhiding Yu,Xiaohui Jiang,Shiyi Lan,Min Shi,Nadine Chang,Jan Kautz,Ying Li,Jose M. Alvarez

Task: Propose the OmniDrive dataset and framework, which extend the capabilities of vision-language models from 2D to 3D driving tasks through counterfactual reasoning.

Motivation: Existing vision-language models perform well in 2D, but their application to 3D driving tasks still needs improvement to enable more comprehensive understanding and decision-making.

Details

Method: Proposes the OmniDrive dataset, uses counterfactual reasoning to generate high-quality data, and designs two frameworks, Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception. Result: Achieves significant improvements on the DriveLM Q&A benchmark and nuScenes open-loop planning. Conclusion: The OmniDrive dataset and methods effectively improve vision-language model performance on 3D driving tasks and offer key insights for designing LLM-agents. Abstract: The advances in vision-language models (VLMs) have led to a growing interest in autonomous driving to leverage their strong reasoning capabilities. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions. Our counterfactual-based synthetic data annotation process generates large-scale, high-quality datasets, providing denser supervision signals that bridge planning trajectories and language-based reasoning. Further, we explore two advanced OmniDrive-Agent frameworks, namely Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception, revealing critical insights into designing effective LLM-agents. Significant improvements on the DriveLM Q&A benchmark and nuScenes open-loop planning demonstrate the effectiveness of our dataset and methods.

Directed Graph-alignment Approach for Identification of Gaps in Short Answers

Archana Sahu,Plaban Kumar Bhowmick

Task: Automatically identify missing items ("gaps") in student answers relative to reference answers, for use in formative assessment.

Motivation: Identifying the gaps in student answers makes it possible to give students feedback that helps improve learning outcomes.

Details

Method: Models the student answer and the reference answer as a pair of directed graphs and identifies gaps (at word, phrase, or sentence level) through graph alignment. Result: Performs well on the UNT, SciEntsBank, and Beetle datasets, with performance varying by dataset and answer type. Conclusion: The proposed method shows promise for gap identification and can be used to provide student feedback. Abstract: In this paper, we present a method for automatically identifying missing items, known as gaps, in student answers by comparing them against the corresponding model (reference) answers. The gaps can be identified at word, phrase or sentence level. The identified gaps are useful in providing feedback to the students for formative assessment. The problem of gap identification has been modelled as an alignment of a pair of directed graphs representing a student answer and the corresponding model answer for a given question. To validate the proposed approach, gap-annotated student answers have been developed from three widely known datasets in the short answer grading domain, namely University of North Texas (UNT), SciEntsBank, and Beetle; this gap-annotated dataset is available at: https://github.com/sahuarchana7/gaps-answers-dataset. Evaluation metrics used in traditional machine learning tasks have been adopted to evaluate the task of gap identification. Though the performance of the proposed approach varies across the datasets and the types of answers, the overall performance is observed to be promising.

UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding

Yang Jiao,Haibo Qiu,Zequn Jie,Shaoxiang Chen,Jingjing Chen,Lin Ma,Yu-Gang Jiang

Task: Propose UniToken, an auto-regressive generation model that combines discrete and continuous representations to unify visual understanding and image generation tasks.

Motivation: Existing methods rely on a single kind of visual representation and cannot capture both high-level semantics and low-level details, which limits multi-task performance.

Details

Method: A unified visual encoding framework that combines discrete and continuous representations, supporting multidimensional information extraction and selective knowledge assimilation. Result: Achieves state-of-the-art performance on multiple benchmarks, validating the model's unification and effectiveness. Conclusion: UniToken provides a unified and robust foundation for visual understanding and image generation tasks; the code and models are open source. Abstract: We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations, enabling seamless integration of unified visual understanding and image generation tasks. Unlike previous approaches that rely on unilateral visual representations, our unified visual encoding framework captures both high-level semantics and low-level details, delivering multidimensional information that empowers heterogeneous tasks to selectively assimilate domain-specific knowledge based on their inherent characteristics. Through in-depth experiments, we uncover key principles for developing a unified model capable of both visual understanding and image generation. Extensive evaluations across a diverse range of prominent benchmarks demonstrate that UniToken achieves state-of-the-art performance, surpassing existing approaches. These results establish UniToken as a robust foundation for future research in this domain. The code and models are available at https://github.com/SxJyJay/UniToken.

Saliency-driven Dynamic Token Pruning for Large Language Models

Yao Tao,Yehui Tang,Yun Wang,Mingjian Zhu,Hailin Hu,Yunhe Wang

Task: Propose Saliency-driven Dynamic Token Pruning (SDTP), a framework that reduces the computational cost of large language models (LLMs) in long-sequence inference.

Motivation: The quadratic computational complexity of the attention mechanism makes LLMs inefficient in long-sequence inference, and not all tokens contribute equally to the result.

Details

Method: Designs a lightweight saliency-driven prediction module that estimates token importance scores from hidden states and dynamically prunes redundant tokens at different layers, together with a ranking-based optimization strategy. Result: Experiments show that SDTP prunes 65% of the input tokens, reduces FLOPs by 33% to 47%, and speeds up inference by up to 1.75×, with performance comparable to the original model. Conclusion: SDTP is an efficient and general token pruning method that can be combined with other compression techniques, such as KV cache compression, to further improve inference efficiency. Abstract: Despite the recent success of large language models (LLMs), inference with LLMs is particularly challenging in long-sequence scenarios due to the quadratic computational complexity of the attention mechanism. Inspired by the interpretability theory of feature attribution in neural network models, we observe that not all tokens have the same contribution. Based on this observation, we propose a novel token pruning framework, namely Saliency-driven Dynamic Token Pruning (SDTP), to gradually and dynamically prune redundant tokens based on the input context. Specifically, a lightweight saliency-driven prediction module is designed to estimate the importance score of each token with its hidden state, which is added to different layers of the LLM to hierarchically prune redundant tokens. Furthermore, a ranking-based optimization strategy is proposed to minimize the ranking divergence of the saliency score and the predicted importance score. Extensive experiments have shown that our framework is generalizable to various models and datasets. By hierarchically pruning 65% of the input tokens, our method reduces FLOPs by 33% to 47% and achieves a speedup of up to 1.75× during inference, while maintaining comparable performance. We further demonstrate that SDTP can be combined with KV cache compression methods for further compression.
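
The hierarchical pruning idea in the abstract (score each token, keep the most important ones, repeat at later layers) can be sketched as below. This is an illustrative sketch only: `keep_indices`, `prune_hierarchically`, the `score_fn` callback standing in for the learned prediction module, and the per-stage keep ratios are all hypothetical and are not taken from the paper.

```python
import math


def keep_indices(importance, keep_ratio):
    """Return indices of the highest-scoring tokens, in their original order."""
    k = max(1, math.ceil(len(importance) * keep_ratio))
    top = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)[:k]
    return sorted(top)  # preserve sequence order for the following layers


def prune_hierarchically(tokens, score_fn, keep_ratios):
    """Apply one pruning stage per entry in keep_ratios.

    score_fn(tokens) must return one importance score per remaining token;
    in SDTP this role is played by a learned module reading hidden states.
    """
    for ratio in keep_ratios:
        scores = score_fn(tokens)
        tokens = [tokens[i] for i in keep_indices(scores, ratio)]
    return tokens
```

Because the stages compound, a schedule whose keep ratios multiply to about 0.35 would remove roughly 65% of the tokens overall, matching the headline figure in the abstract. The actual per-layer schedule used by the authors is not given there.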

FluentLip: A Phonemes-Based Two-stage Approach for Audio-Driven Lip Synthesis with Optical Flow Consistency

Shiyan Liu,Rui Qu,Yan Jin

Task: Propose FluentLip, a two-stage approach to audio-driven lip synthesis that addresses the challenges of lip synchronization, intelligibility, and video fluency.

Motivation: Although previous studies have made progress in synchronization and visual quality, lip intelligibility and video fluency remain persistent problems.

Details

Method: Combines a phoneme extractor and encoder for multimodal learning, uses an optical flow consistency loss to ensure natural transitions between frames, and introduces a diffusion chain into GAN training to improve stability and efficiency. Result: Experiments show that FluentLip clearly outperforms five SOTA methods in smoothness and naturalness, improving FID and PER by about 16.3% and 35.2%, respectively. Conclusion: FluentLip performs well in audio-driven lip synthesis, markedly improving synchronization, intelligibility, and fluency. Abstract: Generating consecutive images of lip movements that align with a given speech in audio-driven lip synthesis is a challenging task. While previous studies have made strides in synchronization and visual quality, lip intelligibility and video fluency remain persistent challenges. This work proposes FluentLip, a two-stage approach for audio-driven lip synthesis, incorporating three featured strategies. To improve lip synchronization and intelligibility, we integrate a phoneme extractor and encoder to generate a fusion of audio and phoneme information for multimodal learning. Additionally, we employ optical flow consistency loss to ensure natural transitions between image frames. Furthermore, we incorporate a diffusion chain during the training of Generative Adversarial Networks (GANs) to improve both stability and efficiency. We evaluate our proposed FluentLip through extensive experiments, comparing it with five state-of-the-art (SOTA) approaches across five metrics, including a proposed metric called Phoneme Error Rate (PER) that evaluates lip pose intelligibility and video fluency. The experimental results demonstrate that our FluentLip approach is highly competitive, achieving significant improvements in smoothness and naturalness. In particular, it outperforms these SOTA approaches by approximately 16.3% in Fréchet Inception Distance (FID) and 35.2% in PER.

An Empirical Comparison of Text Summarization: A Multi-Dimensional Evaluation of Large Language Models

Anantharaman Janakiraman,Behnaz Ghoraani

Task: Evaluate the performance of 17 large language models on text summarization.

Motivation: Address information overload and provide efficient summarization tools for domains such as journalism, medicine, and business.

Details

Method: Uses a multi-dimensional framework to evaluate models on 7 datasets at different output lengths, with metrics covering factual consistency, semantic similarity, lexical overlap, human-like quality, and efficiency factors. Result: Model performance differs significantly, with particular models excelling in specific respects (factual accuracy, human-like quality, processing efficiency). Performance also varies by dataset: models struggle on technical domains and do well on conversational content. Conclusion: The study provides evidence-based model selection recommendations for different use cases and improves evaluation methodology by jointly considering accuracy, efficiency, and cost-effectiveness. Abstract: Text summarization is crucial for mitigating information overload across domains like journalism, medicine, and business. This research evaluates summarization performance across 17 large language models (OpenAI, Google, Anthropic, open-source) using a novel multi-dimensional framework. We assessed models on seven diverse datasets (BigPatent, BillSum, CNN/DailyMail, PubMed, SAMSum, WikiHow, XSum) at three output lengths (50, 100, 150 tokens) using metrics for factual consistency, semantic similarity, lexical overlap, and human-like quality, while also considering efficiency factors. Our findings reveal significant performance differences, with specific models excelling in factual accuracy (deepseek-v3), human-like quality (claude-3-5-sonnet), and processing efficiency/cost-effectiveness (gemini-1.5-flash, gemini-2.0-flash). Performance varies dramatically by dataset, with models struggling on technical domains but performing well on conversational content. We identified a critical tension between factual consistency (best at 50 tokens) and perceived quality (best at 150 tokens). Our analysis provides evidence-based recommendations for different use cases, from high-stakes applications requiring factual accuracy to resource-constrained environments needing efficient processing. This comprehensive approach enhances evaluation methodology by integrating quality metrics with operational considerations, incorporating trade-offs between accuracy, efficiency, and cost-effectiveness to guide model selection for specific applications.

Evaluation framework for Image Segmentation Algorithms

Tatiana Merkulova,Bharani Jayakumar

Task: Propose a comprehensive evaluation framework for image segmentation algorithms, covering traditional methods, machine learning approaches, and deep learning techniques.

Motivation: Examine the fundamental concepts and importance of image segmentation, and the role of interactive segmentation in improving accuracy.

Details

Method: Describes a range of segmentation methods, including thresholding, edge detection, region growing, feature extraction, random forests, support vector machines, convolutional neural networks, U-Net, and Mask R-CNN, and details the experimental setup and three primary approaches: algorithm assisting user, user assisting algorithm, and hybrid methods. Result: Compares performance using metrics such as Intersection over Union (IoU), computation time, and user interaction time, and analyzes the strengths, limitations, and trade-offs of each method. Conclusion: Summarizes the practical applicability of these approaches across scenarios and outlines future work, including expanding datasets, developing more representative approaches, integrating real-time feedback, and exploring weakly supervised and self-supervised learning paradigms. Abstract: This paper presents a comprehensive evaluation framework for image segmentation algorithms, encompassing naive methods, machine learning approaches, and deep learning techniques. We begin by introducing the fundamental concepts and importance of image segmentation, and the role of interactive segmentation in enhancing accuracy. A detailed background theory section explores various segmentation methods, including thresholding, edge detection, region growing, feature extraction, random forests, support vector machines, convolutional neural networks, U-Net, and Mask R-CNN. The implementation and experimental setup are thoroughly described, highlighting three primary approaches: algorithm assisting user, user assisting algorithm, and hybrid methods. Evaluation metrics such as Intersection over Union (IoU), computation time, and user interaction time are employed to measure performance. A comparative analysis presents detailed results, emphasizing the strengths, limitations, and trade-offs of each method. The paper concludes with insights into the practical applicability of these approaches across various scenarios and outlines future work, focusing on expanding datasets, developing more representative approaches, integrating real-time feedback, and exploring weakly supervised and self-supervised learning paradigms to enhance segmentation accuracy and efficiency. Keywords: Image Segmentation, Interactive Segmentation, Machine Learning, Deep Learning, Computer Vision
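
As a reminder of how the Intersection over Union metric named above is computed, here is a minimal sketch for binary masks stored as nested lists. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for this example, not details from the paper.

```python
def intersection_over_union(pred_mask, true_mask):
    """IoU between two binary masks given as equally sized nested lists."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0
```

For example, a prediction `[[1, 1], [0, 0]]` against a ground truth `[[1, 0], [1, 0]]` shares one foreground pixel out of three covered by either mask, giving an IoU of 1/3.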

KnowsLM: A framework for evaluation of small language models for knowledge augmentation and humanised conversations

Chitranshu Harbola,Anupam Purwar

Task: Study how LoRA rank, dataset scale, and prompt prefix design affect knowledge retention and stylistic alignment.

Motivation: In conversational AI, using small and medium-sized language models to generate concise, context-aware, and human-like dialogue remains a complex challenge.

Details

Method: Compares fine-tuning with retrieval-augmented generation (RAG), evaluating knowledge accuracy, conversational quality, and conciseness. Result: Fine-tuning achieves better stylistic consistency, while RAG achieves better factual accuracy. Conclusion: Fine-tuning is suited to style adaptation, whereas RAG is better for real-time knowledge augmentation. Abstract: In the evolving landscape of conversational AI, generating concise, context-aware, and human-like dialogue using small and medium-sized language models (LLMs) remains a complex challenge. This study investigates the influence of LoRA rank, dataset scale, and prompt prefix design on both knowledge retention and stylistic alignment. While fine-tuning improves fluency and enables stylistic customization, its ability to integrate unseen knowledge is constrained -- particularly with smaller datasets. Conversely, RAG-augmented models, equipped to incorporate external documents at inference, demonstrated superior factual accuracy on out-of-distribution prompts, though they lacked the stylistic consistency achieved by fine-tuning. Evaluations by LLM-based judges across knowledge accuracy, conversational quality, and conciseness suggest that fine-tuning is best suited for tone adaptation, whereas RAG excels at real-time knowledge augmentation.

Thermoxels: a voxel-based method to generate simulation-ready 3D thermal models

Etienne Chassaing,Florent Forest,Olga Fink,Malcolm Mielle

Task: Propose Thermoxels, a voxel-based method that generates models compatible with finite element analysis (FEA) from sparse RGB and thermal images.

Motivation: Existing building assessment methods rely on qualitative thermal imaging or manual CAD design, which limits data-driven energy-saving decisions; Thermoxels aims to address this.

Details

Method: Takes pairs of RGB and thermal images as input, represents the scene's geometry and temperature with voxels, and converts the optimized model into an FEA-compatible tetrahedral mesh. Result: Thermoxels generates RGB+thermal meshes and reaches convergence in a heat conduction simulation; its image synthesis is competitive with existing methods. Conclusion: Thermoxels offers an efficient and precise solution for building energy assessment and shows potential for practical use. Abstract: In the European Union, buildings account for 42% of energy use and 35% of greenhouse gas emissions. Since most existing buildings will still be in use by 2050, retrofitting is crucial for emissions reduction. However, current building assessment methods rely mainly on qualitative thermal imaging, which limits data-driven decisions for energy savings. On the other hand, quantitative assessments using finite element analysis (FEA) offer precise insights but require manual CAD design, which is tedious and error-prone. Recent advances in 3D reconstruction, such as Neural Radiance Fields (NeRF) and Gaussian Splatting, enable precise 3D modeling from sparse images but lack clearly defined volumes and the interfaces between them needed for FEA. We propose Thermoxels, a novel voxel-based method able to generate FEA-compatible models, including both geometry and temperature, from a sparse set of RGB and thermal images. Using pairs of RGB and thermal images as input, Thermoxels represents a scene's geometry as a set of voxels comprising color and temperature information. After optimization, a simple process is used to transform Thermoxels' models into tetrahedral meshes compatible with FEA. We demonstrate Thermoxels' capability to generate RGB+Thermal meshes of 3D scenes, surpassing other state-of-the-art methods. To showcase the practical applications of Thermoxels' models, we conduct a simple heat conduction simulation using FEA, achieving convergence from an initial state defined by Thermoxels' thermal reconstruction. Additionally, we compare Thermoxels' image synthesis abilities with current state-of-the-art methods, showing competitive results, and discuss the limitations of existing metrics in assessing mesh quality.

DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition

Qi Zhang,Huitong Pan,Zhijia Chen,Longin Jan Latecki,Cornelia Caragea,Eduard Dragut

Task: Propose a training dynamics-based label cleaning method to improve Distantly Supervised Named Entity Recognition (DS-NER).

Motivation: Distant annotation introduces many mislabeled instances, which limits DS-NER performance; existing work mostly focuses on intricate model design, while label cleaning has received little attention for NER.

Details

Method: Uses the model's behavior over the course of training to characterize distantly annotated samples, and introduces an automatic threshold estimation strategy to locate erroneous labels. Result: Models trained on the cleaned DS-NER datasets improve F1-score by 3.18% to 8.95%, and the method outperforms numerous advanced DS-NER approaches across four datasets. Conclusion: Label cleaning effectively improves DS-NER performance and offers a new direction for distantly supervised learning. Abstract: Distantly Supervised Named Entity Recognition (DS-NER) has attracted attention due to its scalability and ability to automatically generate labeled data. However, distant annotation introduces many mislabeled instances, limiting its performance. Most existing work attempts to solve this problem by developing intricate models to learn from the noisy labels. An alternative approach is to attempt to clean the labeled data, thus increasing the quality of distant labels. This approach has received little attention for NER. In this paper, we propose a training dynamics-based label cleaning approach, which leverages the behavior of a model as training progresses to characterize the distantly annotated samples. We also introduce an automatic threshold estimation strategy to locate the errors in distant labels. Extensive experimental results demonstrate that: (1) models trained on our cleaned DS-NER datasets, which were refined by directly removing identified erroneous annotations, achieve significant improvements in F1-score, ranging from 3.18% to 8.95%; and (2) our method outperforms numerous advanced DS-NER approaches across four datasets.
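
One common way to use training dynamics for label cleaning is to track, epoch by epoch, the probability the model assigns to each sample's given label and to flag samples whose average stays low. The sketch below illustrates that general idea only. It is not the DynClean algorithm: the mean-confidence statistic, the function name, and the fixed `threshold` parameter are assumptions, and the paper's automatic threshold estimation is not reproduced.

```python
from statistics import mean


def flag_suspect_labels(confidence_history, threshold=0.5):
    """Return indices of samples whose distant label looks unreliable.

    confidence_history[i] lists, per training epoch, the probability the model
    assigned to sample i's distant label. A fixed threshold is assumed here.
    """
    return [
        index
        for index, per_epoch in enumerate(confidence_history)
        if mean(per_epoch) < threshold
    ]
```

With histories such as `[[0.6, 0.8, 0.9], [0.3, 0.2, 0.1], [0.5, 0.7, 0.9]]`, only the second sample is flagged: the model never becomes confident in its label, which is the kind of behavior a cleaning step would treat as evidence of a distant-labeling error.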

PRISM: Probabilistic Representation for Integrated Shape Modeling and Generation

Lei Cheng,Mahdi Saleh,Qing Cheng,Lu Sang,Hongli Xu,Daniel Cremers,Federico Tombari

Task: Propose PRISM, a compositional approach that addresses the challenge of modeling complex geometry and semantics in 3D full-shape generation.

Motivation: Current methods struggle to integrate the contextual and structural information of 3D shapes effectively, so generated shapes fall short in quality and controllability.

Details

Method: Integrates categorical diffusion models with Statistical Shape Models (SSM) and Gaussian Mixture Models (GMM), using compositional SSMs to capture part-level geometric variations and a GMM to represent part semantics. Result: Experiments show that PRISM significantly outperforms existing methods on shape generation and manipulation tasks, producing shapes with high fidelity and diversity while preserving structural coherence. Conclusion: The compositional approach of PRISM improves the quality and controllability of 3D shape generation; the code will be made public. Abstract: Despite the advancements in 3D full-shape generation, accurately modeling complex geometries and semantics of shape parts remains a significant challenge, particularly for shapes with varying numbers of parts. Current methods struggle to effectively integrate the contextual and structural information of 3D shapes into their generative processes. We address these limitations with PRISM, a novel compositional approach for 3D shape generation that integrates categorical diffusion models with Statistical Shape Models (SSM) and Gaussian Mixture Models (GMM). Our method employs compositional SSMs to capture part-level geometric variations and uses GMM to represent part semantics in a continuous space. This integration enables both high fidelity and diversity in generated shapes while preserving structural coherence. Through extensive experiments on shape generation and manipulation tasks, we demonstrate that our approach significantly outperforms previous methods in both quality and controllability of part-level operations. Our code will be made publicly available.

Steering off Course: Reliability Challenges in Steering Language Models

Patrick Queiroz Da Silva,Hari Sethuraman,Dheeraj Rajagopal,Hannaneh Hajishirzi,Sachin Kumar

Task: Systematically evaluate the robustness of three language model steering methods (DoLa, function vectors, and task vectors) across many models.

Motivation: Prior studies evaluate steering methods on only a few models, leaving the robustness of these methods poorly understood.

Details

Method: Tests the three steering methods on up to 36 models from 14 families, with sizes ranging from 1.5B to 70B parameters. Result: The effectiveness of the steering methods varies substantially; many models show no improvement, and some show degraded performance. Conclusion: The assumptions underlying these methods are flawed, which challenges their reliability as scalable steering solutions. Abstract: Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods -- DoLa, function vectors, and task vectors. In contrast to the original studies, which evaluated a handful of models, we test up to 36 models belonging to 14 families with sizes ranging from 1.5B to 70B parameters. Our experiments reveal substantial variability in the effectiveness of the steering approaches, with a large number of models showing no improvement and at times degradation in steering performance. Our analysis demonstrates fundamental flaws in the assumptions underlying these methods, challenging their reliability as scalable steering solutions.

VSLAM-LAB: A Comprehensive Framework for Visual SLAM Methods and Datasets

Alejandro Fontan,Tobias Fischer,Javier Civera,Michael Milford

Task: Propose VSLAM-LAB, a unified framework that streamlines the development, evaluation, and deployment of VSLAM systems.

Motivation: Address the fragmented toolchains, complex system configurations, and inconsistent evaluation methodologies in VSLAM research.

Details

Method: Designs a unified framework that supports compilation and configuration of VSLAM algorithms, automated dataset downloading and preprocessing, and standardized experiment design and evaluation. Result: By simplifying the workflow and providing standardized tools, VSLAM-LAB improves research efficiency and reproducibility. Conclusion: VSLAM-LAB gives researchers an efficient, broadly compatible tool and helps advance VSLAM technology. Abstract: Visual Simultaneous Localization and Mapping (VSLAM) research faces significant challenges due to fragmented toolchains, complex system configurations, and inconsistent evaluation methodologies. To address these issues, we present VSLAM-LAB, a unified framework designed to streamline the development, evaluation, and deployment of VSLAM systems. VSLAM-LAB simplifies the entire workflow by enabling seamless compilation and configuration of VSLAM algorithms, automated dataset downloading and preprocessing, and standardized experiment design, execution, and evaluation--all accessible through a single command-line interface. The framework supports a wide range of VSLAM systems and datasets, offering broad compatibility and extendability while promoting reproducibility through consistent evaluation metrics and analysis tools. By reducing implementation complexity and minimizing configuration overhead, VSLAM-LAB empowers researchers to focus on advancing VSLAM methodologies and accelerates progress toward scalable, real-world solutions. We demonstrate the ease with which user-relevant benchmarks can be created: here, we introduce difficulty-level-based categories, but one could envision environment-specific or condition-specific categories.

Eylon Caplan,Tania Chakraborty,Dan Goldwasser

Task: Define and address a new task, "Group Theorization", in which a system must write theories that differentiate expression across demographic groups.

Motivation: Understanding how different demographic groups express themselves is essential for social science and for assessing bias in large language models (LLMs), but existing methods struggle to produce generalizable theories.

Details

Method: Builds a large dataset called Splits! by splitting Reddit posts by neutral topics and by demographics, and proposes a simple evaluation framework. Result: The Splits! dataset and evaluation scripts are publicly released to help researchers assess how methods infer, and potentially misrepresent, group differences in expression. Conclusion: The Group Theorization task and the Splits! dataset provide new tools and methods for studying group differences in expression. Abstract: Understanding how people of various demographics think, feel, and express themselves (collectively called group expression) is essential for social science and underlies the assessment of bias in Large Language Models (LLMs). While LLMs can effectively summarize group expression when provided with empirical examples, coming up with generalizable theories of how a group's expression manifests in real-world text is challenging. In this paper, we define a new task called Group Theorization, in which a system must write theories that differentiate expression across demographic groups. We make available a large dataset on this task, Splits!, constructed by splitting Reddit posts by neutral topics (e.g. sports, cooking, and movies) and by demographics (e.g. occupation, religion, and race). Finally, we suggest a simple evaluation framework for assessing how effectively a method can generate 'better' theories about group expression, backed by human validation. We publicly release the raw corpora and evaluation scripts for Splits! to help researchers assess how methods infer--and potentially misrepresent--group differences in expression. We make Splits! and our evaluation module available at https://github.com/eyloncaplan/splits.

Spatial-Geometry Enhanced 3D Dynamic Snake Convolutional Neural Network for Hyperspectral Image Classification

Guandong Li,Mengxia Ye

Task: Propose the Spatial-Geometry Enhanced 3D Dynamic Snake Network (SG-DSCNet), built on an improved 3D-DenseNet model, for hyperspectral image classification.

Motivation: Address the missed detections that deep neural networks suffer in hyperspectral image classification because of complex and sparse ground object distributions, small clustered structures, and elongated multi-branch features.

Details

Method: Uses Dynamic Snake Convolution (DSCConv) to make kernels more flexible, proposes a multi-view feature fusion strategy, and strengthens model representation through dynamic kernel aggregation. Result: Outperforms mainstream hyperspectral classification methods on the IN, UP, and KSC datasets. Conclusion: Through dynamic feature responses and multi-view fusion, SG-DSCNet significantly improves hyperspectral image classification performance. Abstract: Deep neural networks face several challenges in hyperspectral image classification, including complex and sparse ground object distributions, small clustered structures, and elongated multi-branch features that often lead to missing detections. To better adapt to ground object distributions and achieve adaptive dynamic feature responses while skipping redundant information, this paper proposes a Spatial-Geometry Enhanced 3D Dynamic Snake Network (SG-DSCNet) based on an improved 3D-DenseNet model. The network employs Dynamic Snake Convolution (DSCConv), which introduces deformable offsets to enhance kernel flexibility through constrained self-learning, thereby improving regional perception of ground objects. Additionally, we propose a multi-view feature fusion strategy that generates multiple morphological kernel templates from DSCConv to observe target structures from different perspectives and achieve efficient feature fusion through summarizing key characteristics. This dynamic approach enables the model to focus more flexibly on critical spatial structures when processing different regions, rather than relying on fixed receptive fields of single static kernels. The DSC module enhances model representation capability through dynamic kernel aggregation without increasing network depth or width. Experimental results demonstrate superior performance on the IN, UP, and KSC datasets, outperforming mainstream hyperspectral classification methods.

scAgent: Universal Single-Cell Annotation via a LLM Agent

Yuren Mao,Yu Mi,Peigen Liu,Mengfei Zhang,Hanqing Liu,Yunjun Gao

Task: Propose scAgent, a universal cell annotation framework based on large language models, for identifying and discovering cell types across diverse tissues.

Motivation: Existing cell type annotation methods are limited to a fixed set of cell types within a specific tissue, while universal cell annotation that generalizes across tissues, discovers novel cell types, and extends to novel cell types remains little explored.

Details

Method: Builds the scAgent framework on single-cell RNA-seq data and Large Language Models (LLMs). Result: Experiments on 160 cell types and 35 tissues show that scAgent performs strongly in general cell-type annotation, novel cell discovery, and extension to novel cell types. Conclusion: scAgent is an efficient and universal cell annotation framework that identifies and discovers cell types across tissues and is data-efficient. Abstract: Cell type annotation is critical for understanding cellular heterogeneity. Based on single-cell RNA-seq data and deep learning models, good progress has been made in annotating a fixed number of cell types within a specific tissue. However, universal cell annotation, which can generalize across tissues, discover novel cell types, and extend to novel cell types, remains less explored. To fill this gap, this paper proposes scAgent, a universal cell annotation framework based on Large Language Models (LLMs). scAgent can identify cell types and discover novel cell types in diverse tissues; furthermore, it is data-efficient in learning novel cell types. Experimental studies on 160 cell types and 35 tissues demonstrate the superior performance of scAgent in general cell-type annotation, novel cell discovery, and extensibility to novel cell types.

Domain Generalization for Face Anti-spoofing via Content-aware Composite Prompt Engineering

Jiabao Guo,Ajian Liu,Yunfeng Diao,Jin Zhang,Hui Ma,Bo Zhao,Richang Hong,Meng Wang

Task: Propose Content-aware Composite Prompt Engineering (CCPE), a new method for the challenges of Domain Generalization (DG) in Face Anti-Spoofing (FAS).

Motivation: Existing CLIP-based algorithms have two problems in DG FAS: the categories lack semantics, and a single form of prompt cannot describe the various types of spoofing.

Details

Method: Uses instance-wise composite prompts (a fixed template plus learnable prompts) and a Cross-Modal Guidance Module (CGM) that dynamically adjusts features for fusion. Result: CCPE is validated in multiple cross-domain experiments and achieves state-of-the-art (SOTA) results. Conclusion: Through content-aware prompts and cross-modal guidance, CCPE significantly improves DG FAS performance. Abstract: The challenge of Domain Generalization (DG) in Face Anti-Spoofing (FAS) is the significant interference of domain-specific signals on subtle spoofing clues. Recently, some CLIP-based algorithms have been developed to alleviate this interference by adjusting the weights of visual classifiers. However, our analysis shows that this class-wise prompt engineering suffers from two shortcomings for DG FAS: (1) Facial categories, such as real or spoof, have no semantics for the CLIP model, making it difficult to learn accurate category descriptions. (2) A single form of prompt cannot portray the various types of spoofing. In this work, instead of class-wise prompts, we propose a novel Content-aware Composite Prompt Engineering (CCPE) that generates instance-wise composite prompts, including both fixed template and learnable prompts. Specifically, our CCPE constructs content-aware prompts from two branches: (1) Inherent content prompts explicitly benefit from abundant transferred knowledge from the instruction-based Large Language Model (LLM). (2) Learnable content prompts implicitly extract the most informative visual content via Q-Former. Moreover, we design a Cross-Modal Guidance Module (CGM) that dynamically adjusts unimodal features for fusion to achieve better generalized FAS. Finally, our CCPE has been validated for its effectiveness in multiple cross-domain experiments and achieves state-of-the-art (SOTA) results.

Causal Retrieval with Semantic Consideration

Hyunseo Shin,Wonseok Hwang

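Task: Propose CAWAI, a retrieval model that combines semantic and causal relations, to improve conversational AI systems in knowledge-intensive domains.

Motivation: Existing retrieval models focus on surface-level semantic similarity and overlook deeper causal relations, which are critical for accuracy in knowledge-intensive domains such as biomedicine and law.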

Details

Method: Proposes CAWAI, a model trained with dual objectives (semantic and causal relations) to improve retrieval. Result: CAWAI performs strongly on diverse causal retrieval tasks, especially in large-scale retrieval settings, and shows strong zero-shot generalization. Conclusion: By combining semantic and causal relations, CAWAI significantly improves retrieval in knowledge-intensive domains. Abstract: Recent advancements in large language models (LLMs) have significantly enhanced the performance of conversational AI systems. To extend their capabilities to knowledge-intensive domains such as biomedical and legal fields, where accuracy is critical, LLMs are often combined with information retrieval (IR) systems to generate responses based on retrieved documents. However, for IR systems to effectively support such applications, they must go beyond simple semantic matching and accurately capture diverse query intents, including causal relationships. Existing IR models primarily focus on retrieving documents based on surface-level semantic similarity, overlooking deeper relational structures such as causality. To address this, we propose CAWAI, a retrieval model that is trained with dual objectives: semantic and causal relations. Our extensive experiments demonstrate that CAWAI outperforms various models on diverse causal retrieval tasks, especially under large-scale retrieval settings. We also show that CAWAI exhibits strong zero-shot generalization across scientific domain QA tasks.

VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT

Zhuo Zhi,Qiangqiang Wu,Minghe shen,Wenbo Li,Yinchuan Li,Kun Shao,Kaiwen Zhou

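Task: Propose a chain-of-thought (CoT) process designed specifically for long video analysis, addressing the challenges existing methods face in long video understanding.

Motivation: Existing methods rely on the reasoning ability of large language models (LLMs), lack dedicated mechanisms for long video scenarios, and are vulnerable to errors or noise from external tools.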

Details

Method: Proposes a CoT process with a plan-adjust mode, combined with heuristic uncertainty estimation, that guides the LLM to plan and adjust its information-gathering strategy step by step. Result: Experiments show that the uncertainty-aware CoT effectively reduces noise from external tools; on three long video benchmarks, VideoAgent2 outperforms the previous state-of-the-art agent-based method, VideoAgent, by an average of 13.1%. Conclusion: With a dedicated CoT and uncertainty estimation, VideoAgent2 significantly improves the reliability and performance of long video understanding. Abstract: Long video understanding has emerged as an increasingly important yet challenging task in computer vision. Agent-based approaches are gaining popularity for processing long videos, as they can handle extended sequences and integrate various tools to capture fine-grained information. However, existing methods still face several challenges: (1) they often rely solely on the reasoning ability of large language models (LLMs) without dedicated mechanisms to enhance reasoning in long video scenarios; and (2) they remain vulnerable to errors or noise from external tools. To address these issues, we propose a specialized chain-of-thought (CoT) process tailored for long video analysis. Our proposed CoT with plan-adjust mode enables the LLM to incrementally plan and adapt its information-gathering strategy. We further incorporate heuristic uncertainty estimation of both the LLM and external tools to guide the CoT process. This allows the LLM to assess the reliability of newly collected information, refine its collection strategy, and make more robust decisions when synthesizing final answers. Empirical experiments show that our uncertainty-aware CoT effectively mitigates noise from external tools, leading to more reliable outputs. We implement our approach in a system called VideoAgent2, which also includes additional modules such as general context acquisition and specialized tool design. Evaluation on three dedicated long video benchmarks (and their subsets) demonstrates that VideoAgent2 outperforms the previous state-of-the-art agent-based method, VideoAgent, by an average of 13.1% and achieves leading performance among all zero-shot approaches.

Sequential-NIAH: A Needle-In-A-Haystack Benchmark for Extracting Sequential Needles from Long Contexts

Yifei Yu,Qian-Wen Zhang,Lingfeng Qiao,Di Yin,Fang Li,Jie Wang,Zengxi Chen,Suncong Zheng,Xiaolong Liang,Xing Sun

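Task: Evaluate the ability of large language models (LLMs) to handle long contexts, in particular to extract information relevant to a specific query from lengthy inputs.

Motivation: The ability of LLMs to extract sequential information from long texts needs to be evaluated in order to advance research in this area.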

Details

Method: Introduces the Sequential-NIAH benchmark, which includes three needle generation pipelines (synthetic, real, and open-domain QA), and trains a synthetic data-driven evaluation model. Result: In experiments on six well-known LLMs, the best model reaches an accuracy of only 63.15%, showing the challenge posed by longer contexts and more needles. Conclusion: Sequential-NIAH is an important benchmark for evaluating the long-text extraction ability of LLMs; its reliability is validated and the results point to room for improvement. Abstract: Evaluating the ability of large language models (LLMs) to handle extended contexts is critical, particularly for retrieving information relevant to specific queries embedded within lengthy inputs. We introduce Sequential-NIAH, a benchmark specifically designed to evaluate the capability of LLMs to extract sequential information items (known as needles) from long contexts. The benchmark comprises three types of needle generation pipelines: synthetic, real, and open-domain QA. It includes contexts ranging from 8K to 128K tokens in length, with a dataset of 14,000 samples (2,000 reserved for testing). To facilitate evaluation on this benchmark, we trained a synthetic data-driven evaluation model capable of evaluating answer correctness based on chronological or logical order, achieving an accuracy of 99.49% on synthetic test data. We conducted experiments on six well-known LLMs, revealing that even the best-performing model achieved a maximum accuracy of only 63.15%. Further analysis highlights the growing challenges posed by increasing context lengths and the number of needles, underscoring substantial room for improvement. Additionally, noise robustness experiments validate the reliability of the benchmark, making Sequential-NIAH an important reference for advancing research on long text extraction capabilities of LLMs.

Statistical Guarantees Of False Discovery Rate In Medical Instance Segmentation Tasks Based on Conformal Risk Control

Mengxia Dai,Wenqian Luo,Tianyang Li

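Task: Propose a robust quality control framework based on conformal prediction theory to address confidence calibration in medical image instance segmentation.

Motivation: Address the risk of misdiagnosis caused by confidence calibration issues when deep learning models (such as Mask R-CNN and BlendMask) are used in high-risk medical scenarios.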

Details

Method: Designs a calibration-aware loss function that dynamically tunes the segmentation threshold based on a user-defined risk level α, and uses exchangeable calibration data to ensure that the FNR or FDR on test data stays below α. Result: The framework rigorously bounds the FDR metric on the test set without modifying the architecture of mainstream segmentation models or the structure of datasets. Conclusion: The proposed framework effectively addresses confidence calibration in medical image segmentation and provides reliable quality control for high-risk scenarios. Abstract: Instance segmentation plays a pivotal role in medical image analysis by enabling precise localization and delineation of lesions, tumors, and anatomical structures. Although deep learning models such as Mask R-CNN and BlendMask have achieved remarkable progress, their application in high-risk medical scenarios remains constrained by confidence calibration issues, which may lead to misdiagnosis. To address this challenge, we propose a robust quality control framework based on conformal prediction theory. This framework innovatively constructs a risk-aware dynamic threshold mechanism that adaptively adjusts segmentation decision boundaries according to clinical requirements. Specifically, we design a calibration-aware loss function that dynamically tunes the segmentation threshold based on a user-defined risk level $\alpha$. Utilizing exchangeable calibration data, this method ensures that the expected FNR or FDR on test data remains below $\alpha$ with high probability. The framework maintains compatibility with mainstream segmentation models (e.g., Mask R-CNN, BlendMask+ResNet-50-FPN) and datasets (PASCAL VOC format) without requiring architectural modifications. Empirical results demonstrate that we rigorously bound the FDR metric marginally over the test set via our developed calibration framework.
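
The threshold calibration described in this abstract can be sketched roughly as follows. This is a minimal illustration of generic conformal risk control, not the authors' implementation: the function names, the per-image false discovery proportion used as the loss, and the threshold grid are all assumptions, and the scan stops at the first violation because the false discovery proportion is not guaranteed to be monotone in the threshold.

```python
def fdp(scores, labels, threshold):
    """False discovery proportion for one image: share of kept detections that are wrong."""
    kept = [is_true for s, is_true in zip(scores, labels) if s >= threshold]
    if not kept:
        return 0.0  # nothing kept, so no false discoveries
    return sum(1 for is_true in kept if not is_true) / len(kept)


def calibrate_threshold(calibration, alpha, grid):
    """Pick the lowest threshold whose adjusted empirical risk stays <= alpha.

    calibration: list of (scores, labels) pairs, one per calibration image
                 (hypothetical data layout).
    The bound (n * risk + B) / (n + 1) with B = 1 is the finite-sample
    correction used in conformal risk control for a loss bounded by 1.
    Returns None when even the strictest threshold violates the bound.
    """
    n = len(calibration)
    chosen = None
    for t in sorted(grid, reverse=True):  # scan from strictest to loosest
        risk = sum(fdp(s, y, t) for s, y in calibration) / n
        if (n * risk + 1.0) / (n + 1) <= alpha:
            chosen = t
        else:
            break  # stop at the first violation to stay conservative
    return chosen
```

With nine identical calibration images scored `[0.9, 0.6, 0.3]` and labeled `[True, True, False]`, a grid of `[0.2, 0.5, 0.8]`, and `alpha = 0.2`, the function returns `0.5`: lowering the threshold to `0.2` would admit the false detection and push the adjusted risk to 0.4.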

Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs

Will Cai,Tianneng Shi,Xuandong Zhao,Dawn Song

Task: Formalize and address the problem of detecting model substitution in Large Language Model (LLM) APIs.

Motivation: The lack of transparency in black-box LLM APIs undermines trust and fairness, as providers may substitute models with cheaper alternatives without disclosure.

Details

Method: Systematically evaluate existing verification techniques (output-based statistical tests, benchmark evaluations, log probability analysis) under realistic attack scenarios. Result: Existing methods relying solely on text outputs have limitations, especially against subtle or adaptive attacks; log probability analysis offers stronger guarantees but is less accessible. Conclusion: Hardware-based solutions like Trusted Execution Environments (TEEs) may provide provable model integrity, balancing security, performance, and adoption. Abstract: The proliferation of Large Language Models (LLMs) accessed via black-box APIs introduces a significant trust challenge: users pay for services based on advertised model capabilities (e.g., size, performance), but providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs. This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking. Detecting such substitutions is difficult due to the black-box nature, typically limiting interaction to input-output queries. This paper formalizes the problem of model substitution detection in LLM APIs. We systematically evaluate existing verification techniques, including output-based statistical tests, benchmark evaluations, and log probability analysis, under various realistic attack scenarios like model quantization, randomized substitution, and benchmark evasion. Our findings reveal the limitations of methods relying solely on text outputs, especially against subtle or adaptive attacks. While log probability analysis offers stronger guarantees when available, its accessibility is often limited. We conclude by discussing the potential of hardware-based solutions like Trusted Execution Environments (TEEs) as a pathway towards provable model integrity, highlighting the trade-offs between security, performance, and provider adoption. 
Code is available at https://github.com/sunblaze-ucb/llm-api-audit

Building LLM Agents by Incorporating Insights from Computer Systems

Yapeng Mi,Zhi Gao,Xiaojian Ma,Qing Li

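Task: Propose a structured framework, grounded in a computer systems perspective, for designing LLM-driven autonomous agents.

Motivation: Current LLM agent designs lack systematic principles, which limits their generality and scalability.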

Details

Method: Draws on the von Neumann architecture to propose a framework built on modular design and universal principles. Result: A review and comparative analysis from the computer systems perspective provide a foundation for systematic LLM agent design. Conclusion: The framework offers a systematic direction and future research paths for the design and development of LLM agents. Abstract: LLM-driven autonomous agents have emerged as a promising direction in recent years. However, many of these LLM agents are designed empirically or based on intuition, often lacking systematic design principles, which results in diverse agent structures with limited generality and scalability. In this paper, we advocate for building LLM agents by incorporating insights from computer systems. Inspired by the von Neumann architecture, we propose a structured framework for LLM agentic systems, emphasizing modular design and universal principles. Specifically, this paper first provides a comprehensive review of LLM agents from the computer system perspective, then identifies key challenges and future directions inspired by computer system design, and finally explores the learning mechanisms for LLM agents beyond the computer system. The insights gained from this comparative analysis offer a foundation for systematic LLM agent design and advancement.

Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Yubo Li,Xiaobin Shen,Xinyu Yao,Xueying Ding,Yidi Miao,Ramayya Krishnan,Rema Padman

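Task: Comprehensively review and assess the performance of large language models (LLMs) in multi-turn interactions and the methods for enhancing it.

Motivation: Real-world applications require sophisticated multi-turn interaction, but LLMs still face challenges in this area.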

Details

Method: Systematically analyzes the challenges of multi-turn interaction, organizes existing benchmarks and datasets, and reviews enhancement methods (model-centric strategies, external integration approaches, and agent-based techniques). Result: Summarizes the current state of multi-turn interaction and its enhancement methods, and proposes future research directions. Conclusion: Multi-turn interaction is an important direction for LLMs, and further research is needed to improve its robustness and effectiveness. Abstract: Recent advancements in large language models (LLMs) have revolutionized their ability to handle single-turn tasks, yet real-world applications demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent advancements in evaluating and enhancing multi-turn interactions in LLMs. Focusing on task-specific scenarios, from instruction following in diverse domains such as math and coding to complex conversational engagements in roleplay, healthcare, education, and even adversarial jailbreak settings, we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness over prolonged dialogues. The paper organizes current benchmarks and datasets into coherent categories that reflect the evolving landscape of multi-turn dialogue evaluation. In addition, we review a range of enhancement methodologies under multi-turn settings, including model-centric strategies (contextual learning, supervised fine-tuning, reinforcement learning, and new architectures), external integration approaches (memory-augmented, retrieval-based methods, and knowledge graph), and agent-based techniques for collaborative interactions. Finally, we discuss open challenges and propose future directions for research to further advance the robustness and effectiveness of multi-turn interactions in LLMs. Related resources and papers are available at https://github.com/yubol-cmu/Awesome-Multi-Turn-LLMs.

Learning Conditionally Independent Transformations using Normal Subgroups in Group Theory

Kayato Nishitsunoi,Yoshiyuki Ohmura,Takayuki Komatsu,Yasuo Kuniyoshi

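Task: Propose an unsupervised representation learning method that uses normal subgroups to separate conditionally independent transformations.

Motivation: The theoretical framework for separating different transformations in unsupervised representation learning remains underdeveloped; existing methods mainly target commutative transformations and cannot handle transformations that are conditionally independent but noncommutative.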

Details

Method: Draws on the decomposition of groups through normal subgroups in Galois theory to propose a new method for separating conditionally independent transformations. Result: Experiments show that the method successfully categorizes conditionally independent transformations (such as rotation and translation) in an unsupervised manner. Conclusion: Group decomposition via normal subgroups is closely linked to transformation categorization in representation learning. Abstract: Humans develop certain cognitive abilities to recognize objects and their transformations without explicit supervision, highlighting the importance of unsupervised representation learning. A fundamental challenge in unsupervised representation learning is to separate different transformations in learned feature representations. Although algebraic approaches have been explored, a comprehensive theoretical framework remains underdeveloped. Existing methods decompose transformations based on algebraic independence, but these methods primarily focus on commutative transformations and do not extend to cases where transformations are conditionally independent but noncommutative. To extend current representation learning frameworks, we draw inspiration from Galois theory, where the decomposition of groups through normal subgroups provides an approach for the analysis of structured transformations. Normal subgroups naturally extend commutativity under certain conditions and offer a foundation for the categorization of transformations, even when they do not commute. In this paper, we propose a novel approach that leverages normal subgroups to enable the separation of conditionally independent transformations, even in the absence of commutativity. Through experiments on geometric transformations in images, we show that our method successfully categorizes conditionally independent transformations, such as rotation and translation, in an unsupervised manner, suggesting a close link between group decomposition via normal subgroups and transformation categorization in representation learning.
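
The rotation-and-translation example mentioned in this abstract can be made concrete with a small numerical check. The sketch below is only an illustration of the underlying group-theoretic fact, not the paper's learning method: in the group of planar rigid motions, rotations and translations do not commute, yet conjugating a translation by a rotation always gives another pure translation, which is what it means for the translations to form a normal subgroup. The `(angle, tx, ty)` encoding and the function names are assumptions made for this example.

```python
import math


def compose(g, h):
    """Return the rigid motion 'apply h first, then g'. Elements are (angle, tx, ty)."""
    a, ax, ay = g
    b, bx, by = h
    c, s = math.cos(a), math.sin(a)
    # Rotate h's translation by g's angle, then add g's translation.
    return (a + b, c * bx - s * by + ax, s * bx + c * by + ay)


def inverse(g):
    """Inverse rigid motion: rotate back, and undo the rotated translation."""
    a, tx, ty = g
    c, s = math.cos(a), math.sin(a)
    return (-a, -(c * tx + s * ty), -(-s * tx + c * ty))


def conjugate(g, h):
    """Return g * h * g^-1."""
    return compose(g, compose(h, inverse(g)))
```

For a quarter-turn `rot = (math.pi / 2, 0.0, 0.0)` and a unit shift `trans = (0.0, 1.0, 0.0)`, `compose(rot, trans)` and `compose(trans, rot)` differ in their translation part, while `conjugate(rot, trans)` has zero rotation angle and is the shift rotated to point along the y-axis.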

T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models

Minki Kang,Jongwon Jeong,Jaewoong Cho

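Task: Investigate whether small language models (sLMs) can reliably self-verify their outputs under test-time compute scaling.

Motivation: Prior work mainly relies on a larger model as the verifier, leaving the self-verification ability of small language models underexplored.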

Details

Method: Proposes Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools such as a code interpreter. Result: Experiments show that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the larger Llama-3.1 8B model, and that T1 generalizes to mathematical and multi-domain knowledge-intensive tasks. Conclusion: Tool integration can substantially improve the self-verification ability of small language models. Abstract: Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). However, prior research has mainly examined test-time compute scaling with an additional larger model as a verifier, leaving self-verification by sLMs underexplored. In this work, we investigate whether sLMs can reliably self-verify their outputs under test-time scaling. We find that even with knowledge distillation from larger verifiers, sLMs struggle with verification tasks requiring memorization, such as numerical calculations and fact-checking. To address this limitation, we propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter. Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance. Experiments on the MATH benchmark demonstrate that, with T1, a Llama-3.2 1B model under test-time scaling outperforms the significantly larger Llama-3.1 8B model. Moreover, T1 generalizes effectively to both mathematical (MATH500) and multi-domain knowledge-intensive tasks (MMLU-Pro). Our findings highlight the potential of tool integration to substantially improve the self-verification abilities of sLMs.
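
The idea of delegating numerical checks to a tool can be sketched as follows. This is a hypothetical illustration rather than the T1 pipeline itself: it assumes each reasoning step has already been reduced to a claim of the form `expression = value`, and it uses Python's `ast` module as a stand-in for the code interpreter so that only plain arithmetic is evaluated. The helper names `verify_step` and `filter_candidates` are invented for this example.

```python
import ast
import operator

# Arithmetic operators the checker is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def _eval(node):
    """Evaluate a parsed arithmetic expression, rejecting anything else."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("unsupported expression")


def verify_step(step, tol=1e-9):
    """Check a claim such as '12 * 7 = 84' by computing both sides."""
    lhs, _, rhs = step.partition("=")
    left = _eval(ast.parse(lhs.strip(), mode="eval").body)
    right = _eval(ast.parse(rhs.strip(), mode="eval").body)
    return abs(left - right) <= tol


def filter_candidates(candidates):
    """Keep only candidate solutions whose every arithmetic step checks out."""
    return [steps for steps in candidates if all(verify_step(s) for s in steps)]
```

In a test-time scaling setup, a filter like this would run over sampled candidate solutions before any learned verifier scores them, so the small model does not have to recall arithmetic facts itself.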

Skin Color Measurement from Dermatoscopic Images: An Evaluation on a Synthetic Dataset

Marin Benčević,Robert Šojo,Irena Galić

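Task: Evaluate skin color measurement methods on dermatoscopic images.

Motivation: Use a synthetic dataset (S-SYNTH) to control multiple variables and thereby assess the robustness and invariance of the methods to lighting conditions.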

Details

Method: Evaluates four classes of image colorimetry approaches: segmentation-based, patch-based, color quantization, and neural networks. Result: Segmentation-based and color quantization methods are robust and lighting-invariant, whereas patch-based methods require calibration; neural networks combined with blurring can give lighting-invariant predictions, but their generalization is unverified. Conclusion: Offers practical recommendations for designing fair and reliable skin color estimation methods. Abstract: This paper presents a comprehensive evaluation of skin color measurement methods from dermatoscopic images using a synthetic dataset (S-SYNTH) with controlled ground-truth melanin content, lesion shapes, hair models, and 18 distinct lighting conditions. This allows for rigorous assessment of the robustness and invariance to lighting conditions. We assess four classes of image colorimetry approaches: segmentation-based, patch-based, color quantization, and neural networks. We use these methods to estimate the Individual Typology Angle (ITA) and Fitzpatrick types from dermatoscopic images. Our results show that segmentation-based and color quantization methods yield robust, lighting-invariant estimates, whereas patch-based approaches exhibit significant lighting-dependent biases that require calibration. Furthermore, neural network models, particularly when combined with heavy blurring to reduce overfitting, can provide light-invariant Fitzpatrick predictions, although their generalization to real-world images remains unverified. We conclude with practical recommendations for designing fair and reliable skin color estimation methods.
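
For readers unfamiliar with the Individual Typology Angle mentioned in this abstract, the sketch below shows the standard definition, ITA = arctan((L* − 50) / b*) in degrees, computed from CIELAB lightness L* and the yellow-blue component b*. It assumes the skin pixels have already been converted to CIELAB and averaged; the category cut-offs are the commonly cited ITA bands and are included as an assumption, since the abstract does not state which mapping the authors use.

```python
import math


def ita_degrees(l_star, b_star):
    """Individual Typology Angle in degrees from CIELAB L* and b*.

    atan2 is used instead of a plain division so that b* = 0 does not
    raise an error; for b* > 0 it matches arctan((L* - 50) / b*).
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))


def ita_category(ita):
    """Map an ITA value to a skin-tone band (commonly cited cut-offs, assumed here)."""
    if ita > 55:
        return "very light"
    if ita > 41:
        return "light"
    if ita > 28:
        return "intermediate"
    if ita > 10:
        return "tan"
    if ita > -30:
        return "brown"
    return "dark"
```

For example, `ita_degrees(70.0, 20.0)` is 45 degrees, which falls in the "light" band under these cut-offs.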

TathyaNyaya and FactLegalLlama: Advancing Factual Judgment Prediction and Explanation in the Indian Legal Context

Shubham Kumar Nigam,Balaramamahanthi Deepak Patnaik,Shivam Mishra,Noel Shallum,Kripabandhu Ghosh,Arnab Bhattacharya

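Task: Build the TathyaNyaya dataset and develop the FactLegalLlama model for Fact-based Judgment Prediction and Explanation (FJPE) in the Indian legal context.

Motivation: Develop an AI-driven decision-making tool focused on factual data to increase transparency and interpretability in the legal system.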

Details

Method: Combines the TathyaNyaya dataset (judgments from the Supreme Court of India and High Courts) with FactLegalLlama (a model fine-tuned from LLaMa-3-8B) for judgment prediction and explanation generation. Result: TathyaNyaya surpasses existing datasets in scale and diversity, and FactLegalLlama performs strongly in predictive accuracy and explanation quality. Conclusion: TathyaNyaya and FactLegalLlama are important resources for AI-assisted legal decision-making and underscore the importance of factual precision and domain-specific tuning. Abstract: In the landscape of Fact-based Judgment Prediction and Explanation (FJPE), reliance on factual data is essential for developing robust and realistic AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest annotated dataset for FJPE tailored to the Indian legal context, encompassing judgments from the Supreme Court of India and various High Courts. Derived from the Hindi terms "Tathya" (fact) and "Nyaya" (justice), the TathyaNyaya dataset is uniquely designed to focus on factual statements rather than complete legal texts, reflecting real-world judicial processes where factual data drives outcomes. Complementing this dataset, we present FactLegalLlama, an instruction-tuned variant of the LLaMa-3-8B Large Language Model (LLM), optimized for generating high-quality explanations in FJPE tasks. Finetuned on the factual data in TathyaNyaya, FactLegalLlama integrates predictive accuracy with coherent, contextually relevant explanations, addressing the critical need for transparency and interpretability in AI-assisted legal systems. Our methodology combines transformers for binary judgment prediction with FactLegalLlama for explanation generation, creating a robust framework for advancing FJPE in the Indian legal domain. TathyaNyaya not only surpasses existing datasets in scale and diversity but also establishes a benchmark for building explainable AI systems in legal analysis. The findings underscore the importance of factual precision and domain-specific tuning in enhancing predictive performance and interpretability, positioning TathyaNyaya and FactLegalLlama as foundational resources for AI-assisted legal decision-making.

AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection

Peng Wu,Wanshun Su,Guansong Pang,Yujia Sun,Qingsen Yan,Peng Wang,Yanning Zhang

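Task: Propose a novel weakly supervised framework that uses audio-visual collaboration for robust video anomaly detection.

Motivation: Address the information insufficiency and high false-positive rates of conventional visual-only detection methods in complex environments.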

Details

Method: Builds on the cross-modal representation learning of Contrastive Language-Image Pretraining (CLIP), proposing an efficient audio-visual fusion and a dynamic audio-visual prompt, and develops an uncertainty-driven feature distillation module. Result: The framework shows superior performance on multiple benchmarks, and audio integration significantly improves anomaly detection accuracy. Conclusion: Through audio-visual collaboration and uncertainty-driven distillation, the framework significantly improves the robustness and accuracy of video anomaly detection. Abstract: With the increasing adoption of video anomaly detection in intelligent surveillance domains, conventional visual-based detection approaches often struggle with information insufficiency and high false-positive rates in complex environments. To address these limitations, we present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Capitalizing on the exceptional cross-modal representation learning capabilities of Contrastive Language-Image Pretraining (CLIP) across visual, audio, and textual domains, our framework introduces two major innovations: an efficient audio-visual fusion that enables adaptive cross-modal integration through lightweight parametric adaptation while maintaining the frozen CLIP backbone, and a novel audio-visual prompt that dynamically enhances text embeddings with key multimodal information based on the semantic correlation between audio-visual features and textual labels, significantly improving CLIP's generalization for the video anomaly detection task. Moreover, to enhance robustness against modality deficiency during inference, we further develop an uncertainty-driven feature distillation module that synthesizes audio-visual representations from visual-only inputs. This module employs uncertainty modeling based on the diversity of audio-visual features to dynamically emphasize challenging features during the distillation process. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy in various scenarios. Notably, with unimodal data enhanced by uncertainty-driven distillation, our approach consistently outperforms current unimodal VAD methods.

Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs

Ankush Raut,Xiaofeng Zhu,Maria Leonor Pacheco

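Task: Evaluate the ability of Large Language Models (LLMs) to leverage contextual information encoded as structured linguistic representations (AMR).

Motivation: Study how AMR structures affect LLM performance across language tasks, in particular short-context versus long-context tasks.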

Details

Method: Uses 8-bit quantized, instruction-tuned versions of Llama 3.1, Phi-3, and Mistral 7B to analyze the effect of AMR on model performance. Result: In short contexts AMR degrades performance, while in long contexts (such as dialogue summarization) it improves performance noticeably (for example, the cosine similarity of Llama 3.1 rises from 66.2% to 76%). Conclusion: The gains from AMR are more pronounced on long-context tasks and are more evident for newer, larger models. Abstract: This paper evaluates the ability of Large Language Models (LLMs) to leverage contextual information in the form of structured linguistic representations. Specifically, we examine the impact of encoding both short and long contexts using Abstract Meaning Representation (AMR) structures across a diverse set of language tasks. We perform our analysis using 8-bit quantized and instruction-tuned versions of Llama 3.1 (8B), Phi-3, and Mistral 7B. Our results indicate that, for tasks involving short contexts, augmenting the prompt with the AMR of the original language context often degrades the performance of the underlying LLM. However, for tasks that involve long contexts, such as dialogue summarization in the SAMSum dataset, this enhancement improves LLM performance, for example, by increasing the zero-shot cosine similarity score of Llama 3.1 from 66.2% to 76%. This improvement is more evident in the newer and larger LLMs, but does not extend to the older or smaller ones. In addition, we observe that LLMs can effectively reconstruct the original text from a linearized AMR, achieving a cosine similarity of 81.3% in the best-case scenario.

Attributed Synthetic Data Generation for Zero-shot Domain-specific Image Classification

Shijian Wang,Linxin Song,Ryotaro Shimizu,Masayuki Goto,Hanqian Wu

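Task: Propose AttrSyn, a method for generating more diverse synthetic training images for zero-shot domain-specific image classification.

Motivation: Existing methods rely on simple prompt strategies, which limits the diversity of synthetic images and leaves performance below that of real images.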

Details

Method: Uses large language models to generate attributed prompts, which in turn produce more diverse synthetic images. Result: Experiments on two fine-grained datasets show that training on AttrSyn's synthetic images significantly outperforms CLIP's zero-shot classification in most situations and consistently surpasses simple prompt strategies. Conclusion: By generating diverse synthetic images, AttrSyn significantly improves zero-shot domain-specific image classification. Abstract: Zero-shot domain-specific image classification is challenging in classifying real images without ground-truth in-domain training examples. Recent research involved knowledge from texts with a text-to-image model to generate in-domain training images in zero-shot scenarios. However, existing methods heavily rely on simple prompt strategies, limiting the diversity of synthetic training images, thus leading to inferior performance compared to real images. In this paper, we propose AttrSyn, which leverages large language models to generate attributed prompts. These prompts allow for the generation of more diverse attributed synthetic images. Experiments for zero-shot domain-specific image classification on two fine-grained datasets show that training with synthetic images generated by AttrSyn significantly outperforms CLIP's zero-shot classification under most situations and consistently surpasses simple prompt strategies.

Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations

Leonardo Ranaldi,Federico Ranaldi,Fabio Massimo Zanzotto,Barry Haddow,Alexandra Birch

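Task: Introduce Dialectic-RAG (DRAG) to make retrieval-augmented generation (RAG) with large language models (LLMs) more analytical and critical.

Motivation: Address the knowledge conflicts and heterogeneity that RAG can encounter in multilingual retrieval, so that it becomes more analytical and critical.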

Details

Method: Proposes DRAG, which structures the reasoning process around Argumentative Explanations, systematically evaluating retrieved information by comparing, contrasting, and resolving conflicting perspectives. Result: Experiments show that DRAG significantly improves RAG approaches at low computational cost and is robust to knowledge perturbations. Conclusion: DRAG offers a more analytical, critical, and reliable approach to RAG, suited to multilingual retrieval scenarios. Abstract: Retrieval-augmented generation (RAG) is key to enhancing large language models (LLMs) to systematically access richer factual knowledge. Yet, using RAG brings intrinsic challenges, as LLMs must deal with potentially conflicting knowledge, especially in multilingual retrieval, where the heterogeneity of knowledge retrieved may deliver different outlooks. To make RAG more analytical, critical and grounded, we introduce Dialectic-RAG (DRAG), a modular approach guided by Argumentative Explanations, i.e., a structured reasoning process that systematically evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives. Given a query and a set of multilingual related documents, DRAG selects and exemplifies relevant knowledge for delivering dialectic explanations that, by critically weighing opposing arguments and filtering extraneous content, clearly determine the final response. Through a series of in-depth experiments, we show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models. The final results demonstrate that DRAG significantly improves RAG approaches, requiring low-impact computational effort and providing robustness to knowledge perturbations.

Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection

Jiancheng Pan,Yanxing Liu,Xiao He,Long Peng,Jiahao Li,Yuze Sun,Xiaomeng Huang

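Task: Improve foundation models on cross-domain few-shot object detection (CD-FSOD) by combining image data augmentation with a grid-based sub-domain search strategy.

Motivation: Foundation models perform well on CD-FSOD, but how to further optimize them and reduce their dependence on large amounts of data remains a key question.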

Details

Method: Applies image data augmentation and a grid-based sub-domain search strategy on top of GroundingDINO to optimize the parameter configuration. Result: Significantly improves foundation model performance in data-scarce settings and provides a way to optimize cross-domain generalization. Conclusion: The method supports practical deployment of vision-language models in data-scarce environments and reduces the burden of retraining. Abstract: Foundation models pretrained on extensive datasets, such as GroundingDINO and LAE-DINO, have performed remarkably in the cross-domain few-shot object detection (CD-FSOD) task. Through rigorous few-shot training, we found that the integration of image-based data augmentation techniques and grid-based sub-domain search strategy significantly enhances the performance of these foundation models. Building upon GroundingDINO, we employed several widely used image augmentation methods and established optimization objectives to effectively navigate the expansive domain space in search of optimal sub-domains. This approach facilitates efficient few-shot object detection and introduces an approach to solving the CD-FSOD problem by efficiently searching for the optimal parameter configuration from the foundation model. Our findings substantially advance the practical deployment of vision-language models in data-scarce environments, offering critical insights into optimizing their cross-domain generalization capabilities without labor-intensive retraining. Code is available at https://github.com/jaychempan/ETS.

I only read it for the plot! Maturity Ratings Affect Fanfiction Style and Community Engagement

Mia Jacobsen,Ross Deans Kristensen-McLachlan

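Task: Analyze the textual profiles of different fanfiction maturity ratings and their relation to reader engagement.

Motivation: Explore the relationship between fanfiction writing motivations and maturity ratings, and the influence of community norms and fan behavior on cultural products.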

Details

Method: Compares the textual profiles of different maturity ratings and how they vary across fan groups. Result: Fanfiction rated explicit has a distinct textual profile. Conclusion: The study deepens the understanding of reader and writer motivations in fanfiction communities and highlights the influence of community norms on cultural products. Abstract: We consider the textual profiles of different fanfiction maturity ratings, how they vary across fan groups, and how this relates to reader engagement metrics. Previous studies have shown that fanfiction writing is motivated by a combination of admiration for and frustration with the fan object. These findings emerge when looking at fanfiction as a whole, as well as when it is divided into subgroups, also called fandoms. However, maturity ratings are used to indicate the intended audience of the fanfiction, as well as whether the story includes mature themes and explicit scenes. Since these ratings can be used to filter readers and writers, they can also be seen as a proxy for different reader/writer motivations and desires. We find that explicit fanfiction in particular has a distinct textual profile when compared to other maturity ratings. These findings thus nuance our understanding of reader/writer motivations in fanfiction communities, and also highlight the influence of the community norms and fan behavior more generally on these cultural products.

SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation

Junjie Jiang,Zelin Wang,Manqi Zhao,Yin Li,DongSheng Jiang

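Task: Propose SAM2MOT, a segmentation-based multi-object tracking method that extends the single-object tracking capability of SAM2.

Motivation: Reduce the reliance of multi-object tracking on detection accuracy, and achieve zero-shot generalization and strong object association.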

Details

Method: Generates tracking boxes directly from segmentation masks, and introduces a trajectory manager system and a cross-object interaction module. Result: Achieves state-of-the-art results on DanceTrack, UAVDT, and BDD100K, with especially strong results on DanceTrack. Conclusion: SAM2MOT performs well in multi-object tracking, with the advantages of zero-shot generalization and strong object association. Abstract: Segment Anything 2 (SAM2) enables robust single-object tracking using segmentation. To extend this to multi-object tracking (MOT), we propose SAM2MOT, introducing a novel Tracking by Segmentation paradigm. Unlike Tracking by Detection or Tracking by Query, SAM2MOT directly generates tracking boxes from segmentation masks, reducing reliance on detection accuracy. SAM2MOT has two key advantages: zero-shot generalization, allowing it to work across datasets without fine-tuning, and strong object association, inherited from SAM2. To further improve performance, we integrate a trajectory manager system for precise object addition and removal, and a cross-object interaction module to handle occlusions. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT.
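
The step of turning a segmentation mask into a tracking box can be illustrated with a few lines of code. This is a minimal sketch of the generic mask-to-box conversion, assuming a binary mask stored as a list of rows; it is not taken from the SAM2MOT code base, and the function name is invented for this example.

```python
def mask_to_box(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around nonzero pixels.

    mask is a rectangular list of rows holding 0/1 (or False/True) values.
    Coordinates are inclusive pixel indices; None means the mask is empty,
    for example when the object is fully occluded in this frame.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    width = len(mask[0])
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    return (cols[0], rows[0], cols[-1], rows[-1])
```

Because the box is derived from the mask rather than from a detector, its quality depends on the segmentation alone, which is the sense in which the abstract says reliance on detection accuracy is reduced.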

Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

Ruikang Liu,Yuxuan Sun,Manyi Zhang,Haoli Bai,Xianzhi Yu,Tiezheng Yu,Chun Yuan,Lu Hou

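Task: Study the impact of quantization on the performance of reasoning language models.

Motivation: Quantization is widely used to reduce the inference cost of large language models, but its impact on reasoning models remains understudied.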

Details

Method: Applies weight, KV cache, and activation quantization to open-source models (the DeepSeek-R1-Distilled Qwen and LLaMA families, and QwQ-32B) and evaluates them on multiple reasoning benchmarks. Result: W8A8 or W4A16 quantization can be lossless, but lower bit-widths significantly reduce accuracy; model size, model origin, and task difficulty are key factors. Conclusion: The performance of quantized models can be improved by scaling model size or reasoning steps; the code and models will be open-sourced. Abstract: Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this study, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, and QwQ-32B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and code will be open-sourced at https://github.com/ruikangliu/Quantized-Reasoning-Models.
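
To make the bit-width trade-off concrete, here is a minimal sketch of symmetric per-tensor round-to-nearest weight quantization. It is a generic baseline, not one of the state-of-the-art algorithms the study evaluates, and the weights are made-up values. In the usual WxAy notation, W4A16 means 4-bit weights with 16-bit activations.

```python
from typing import List


def quantize_dequantize(weights: List[float], bits: int) -> List[float]:
    """Symmetric per-tensor round-to-nearest quantization, then dequantization."""
    q_max = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / q_max
    restored = []
    for w in weights:
        q = round(w / scale)
        q = max(-q_max, min(q_max, q))  # clamp to the signed integer range
        restored.append(q * scale)
    return restored


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weight values, chosen only for illustration.
weights = [0.91, -0.42, 0.07, -0.88, 0.33, 0.015, -0.6, 0.25]
err8 = mean_abs_error(weights, quantize_dequantize(weights, 8))
err4 = mean_abs_error(weights, quantize_dequantize(weights, 4))
print(f"8-bit error: {err8:.5f}, 4-bit error: {err4:.5f}")
```

The reconstruction error grows as the bit-width shrinks because the step size `scale` widens. Whether that error affects reasoning accuracy is the empirical question the paper addresses.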

SnapPix: Efficient-Coding--Inspired In-Sensor Compression for Edge Vision

Weikai Lin,Tianrui Ma,Adith Boloor,Yu Feng,Ruofan Xing,Xuan Zhang,Yuhao Zhu

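Task: Propose SnapPix, a sensor-algorithm co-designed system that compresses raw pixels in the analog domain to reduce edge energy consumption.

Motivation: Edge devices have limited compute and must transmit data to remote servers, so reducing energy consumption is critical.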

Details

Method: Uses coded exposure (CE) as the in-sensor compression strategy, proposes a task-agnostic method for learning the sampling/exposure pattern, and co-designs the downstream vision model to address the non-uniformity of CE-compressed images. Result: On action recognition and video reconstruction, SnapPix outperforms existing video-based methods at the same speed while reducing energy by up to 15.4x. Conclusion: Through sensor-algorithm co-design and lightweight hardware augmentations, SnapPix significantly reduces the energy consumption of edge image acquisition. Abstract: Energy-efficient image acquisition on the edge is crucial for enabling remote sensing applications where the sensor node has weak compute capabilities and must transmit data to a remote server/cloud for processing. To reduce the edge energy consumption, this paper proposes a sensor-algorithm co-designed system called SnapPix, which compresses raw pixels in the analog domain inside the sensor. We use coded exposure (CE) as the in-sensor compression strategy as it offers the flexibility to sample, i.e., selectively expose pixels, both spatially and temporally. SnapPix has three contributions. First, we propose a task-agnostic strategy to learn the sampling/exposure pattern based on the classic theory of efficient coding. Second, we co-design the downstream vision model with the exposure pattern to address the pixel-level non-uniformity unique to CE-compressed images. Finally, we propose lightweight augmentations to the image sensor hardware to support our in-sensor CE compression. Evaluating on action recognition and video reconstruction, SnapPix outperforms state-of-the-art video-based methods at the same speed while reducing the energy by up to 15.4x. We have open-sourced the code at: https://github.com/horizon-research/SnapPix.

Discovering dynamical laws for speech gestures

Sam Kirkham

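Task: Discover dynamical models that govern articulatory gestures in speech.

Motivation: Explore the fundamental dynamical principles behind speech gestures, to understand how complex physical movements map onto the cognitive units of language.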

Details

Method: Uses a sparse symbolic regression algorithm to discover symbolic equation models from kinematic data on the tongue and lips, and examines candidate models with analytical techniques and numerical simulations. Result: A second-order linear model performs well in most cases, but a nonlinear force is needed to model articulatory dynamics accurately in approximately one third of cases. Conclusion: An autonomous, nonlinear, second-order differential equation is a viable dynamical law for articulatory gestures in speech; future work can further explore the potential of data-driven model discovery. Abstract: A fundamental challenge in the cognitive sciences is discovering the dynamics that govern behaviour. Take the example of spoken language, which is characterised by a highly variable and complex set of physical movements that map onto the small set of cognitive units that comprise language. What are the fundamental dynamical principles behind the movements that structure speech production? In this study, we discover models in the form of symbolic equations that govern articulatory gestures during speech. A sparse symbolic regression algorithm is used to discover models from kinematic data on the tongue and lips. We explore these candidate models using analytical techniques and numerical simulations, and find that a second-order linear model achieves high levels of accuracy, but a nonlinear force is required to properly model articulatory dynamics in approximately one third of cases. This supports the proposal that an autonomous, nonlinear, second-order differential equation is a viable dynamical law for articulatory gestures in speech. We conclude by identifying future opportunities and obstacles in data-driven model discovery and outline prospects for discovering the dynamical principles that govern language, brain and behaviour.
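
As a rough illustration of what a second-order linear model of a gesture looks like, the sketch below numerically integrates a damped linear oscillator, x'' = -k (x - target) - b x'. The equation form, the parameter values, and the function name `simulate_gesture` are assumptions for illustration only; they are not the equations the paper discovers.

```python
import math
from typing import List


def simulate_gesture(x0: float, target: float, k: float, b: float,
                     dt: float = 0.001, steps: int = 1000) -> List[float]:
    """Integrate x'' = -k (x - target) - b x' with semi-implicit Euler steps."""
    x, v = x0, 0.0
    trajectory = [x]
    for _ in range(steps):
        a = -k * (x - target) - b * v  # linear restoring force plus linear damping
        v += a * dt                    # update velocity first
        x += v * dt                    # then position, using the new velocity
        trajectory.append(x)
    return trajectory


k = 400.0                # hypothetical stiffness
b = 2.0 * math.sqrt(k)   # critical damping: fastest approach without oscillation
trajectory = simulate_gesture(x0=0.0, target=1.0, k=k, b=b)
print(f"position after 1 s: {trajectory[-1]:.4f}")
```

A nonlinear variant would add a term such as a cubic in (x - target) to the acceleration. The abstract reports that some nonlinear force is needed in roughly one third of cases but does not specify its form.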

The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?

Weichen Zhang,Ruiying Peng,Chen Gao,Jianjie Fang,Xin Zeng,Kaiyuan Li,Ziyou Wang,Jinqiang Cui,Xin Wang,Xinlei Chen,Yong Li

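Task: Evaluate whether point clouds truly improve the spatial reasoning capacity of 3D large language models (LLMs).

Motivation: Point clouds show promise for 3D spatial reasoning, but their specific role remains under-explored, so a comprehensive evaluation and analysis is needed.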

Details

Method: Evaluates the spatial reasoning capacity of LLMs under different input modalities by replacing point clouds with visual and text inputs, and proposes a new 3D question-answering benchmark, ScanReQA. Result: 1) LLMs without point cloud input achieve competitive performance even in a zero-shot setting; 2) existing 3D LLMs struggle to comprehend binary spatial relationships; 3) 3D LLMs are limited in exploiting the structural coordinates in point clouds for fine-grained spatial reasoning. Conclusion: The findings can inform the next steps for 3D LLMs and offer insights for foundation models in other modalities. Abstract: 3D Large Language Models (LLMs) leveraging spatial information in point clouds for 3D spatial reasoning attract great attention. Despite some promising results, the role of point clouds in 3D spatial reasoning remains under-explored. In this work, we comprehensively evaluate and analyze these models to answer the research question: Does point cloud truly boost the spatial reasoning capacities of 3D LLMs? We first evaluate the spatial reasoning capacity of LLMs with different input modalities by replacing the point cloud with the visual and text counterparts. We then propose a novel 3D QA (Question-answering) benchmark, ScanReQA, that comprehensively evaluates models' understanding of binary spatial relationships. Our findings reveal several critical insights: 1) LLMs without point input can achieve competitive performance even in a zero-shot manner; 2) existing 3D LLMs struggle to comprehend the binary spatial relationships; 3) 3D LLMs exhibit limitations in exploiting the structural coordinates in point clouds for fine-grained spatial reasoning. We think these conclusions can help the next step of 3D LLMs and also offer insights for foundation models in other modalities. We release datasets and reproducible code on the anonymous project page: https://3d-llm.xyz.

SAFT: Structure-aware Transformers for Textual Interaction Classification

Hongtao Wang,Renchi Yang,Hewen Wang,Haoran Zheng,Jianliang Xu

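Task: Propose SAFT, a new architecture that effectively fuses textual and structural semantics to improve textual interaction classification (TIC).

Motivation: Existing TIC solutions either fail to capture rich text semantics or disregard the bipartite structure and node heterogeneity of textual interaction networks (TINs), which limits performance.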

Details

Method: Integrates language- and graph-based modules, using line graph attention (LGA)/gated attention units (GAUs) and pretrained language models (PLMs) to model interaction-level and token-level signals, which are coupled iteratively via a proxy token. An efficient, theoretically grounded approach is also developed to encode local and global topology information. Result: Experiments on multiple real TIN datasets show that SAFT outperforms existing baselines in TIC accuracy. Conclusion: By fusing textual and structural semantics, SAFT significantly improves textual interaction classification. Abstract: Textual interaction networks (TINs) are an omnipresent data structure used to model the interplay between users and items on e-commerce websites, social networks, etc., where each interaction is associated with a text description. Classifying such textual interactions (TIC) finds extensive use in detecting spam reviews in e-commerce, fraudulent transactions in finance, and so on. Existing TIC solutions either (i) fail to capture the rich text semantics due to the use of context-free text embeddings, and/or (ii) disregard the bipartite structure and node heterogeneity of TINs, leading to compromised TIC performance. In this work, we propose SAFT, a new architecture that integrates language- and graph-based modules for the effective fusion of textual and structural semantics in the representation learning of interactions. In particular, line graph attention (LGA)/gated attention units (GAUs) and pretrained language models (PLMs) are capitalized on to model the interaction-level and token-level signals, which are further coupled via the proxy token in an iterative and contextualized fashion. Additionally, an efficient and theoretically grounded approach is developed to encode the local and global topology information pertaining to interactions into structural embeddings. The resulting embeddings not only inject the structural features underlying TINs into the textual interaction encoding but also facilitate the design of graph sampling strategies. Extensive empirical evaluations on multiple real TIN datasets demonstrate the superiority of SAFT over the state-of-the-art baselines in TIC accuracy.

Opening the black box of deep learning: Validating the statistical association between explainable artificial intelligence (XAI) and clinical domain knowledge in fundus image-based glaucoma diagnosis

Han Yuan,Lican Kang,Yong Li

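Task: Reveal the decision-making process of deep learning models in glaucoma classification using several Class Activation Map (CAM) techniques.

Motivation: Deep learning performs well on medical imaging tasks, but its black-box nature hinders widespread adoption in real-world healthcare settings.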

Details

Method: Uses four deep neural networks (VGG-11, ResNet-18, DeiT-Tiny, and Swin Transformer-Tiny) and five CAM methods (Grad-CAM, XGrad-CAM, Score-CAM, Eigen-CAM, and Layer-CAM) to generate model focus regions, and compares them with clinical anatomical knowledge. Result: For all models, the percentage of anatomical structures in the focus area is significantly higher than in the entire image, and model predictive ability correlates positively with the percentage of anatomical structures in the focus area. Conclusion: The study shows a convergence of decision logic between deep learning models and human clinicians, which helps build confidence in the trustworthiness of deep learning in healthcare. Abstract: While deep learning has exhibited remarkable predictive capabilities in various medical image tasks, its inherent black-box nature has hindered its widespread implementation in real-world healthcare settings. Our objective is to unveil the decision-making processes of deep learning models in the context of glaucoma classification by employing several Class Activation Map (CAM) techniques to generate model focus regions and comparing them with clinical domain knowledge of the anatomical area (optic cup, optic disk, and blood vessels). Four deep neural networks, including VGG-11, ResNet-18, DeiT-Tiny, and Swin Transformer-Tiny, were developed using binary diagnostic labels of glaucoma, and five CAM methods (Grad-CAM, XGrad-CAM, Score-CAM, Eigen-CAM, and Layer-CAM) were employed to highlight the model focus area. We applied the paired-sample t-test to compare the percentage of anatomies in the model focus area to the proportion of anatomies in the entire image. After that, Pearson's and Spearman's correlation tests were implemented to examine the relationship between model predictive ability and the percentage of anatomical structures in the model focus area. On five public glaucoma datasets, all deep learning models consistently displayed statistically significantly higher percentages of anatomical structures in the focus area than the proportions of anatomical structures in the entire image. Also, we validated the positive relationship between the percentage of anatomical structures in the focus area and model predictive performance. Our study provides evidence of the convergence of decision logic between deep neural networks and human clinicians through rigorous statistical tests. We anticipate that it can help alleviate clinicians' concerns regarding the trustworthiness of deep learning in healthcare. For reproducibility, the code and dataset have been released on GitHub.
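
The paired-sample comparison described above can be sketched as follows. The per-image percentages are invented for illustration and the helper name `paired_t_statistic` is hypothetical; the study's actual data and code are in its released repository.

```python
import math
import statistics
from typing import List, Tuple


def paired_t_statistic(a: List[float], b: List[float]) -> Tuple[float, int]:
    """Return the paired-sample t statistic and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Made-up fractions of anatomical-structure pixels, one pair per image:
# inside the CAM focus area versus across the entire image.
focus_pct = [0.42, 0.38, 0.51, 0.47, 0.40, 0.45]
image_pct = [0.20, 0.22, 0.25, 0.21, 0.19, 0.24]

t_stat, dof = paired_t_statistic(focus_pct, image_pct)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

A large positive t statistic indicates that the focus area consistently contains a higher share of anatomical structures than the image as a whole. In practice a library routine such as `scipy.stats.ttest_rel` would also return the p-value.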

Leveraging Large Language Models for Cost-Effective, Multilingual Depression Detection and Severity Assessment

Longdi Xian,Jianzhang Ni,Mingzhu Wang

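Task: Evaluate four large language models on depression detection, and further test the best model in severity assessment and knowledge-enhanced scenarios.

Motivation: Depression is a prevalent mental health disorder that is difficult to detect early, and large language models offer an efficient and cost-effective approach.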

Details

Method: Evaluates four LLMs on clinical interview data, selects the best model and tests it in severity assessment and knowledge-enhanced scenarios, and evaluates robustness in complex diagnostic scenarios. Result: DeepSeek V3 is the most reliable and cost-effective model for depression detection and performs well in both zero-shot and few-shot scenarios, but its severity assessments show low agreement with the human evaluator. Conclusion: DeepSeek V3 shows potential for depression detection, but severity assessment needs further refinement and potential biases need mitigation to improve clinical reliability. Abstract: Depression is a prevalent mental health disorder that is difficult to detect early due to subjective symptom assessments. Recent advancements in large language models have offered efficient and cost-effective approaches for this objective. In this study, we evaluated the performance of four LLMs in depression detection using clinical interview data. We selected the best-performing model and further tested it in the severity evaluation scenario and the knowledge-enhanced scenario. The robustness was evaluated in complex diagnostic scenarios using a dataset comprising 51074 statements from six different mental disorders. We found that DeepSeek V3 is the most reliable and cost-effective model for depression detection, performing well in both zero-shot and few-shot scenarios, with zero-shot being the most efficient choice. The evaluation of severity showed low agreement with the human evaluator, particularly for mild depression. The model maintains stably high AUCs for detecting depression in complex diagnostic scenarios. These findings highlight DeepSeek V3's strong potential for text-based depression detection in real-world clinical applications. However, they also underscore the need for further refinement in severity assessment and the mitigation of potential biases to enhance clinical reliability.

Advancing Egocentric Video Question Answering with Multimodal Large Language Models

Alkesh Patel,Vibhav Chitalia,Yinfei Yang

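Task: Evaluate multimodal large language models (MLLMs) on the QaEgo4Dv2 dataset using zero-shot and fine-tuned approaches.

Motivation: Address challenges in egocentric video question answering such as long-horizon temporal reasoning and camera movement.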

Details

Method: Evaluates four MLLMs (GPT-4o, Gemini-1.5-Pro, Video-LLaVa-7B, and Qwen2-VL-7B-Instruct) on QaEgo4Dv2 with zero-shot and fine-tuned approaches. Result: Fine-tuned Video-LLaVa-7B and Qwen2-VL-7B-Instruct surpass previous benchmarks by up to +2.6% ROUGE/METEOR on OpenQA and +13% accuracy on CloseQA. Conclusion: The models still have room for improvement in spatial reasoning and fine-grained object recognition. Abstract: Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2 - a refined dataset of egocentric videos derived from QaEgo4D. Four popular MLLMs (GPT-4o, Gemini-1.5-Pro, Video-LLaVa-7B and Qwen2-VL-7B-Instruct) are assessed using zero-shot and fine-tuned approaches for both OpenQA and CloseQA settings. We introduce QaEgo4Dv2 to mitigate annotation noise in QaEgo4D, enabling more reliable comparison. Our results show that fine-tuned Video-LLaVa-7B and Qwen2-VL-7B-Instruct achieve new state-of-the-art performance, surpassing previous benchmarks by up to +2.6% ROUGE/METEOR (for OpenQA) and +13% accuracy (for CloseQA). We also present a thorough error analysis, indicating the model's difficulty in spatial reasoning and fine-grained object recognition - key areas for future improvement.

Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration

Ran Xu,Wenqi Shi,Yuchen Zhuang,Yue Yu,Joyce C. Ho,Haoyu Wang,Carl Yang

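Task: Propose Collab-RAG, a collaborative training framework that improves accuracy on multi-hop question answering.

Motivation: Address the problems RAG systems face on multi-hop question answering due to irrelevant context retrieval and limited complex reasoning capabilities.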

Details

Method: Collaboratively trains a white-box small language model (SLM) with a black-box large language model (LLM): the SLM decomposes complex questions into sub-questions, and the LLM provides feedback signals. Result: Outperforms baselines by 1.8%-14.2% on average across five multi-hop QA datasets; a 3B SLM surpasses a 32B LLM in question decomposition. Conclusion: Collab-RAG significantly improves reasoning and retrieval for complex questions through collaborative training, without additional distillation from frontier LLMs. Abstract: Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a black-box large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus enhancing the accuracy of the retrieval and facilitating more effective reasoning by the black-box LLM. Concurrently, the black-box LLM provides feedback signals to improve the SLM's decomposition capability. We observe that Collab-RAG relies solely on supervision from an affordable black-box LLM without additional distillation from frontier LLMs, yet demonstrates strong generalization across multiple black-box LLMs. Experimental evaluations across five multi-hop QA datasets demonstrate that Collab-RAG substantially outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%-14.2% on average. In particular, our fine-tuned 3B SLM surpasses a frozen 32B LLM in question decomposition, highlighting the efficiency of Collab-RAG in improving reasoning and retrieval for complex questions. The code of Collab-RAG is available at https://github.com/ritaranx/Collab-RAG/.

DyCON: Dynamic Uncertainty-aware Consistency and Contrastive Learning for Semi-supervised Medical Image Segmentation

Maregu Assefa,Muzammal Naseer,Iyyakutti Iyappan Ganapathi,Syed Sadaf Ali,Mohamed L Seghier,Naoufel Werghi

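Task: Propose DyCON, a Dynamic Uncertainty-aware Consistency and Contrastive Learning framework, to address class imbalance and high uncertainty in medical image segmentation.

Motivation: Existing semi-supervised learning methods perform poorly in medical image segmentation because of class imbalance and high uncertainty from pathology variations.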

Details

Method: DyCON combines an Uncertainty-aware Consistency Loss (UnCL) and a Focal Entropy-aware Contrastive Loss (FeCL), dynamically adjusting weights to optimize global consistency and local feature discrimination. Result: DyCON outperforms existing methods on four medical image segmentation datasets (ISLES'22, BraTS'19, LA, Pancreas). Conclusion: Through dynamic uncertainty awareness and contrastive learning, DyCON significantly improves the accuracy and robustness of medical image segmentation. Abstract: Semi-supervised learning in medical image segmentation leverages unlabeled data to reduce annotation burdens through consistency learning. However, current methods struggle with class imbalance and high uncertainty from pathology variations, leading to inaccurate segmentation in 3D medical images. To address these challenges, we present DyCON, a Dynamic Uncertainty-aware Consistency and Contrastive Learning framework that enhances the generalization of consistency methods with two complementary losses: Uncertainty-aware Consistency Loss (UnCL) and Focal Entropy-aware Contrastive Loss (FeCL). UnCL enforces global consistency by dynamically weighting the contribution of each voxel to the consistency loss based on its uncertainty, preserving high-uncertainty regions instead of filtering them out. Initially, UnCL prioritizes learning from uncertain voxels with lower penalties, encouraging the model to explore challenging regions. As training progresses, the penalty shifts towards confident voxels to refine predictions and ensure global consistency. Meanwhile, FeCL enhances local feature discrimination in imbalanced regions by introducing dual focal mechanisms and adaptive confidence adjustments into the contrastive principle. These mechanisms jointly prioritize hard positives and negatives while focusing on uncertain sample pairs, effectively capturing subtle lesion variations under class imbalance. Extensive evaluations on four diverse medical image segmentation datasets (ISLES'22, BraTS'19, LA, Pancreas) show DyCON's superior performance against SOTA methods.

M-Prometheus: A Suite of Open Multilingual LLM Judges

José Pombal,Dongkeun Yoon,Patrick Fernandes,Ian Wu,Seungone Kim,Ricardo Rei,Graham Neubig,André F. T. Martins

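Task: Develop M-Prometheus, a suite of multilingual language models for automatically evaluating long-form text.

Motivation: Most current LLM judges are optimized only for English and lack multilingual evaluation capabilities, which hinders the development of multilingual models.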

Details

Method: Introduces M-Prometheus, a suite of open-weight language models ranging from 3B to 14B parameters that support direct assessment and pairwise comparison feedback on multilingual outputs. Result: M-Prometheus outperforms existing open models on multilingual reward benchmarks and literary machine translation evaluation, and can significantly improve the quality of generated text. Conclusion: Ablations identify the key factors for building an effective multilingual judge; the models, training data, and code are released. Abstract: The use of language models for automatically evaluating long-form text (LLM-as-a-judge) is becoming increasingly common, yet most LLM judges are optimized exclusively for English, with strategies for enhancing their multilingual evaluation capabilities remaining largely unexplored in the current literature. This has created a disparity in the quality of automatic evaluation methods for non-English languages, ultimately hindering the development of models with better multilingual capabilities. To bridge this gap, we introduce M-Prometheus, a suite of open-weight LLM judges ranging from 3B to 14B parameters that can provide both direct assessment and pairwise comparison feedback on multilingual outputs. M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as well as on literary machine translation (MT) evaluation covering 4 language pairs. Furthermore, M-Prometheus models can be leveraged at decoding time to significantly improve generated outputs across all 3 tested languages, showcasing their utility for the development of better multilingual models. Lastly, through extensive ablations, we identify the key factors for obtaining an effective multilingual judge, including backbone model selection and training on natively multilingual feedback data instead of translated data. We release our models, training dataset, and code.

Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric

Mohamed Eltahir,Osamah Sarraj,Mohammed Bremoo,Mohammed Khurd,Abdulrahman Alfrihidi,Taha Alshatiri,Mohammad Almatrafi,Tanveer Hussain

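Task: Propose a unified framework that combines a visual matching stream and an aural matching stream for precise video retrieval, particularly for lengthy videos.

Motivation: Address the complexity of multimodal correlation in long-video retrieval and the challenge of unseen vocabulary and scenes.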

Details

Method: Combines a visual matching stream and an aural matching stream, and introduces a subtitles-based video segmentation approach and a two-stage audio retrieval mechanism. Result: Shows promising retrieval performance on the YouCook2 benchmark. Conclusion: The proposed framework and the new long-video retrieval evaluation method support future research. Abstract: Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. Additionally, the aural stream includes a complementary audio-based two-stage retrieval mechanism that enhances performance on long-duration videos. Considering the complex nature of retrieval from lengthy videos and its corresponding evaluation, we introduce a new retrieval evaluation method specifically designed for long-video retrieval to support further research. We conducted experiments on the YouCook2 benchmark, showing promising retrieval performance.

Constraint Multi-class Positive and Unlabeled Learning for Distantly Supervised Named Entity Recognition

Yuzhe Zhang,Min Cen,Hong Zhang

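Task: Propose CMPU, a new method that addresses the high false negative rate in distantly supervised named entity recognition (DS-NER).

Motivation: DS-NER relies on external knowledge bases to label training data automatically, but their inherent incompleteness leads to a high false negative rate.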

Details

Method: Introduces Constraint Multi-class Positive and Unlabeled Learning (CMPU), which adds a constraint factor to the risk estimator of multiple positive classes to improve robustness. Result: Theoretical analysis establishes the validity of CMPU, and experiments show that it outperforms existing DS-NER methods on two benchmark datasets. Conclusion: CMPU is a more robust method that effectively addresses the high false negative rate in DS-NER. Abstract: Distantly supervised named entity recognition (DS-NER) has been proposed to exploit the automatically labeled training data by external knowledge bases instead of human annotations. However, it tends to suffer from a high false negative rate due to the inherent incompleteness. To address this issue, we present a novel approach called Constraint Multi-class Positive and Unlabeled Learning (CMPU), which introduces a constraint factor on the risk estimator of multiple positive classes. It suggests that the constraint non-negative risk estimator is more robust against overfitting than previous PU learning methods with limited positive data. Solid theoretical analysis on CMPU is provided to prove the validity of our approach. Extensive experiments on two benchmark datasets that were labeled using diverse external knowledge sources serve to demonstrate the superior performance of CMPU in comparison to existing DS-NER methods.

Your Image Generator Is Your New Private Dataset

Nicolo Resmini,Eugenio Lomurno,Cristian Sbrolli,Matteo Matteucci

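Task: Propose the Text-Conditioned Knowledge Recycling (TCKR) pipeline to address key issues in using text-conditioned image generation to build classifier training sets.

Motivation: Generative diffusion models can address data scarcity and labelling costs, but challenges remain in constructing text prompts, adapting to specific domains, and ensuring robust performance.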

Details

Method: Combines dynamic image captioning, parameter-efficient diffusion model fine-tuning, and Generative Knowledge Distillation to create tailored synthetic datasets. Result: On ten classification benchmarks, classifiers trained on TCKR-generated data match or exceed those trained on real data, with substantially improved privacy. Conclusion: High-fidelity synthetic data can replace real data, offering both strong performance and privacy protection. Abstract: Generative diffusion models have emerged as powerful tools to synthetically produce training data, offering potential solutions to data scarcity and reducing labelling costs for downstream supervised deep learning applications. However, effectively leveraging text-conditioned image generation for building classifier training sets requires addressing key issues: constructing informative textual prompts, adapting generative models to specific domains, and ensuring robust performance. This paper proposes the Text-Conditioned Knowledge Recycling (TCKR) pipeline to tackle these challenges. TCKR combines dynamic image captioning, parameter-efficient diffusion model fine-tuning, and Generative Knowledge Distillation techniques to create synthetic datasets tailored for image classification. The pipeline is rigorously evaluated on ten diverse image classification benchmarks. The results demonstrate that models trained solely on TCKR-generated data achieve classification accuracies on par with (and in several cases exceeding) models trained on real images. Furthermore, the evaluation reveals that these synthetic-data-trained models exhibit substantially enhanced privacy characteristics: their vulnerability to Membership Inference Attacks is significantly reduced, with the membership inference AUC lowered by 5.49 points on average compared to using real training data, demonstrating a substantial improvement in the performance-privacy trade-off. These findings indicate that high-fidelity synthetic data can effectively replace real data for training classifiers, yielding strong performance whilst simultaneously providing improved privacy protection as a valuable emergent property. The code and trained models are available in the accompanying open-source repository.

Few Dimensions are Enough: Fine-tuning BERT with Selected Dimensions Revealed Its Redundant Nature

Shion Fukuhata,Yoshinobu Kano

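Task: Study which part of the final layer's output to select during BERT fine-tuning and what information each dimension holds.

Motivation: BERT fine-tuning commonly feeds part of the final layer's output into a new fully connected layer, but which part to select and what information each dimension holds remain unclear.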

Details

Method: Comprehensively evaluates the effectiveness and redundancy of token vectors, layers, and dimensions through BERT fine-tuning on GLUE tasks. Result: Outputs other than the CLS vector in the final layer contain equivalent information, most tasks require only 2-3 dimensions, and there is little difference among higher layers while the contribution of lower layers decreases. In addition, hidden layers may change significantly during fine-tuning, BERT has considerable redundancy that lets it handle multiple tasks simultaneously, and its number of dimensions may be excessive. Conclusion: BERT shows considerable redundancy under fine-tuning: hidden layers change substantially, higher layers differ little, and the number of dimensions may be excessive, which lets it handle multiple tasks. Abstract: When fine-tuning BERT models for specific tasks, it is common to select part of the final layer's output and input it into a newly created fully connected layer. However, it remains unclear which part of the final layer should be selected and what information each dimension of the layers holds. In this study, we comprehensively investigated the effectiveness and redundancy of token vectors, layers, and dimensions through BERT fine-tuning on GLUE tasks. The results showed that outputs other than the CLS vector in the final layer contain equivalent information, most tasks require only 2-3 dimensions, and while the contribution of lower layers decreases, there is little difference among higher layers. We also evaluated the impact of freezing pre-trained layers and conducted cross-fine-tuning, where fine-tuning is applied sequentially to different tasks. The findings suggest that hidden layers may change significantly during fine-tuning, BERT has considerable redundancy, enabling it to handle multiple tasks simultaneously, and its number of dimensions may be excessive.

Targetless LiDAR-Camera Calibration with Anchored 3D Gaussians

Haebeom Jung,Namtae Kim,Jungwoo Kim,Jaesik Park

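Task: Propose a targetless LiDAR-camera calibration method that jointly optimizes sensor poses and scene geometry.

Motivation: Traditional calibration methods rely on calibration targets (such as checkerboards or spherical reflectors), which limits the flexibility of application scenarios.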

Details

Method: Uses a 3D Gaussian scene representation, freezes reliable LiDAR points as anchors, and jointly optimizes poses and Gaussian parameters with a photometric loss. Result: Significantly reduces sensor misalignment and improves rendering quality and PSNR, outperforming the calibrated poses provided in the datasets. Conclusion: The method is validated on the KITTI-360 and Waymo datasets and is robust across different hardware configurations. Abstract: We present a targetless LiDAR-camera calibration method that jointly optimizes sensor poses and scene geometry from arbitrary scenes, without relying on traditional calibration targets such as checkerboards or spherical reflectors. Our approach leverages a 3D Gaussian-based scene representation. We first freeze reliable LiDAR points as anchors, then jointly optimize the poses and auxiliary Gaussian parameters in a fully differentiable manner using a photometric loss. This joint optimization significantly reduces sensor misalignment, resulting in higher rendering quality and consistently improved PSNR compared to the carefully calibrated poses provided in popular datasets. We validate our method through extensive experiments on two real-world autonomous driving datasets, KITTI-360 and Waymo, each featuring distinct sensor configurations. Additionally, we demonstrate the robustness of our approach using a custom LiDAR-camera setup, confirming strong performance across diverse hardware configurations.

A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models

Carlos Peláez-González,Andrés Herrera-Poyatos,Cristina Zuheros,David Herrera-Poyatos,Virilo Tejedor,Francisco Herrera

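Task: Study jailbreak vulnerabilities in large language models (LLMs) and propose a novel taxonomy grounded in training domains.

Motivation: Although LLMs show remarkable natural language processing capabilities, they still face challenges including consistency issues, hallucinations, and jailbreak vulnerabilities; jailbreaks in particular bypass alignment safeguards and lead to unsafe outputs.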

Details

Method: Proposes a taxonomy of jailbreak attacks grounded in the training domains of LLMs, characterizing alignment failures through generalization, objectives, and robustness gaps. Result: Proposes four categories of jailbreak attacks (mismatched generalization, competing objectives, adversarial robustness, and mixed attacks), revealing the nature of the vulnerabilities. Conclusion: The taxonomy gives a deeper understanding of LLM behavior and yields key lessons for future research. Abstract: The study of large language models (LLMs) is a key area in open-world machine learning. Although LLMs demonstrate remarkable natural language processing capabilities, they also face several challenges, including consistency issues, hallucinations, and jailbreak vulnerabilities. Jailbreaking refers to the crafting of prompts that bypass alignment safeguards, leading to unsafe outputs that compromise the integrity of LLMs. This work specifically focuses on the challenge of jailbreak vulnerabilities and introduces a novel taxonomy of jailbreak attacks grounded in the training domains of LLMs. It characterizes alignment failures through generalization, objectives, and robustness gaps. Our primary contribution is a perspective on jailbreak, framed through the different linguistic domains that emerge during LLM training and alignment. This viewpoint highlights the limitations of existing approaches and enables us to classify jailbreak attacks on the basis of the underlying model deficiencies they exploit. Unlike conventional classifications that categorize attacks based on prompt construction methods (e.g., prompt templating), our approach provides a deeper understanding of LLM behavior. We introduce a taxonomy with four categories -- mismatched generalization, competing objectives, adversarial robustness, and mixed attacks -- offering insights into the fundamental nature of jailbreak vulnerabilities. Finally, we present key lessons derived from this taxonomic study.

Systematic Literature Review on Vehicular Collaborative Perception -- A Computer Vision Perspective

Lei Wan,Jianxin Zhao,Andreas Wiedholz,Manuel Bied,Mateus Martinez de Lucena,Abhishek Dinkar Jagtap,Andreas Festag,Antônio Augusto Fröhlich,Hannan Ejaz Keen,Alexey Vinel

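Task: Systematically review collaborative perception (CP) for autonomous vehicles and its challenges.

Motivation: Current single-vehicle perception systems suffer from visual occlusions and limited long-range detection; CP offers a solution through V2V and V2I communication, but a systematic literature review that reduces subjective bias is still lacking.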

Details

Method: Follows the PRISMA 2020 guidelines and analyzes 106 peer-reviewed articles, comparing them by modality, collaboration scheme, and key perception task. Result: Compares how different methods handle pose errors, temporal latency, communication constraints, and related issues, and identifies a misalignment between current evaluation methods and CP's objectives. Conclusion: The review examines challenges, opportunities, and risks in depth and serves as a reference for future collaborative perception research. Abstract: The effectiveness of autonomous vehicles relies on reliable perception capabilities. Despite significant advancements in artificial intelligence and sensor fusion technologies, current single-vehicle perception systems continue to encounter limitations, notably visual occlusions and limited long-range detection capabilities. Collaborative Perception (CP), enabled by Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication, has emerged as a promising solution to mitigate these issues and enhance the reliability of autonomous systems. Beyond advancements in communication, the computer vision community is increasingly focusing on improving vehicular perception through collaborative approaches. However, a systematic literature review that thoroughly examines existing work and reduces subjective bias is still lacking. Such a systematic approach helps identify research gaps, recognize common trends across studies, and inform future research directions. In response, this study follows the PRISMA 2020 guidelines and includes 106 peer-reviewed articles. These publications are analyzed based on modalities, collaboration schemes, and key perception tasks. Through a comparative analysis, this review illustrates how different methods address practical issues such as pose errors, temporal latency, communication constraints, domain shifts, heterogeneity, and adversarial attacks. Furthermore, it critically examines evaluation methodologies, highlighting a misalignment between current metrics and CP's fundamental objectives. By delving into all relevant topics in depth, this review offers valuable insights into challenges, opportunities, and risks, serving as a reference for advancing research in vehicular collaborative perception.

Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs

Ling Hu,Yuemei Xu,Xiaoyang Gu,Letao Han

Task: Propose ValueExploration, a new framework for exploring, at the neuron level, the behavior-driven mechanisms of national social values in large language models (LLMs).

Motivation: Despite their strong performance, LLMs can exhibit biases and harmful behaviors driven by encoded values, so the underlying value mechanisms urgently need to be understood. Current research mainly evaluates these values through external responses, which lacks interpretability and fails to assess social values in real-world contexts.

Details

Method: The ValueExploration framework builds the bilingual C-voice benchmark, identifies and locates value-encoding neurons by activation difference, and analyzes how model behavior changes when those neurons are deactivated. Result: Extensive experiments on four representative LLMs validate the framework. Conclusion: The work uncovers the internal mechanism by which values influence LLM decision-making; the benchmark and code are to be released. Abstract: Despite the impressive performance of large language models (LLMs), they can present unintended biases and harmful behaviors driven by encoded values, emphasizing the urgent need to understand the value mechanisms behind them. However, current research primarily evaluates these values through external responses with a focus on AI safety, lacking interpretability and failing to assess social values in real-world contexts. In this paper, we propose a novel framework called ValueExploration, which aims to explore the behavior-driven mechanisms of National Social Values within LLMs at the neuron level. As a case study, we focus on Chinese Social Values and first construct C-voice, a large-scale bilingual benchmark for identifying and evaluating Chinese Social Values in LLMs. By leveraging C-voice, we then identify and locate the neurons responsible for encoding these values according to activation difference. Finally, by deactivating these neurons, we analyze shifts in model behavior, uncovering the internal mechanism by which values influence LLM decision-making. Extensive experiments on four representative LLMs validate the efficacy of our framework. The benchmark and code will be available.
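
The two steps named in the abstract above (locating neurons by activation difference, then deactivating them) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the use of the absolute difference of mean activations as the ranking score, and zeroing as the deactivation rule. The paper's actual scoring and ablation procedure may differ.

```python
from statistics import mean


def top_neurons_by_activation_difference(acts_value, acts_neutral, k):
    """Rank neurons by |mean activation on value prompts - mean on neutral prompts|.

    acts_value / acts_neutral: lists of per-prompt activation vectors
    (hypothetical inputs; one float per neuron).
    """
    n_neurons = len(acts_value[0])
    scored = []
    for j in range(n_neurons):
        mean_value = mean(row[j] for row in acts_value)
        mean_neutral = mean(row[j] for row in acts_neutral)
        scored.append((abs(mean_value - mean_neutral), j))
    # Largest difference first; ties broken by neuron index for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scored[:k]]


def deactivate(activations, neuron_ids):
    """Return a copy of one activation vector with the selected neurons set to zero."""
    ids = set(neuron_ids)
    return [0.0 if j in ids else a for j, a in enumerate(activations)]


# Toy data: neuron 0 fires strongly only on the value-laden prompts.
acts_value = [[0.9, 0.1, 0.5], [1.1, 0.2, 0.5]]
acts_neutral = [[0.1, 0.1, 0.4], [0.1, 0.2, 0.6]]
selected = top_neurons_by_activation_difference(acts_value, acts_neutral, k=1)
ablated = deactivate([1.0, 2.0, 3.0], selected)
```

In a real model the activation vectors would come from hooks on a chosen layer, and the behavioral shift would be measured by re-running the benchmark with the selected neurons zeroed.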

M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models

Yanshu Li,Hongyang He,Yi Cao,Qisen Cheng,Xiang Fu,Ruixiang Tang

Task: Propose M2IV, a method that replaces explicit demonstrations with learnable in-context vectors to improve multimodal in-context learning in large vision-language models (LVLMs).

Motivation: Multimodal in-context learning (ICL) in LVLMs is hindered by token-intensive inputs and the high complexity of cross-modal few-shot learning, which limits the expressive power of representation methods.

Details

Method: M2IV combines the complementary strengths of multi-head attention (MHA) and multi-layer perceptrons (MLP), achieving cross-modal fidelity and fine-grained semantic distillation through training. Result: Across seven benchmarks and three LVLMs, M2IV improves average accuracy by 3.74% over conventional ICL and offers substantial efficiency advantages. Conclusion: With learnable in-context vectors and the VLibrary repository, M2IV improves LVLM performance and flexibility across a range of tasks. Abstract: Multimodal in-context learning (ICL) is a vital capability for Large Vision-Language Models (LVLMs), allowing task adaptation via contextual prompts without parameter retraining. However, its application is hindered by the token-intensive nature of inputs and the high complexity of cross-modal few-shot learning, which limits the expressive power of representation methods. To tackle these challenges, we propose M2IV, a method that substitutes explicit demonstrations with learnable In-context Vectors directly integrated into LVLMs. By exploiting the complementary strengths of multi-head attention (MHA) and multi-layer perceptrons (MLP), M2IV achieves robust cross-modal fidelity and fine-grained semantic distillation through training. This significantly enhances performance across diverse LVLMs and tasks and scales efficiently to many-shot scenarios, bypassing the context window limitations. We also introduce VLibrary, a repository for storing and retrieving M2IV, enabling flexible LVLM steering for tasks like cross-modal alignment, customized generation and safety improvement. Experiments across seven benchmarks and three LVLMs show that M2IV surpasses Vanilla ICL and prior representation engineering approaches, with an average accuracy gain of 3.74% over ICL with the same shot count, alongside substantial efficiency advantages.

Surveying Professional Writers on AI: Limitations, Expectations, and Fears

Anastasiia Ivanova,Natalia Fedorova,Sergey Tilga,Ekaterina Artemova

Task: Study how large language models (LLMs) are used in professional writing and what impact they have.

Motivation: Examine underexplored aspects of LLM adoption, including multilingual support, ethics, and the long-term impact on writers' voice and creativity.

Details

Method: Data were collected from professional writers through a questionnaire (N = 301) and an interactive survey (N = 36), covering LLM use across 25+ languages, ethical concerns, and user expectations. Result: The surveys yield insights into the importance of LLMs for non-English speakers, the degree of misinformation, domain and style adaptation, and usability and key features. Conclusion: The findings can guide further LLM development, benefiting both writers and a broader user base. Abstract: The rapid development of AI-driven tools, particularly large language models (LLMs), is reshaping professional writing. Still, key aspects of their adoption such as language support, ethics, and long-term impact on writers' voice and creativity remain underexplored. In this work, we conducted a questionnaire (N = 301) and an interactive survey (N = 36) targeting professional writers regularly using AI. We examined LLM-assisted writing practices across 25+ languages, ethical concerns, and user expectations. The findings of the survey yield important insights into: LLM adoption for non-English speakers; the degree of misinformation, domain and style adaptation; usability and key features of LLMs. These insights can guide further development, benefiting both writers and a broader user base.

Yimu Wang,Mozhgan Nasr Azadani,Sean Sedwards,Krzysztof Czarnecki

Task: Propose LEO-MINI, a multimodal large language model that reduces the number of visual tokens while improving visual reasoning.

Motivation: Existing methods sacrifice visual reasoning ability when reducing the number of visual tokens, so a method is needed that reduces tokens efficiently and also improves reasoning.

Details

Method: LEO-MINI combines CoTR, a novel token reduction module, with MMoE, a mixture of multi-modal experts module, relying on token similarity and a dynamic routing mechanism. Result: LEO-MINI is reported to show higher efficiency and performance on a variety of vision-language tasks. Conclusion: Through its token reduction and mixture-of-experts modules, LEO-MINI improves the efficiency and reasoning ability of multimodal large language models. Abstract: Redundancy of visual tokens in multi-modal large language models (MLLMs) significantly reduces their computational efficiency. Recent approaches, such as resamplers and summarizers, have sought to reduce the number of visual tokens, but at the cost of visual reasoning ability. To address this, we propose LEO-MINI, a novel MLLM that significantly reduces the number of visual tokens and simultaneously boosts visual reasoning capabilities. For efficiency, LEO-MINI incorporates CoTR, a novel token reduction module to consolidate a large number of visual tokens into a smaller set of tokens, using the similarity between visual tokens, text tokens, and a compact learnable query. For effectiveness, to scale up the model's ability with minimal computational overhead, LEO-MINI employs MMoE, a novel mixture of multi-modal experts module. MMoE employs a set of LoRA experts with a novel router to switch between them based on the input text and visual tokens instead of only using the input hidden state. MMoE also includes a general LoRA expert that is always activated to learn general knowledge for LLM reasoning. For extracting richer visual features, MMoE employs a set of vision experts trained on diverse domain-specific data. To demonstrate LEO-MINI's improved efficiency and performance, we evaluate it against existing efficient MLLMs on various benchmark vision-language tasks.

Batch Aggregation: An Approach to Enhance Text Classification with Correlated Augmented Data

Charco Hui,Yalu Wen

Task: Propose 'Batch Aggregation' (BAGG), a new method for making text augmentation more effective in text classification.

Motivation: Traditional text classification methods ignore the relationship between augmented texts and treat them as independent samples, which may introduce classification errors.

Details

Method: An additional layer aggregates the results from correlated augmented texts, explicitly modeling the dependence among them. Result: On benchmark datasets from multiple domains, BAGG improves classification accuracy, most noticeably on domain-specific datasets, where accuracy improves by up to 10-29%. Conclusion: BAGG addresses limitations of traditional techniques and gives more robust results when training data is limited. Abstract: Natural language processing models often face challenges due to limited labeled data, especially in domain-specific areas, e.g., clinical trials. To overcome this, text augmentation techniques are commonly used to increase sample size by transforming the original input data into artificial ones with the label preserved. However, traditional text classification methods ignore the relationship between augmented texts and treat them as independent samples, which may introduce classification error. Therefore, we propose a novel approach called 'Batch Aggregation' (BAGG) which explicitly models the dependence of text inputs generated through augmentation by incorporating an additional layer that aggregates results from correlated texts. Through studying multiple benchmark data sets across different domains, we found that BAGG can improve classification accuracy. We also found that the increase of performance with BAGG is more obvious in domain-specific data sets, with accuracy improvements of up to 10-29%. Through the analysis of benchmark data, the proposed method addresses limitations of traditional techniques and improves robustness in text classification tasks. Our results demonstrate that BAGG offers more robust results and outperforms traditional approaches when training data is limited.
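
The idea of aggregating results from correlated augmented texts can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the aggregation layer, so mean pooling of class probabilities is swapped in here, and the names `aggregate_by_source`, `predict`, and the `source_ids` grouping key are hypothetical.

```python
from collections import defaultdict


def aggregate_by_source(probs, source_ids):
    """Average class probabilities over all texts derived from the same original.

    probs: one probability vector per (augmented) text.
    source_ids: the id of the original text each vector was derived from.
    Mean pooling is an assumption of this sketch, not the paper's stated layer.
    """
    grouped = defaultdict(list)
    for p, sid in zip(probs, source_ids):
        grouped[sid].append(p)
    aggregated = {}
    for sid, rows in grouped.items():
        n_classes = len(rows[0])
        aggregated[sid] = [sum(r[c] for r in rows) / len(rows) for c in range(n_classes)]
    return aggregated


def predict(aggregated):
    """Pick the most probable class for each original text."""
    return {sid: max(range(len(p)), key=p.__getitem__) for sid, p in aggregated.items()}


# Three variants of text "a" disagree individually; the aggregate decides once.
probs = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7], [0.9, 0.1]]
source_ids = ["a", "a", "a", "b"]
labels = predict(aggregate_by_source(probs, source_ids))
```

Treating the three variants of "a" as independent samples would yield one prediction for class 0 and two for class 1; pooling them gives a single decision per original text.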

3DM-WeConvene: Learned Image Compression with 3D Multi-Level Wavelet-Domain Convolution and Entropy Model

Haisheng Fu,Jie Liang,Feng Liang,Zhenman Fang,Guohe Zhang,Jingning Han

Task: Propose a framework that integrates a 3D multi-level discrete wavelet transform (DWT) to improve CNN-based learned image compression (LIC).

Motivation: Most current LIC methods operate mainly in the spatial domain and lack mechanisms for reducing frequency-domain correlations, which limits performance.

Details

Method: The framework introduces a 3D multi-level wavelet-domain convolution layer (3DM-WeConv) and a 3D wavelet-domain channel-wise autoregressive entropy model (3DWeChARM), trained with a two-step strategy. Result: Relative to H.266/VVC, the method reports BD-Rate values of -12.24%, -15.51%, and -12.97% on the Kodak, Tecnick 100, and CLIC test sets, respectively. Conclusion: The framework improves the rate-distortion performance and computational efficiency of LIC, with larger gains on high-resolution images. Abstract: Learned image compression (LIC) has recently made significant progress, surpassing traditional methods. However, most LIC approaches operate mainly in the spatial domain and lack mechanisms for reducing frequency-domain correlations. To address this, we propose a novel framework that integrates low-complexity 3D multi-level Discrete Wavelet Transform (DWT) into convolutional layers and entropy coding, reducing both spatial and channel correlations to improve frequency selectivity and rate-distortion (R-D) performance. Our proposed 3D multi-level wavelet-domain convolution (3DM-WeConv) layer first applies 3D multi-level DWT (e.g., 5/3 and 9/7 wavelets from JPEG 2000) to transform data into the wavelet domain. Then, different-sized convolutions are applied to different frequency subbands, followed by inverse 3D DWT to restore the spatial domain. The 3DM-WeConv layer can be flexibly used within existing CNN-based LIC models. We also introduce a 3D wavelet-domain channel-wise autoregressive entropy model (3DWeChARM), which performs slice-based entropy coding in the 3D DWT domain. Low-frequency (LF) slices are encoded first to provide priors for high-frequency (HF) slices. A two-step training strategy is adopted: first balancing LF and HF rates, then fine-tuning with separate weights. Extensive experiments demonstrate that our framework consistently outperforms state-of-the-art CNN-based LIC methods in R-D performance and computational complexity, with larger gains for high-resolution images. On the Kodak, Tecnick 100, and CLIC test sets, our method achieves BD-Rate reductions of -12.24%, -15.51%, and -12.97%, respectively, compared to H.266/VVC.
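
The abstract mentions the 5/3 wavelet from JPEG 2000 as one of the transforms applied before convolution. As a rough illustration of what such a transform does, the sketch below implements one level of the reversible integer 5/3 lifting scheme on a 1D signal. It is deliberately much simpler than the paper's layer: the paper applies a 3D multi-level DWT to feature tensors, whereas this is 1D, single-level, restricted to even-length integer lists, and uses mirrored boundary samples as an assumption of the sketch. The function names are hypothetical.

```python
def dwt53_forward(x):
    """One level of the reversible 5/3 lifting transform (1D, even-length integers).

    Returns (low, high): the low-frequency and high-frequency subbands.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0, "sketch handles even-length signals only"
    half = n // 2
    even, odd = x[0::2], x[1::2]
    # Predict step: each odd sample minus the floor-average of its even neighbours.
    high = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]  # mirror at the right edge
        high.append(odd[i] - (even[i] + right) // 2)
    # Update step: each even sample plus a rounded quarter of the neighbouring details.
    low = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[0]  # mirror at the left edge
        low.append(even[i] + (left + high[i] + 2) // 4)
    return low, high


def dwt53_inverse(low, high):
    """Undo dwt53_forward exactly by reversing the two lifting steps."""
    half = len(low)
    even = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[0]
        even.append(low[i] - (left + high[i] + 2) // 4)
    x = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        x.extend([even[i], high[i] + (even[i] + right) // 2])
    return x


signal = [10, 12, 14, 13, 9, 7, 8, 8]
low, high = dwt53_forward(signal)
restored = dwt53_inverse(low, high)
flat_low, flat_high = dwt53_forward([5] * 8)
```

Because each lifting step is undone with the same integer arithmetic, reconstruction is exact. A constant signal produces an all-zero high-frequency subband, which is the kind of energy compaction that makes per-subband processing attractive.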

Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models

Jiawei Lian,Jianhong Pan,Lefan Wang,Yi Wang,Shaohui Mei,Lap-Pui Chau

Task: Analyze the ethical vulnerabilities that persist in large language models (LLMs) after pretraining and instruction tuning, and how the models behave under adversarial inducement.

Motivation: Although instruction tuning and preference learning align LLMs with human values, harmful knowledge embedded during pretraining persists as "dark patterns" and causes alignment safeguards to fail.

Details

Method: A theoretical analysis proves that current alignment methods produce only local "safety regions", and semantic coherence inducement under distributional shift is used to validate the models' vulnerability. Result: The attack success rate reaches 100% on 19 of 23 state-of-the-art aligned LLMs, revealing a widespread vulnerability. Conclusion: Current alignment methods cannot fully eliminate harmful content in pretrained knowledge; more global safety strategies are needed. Abstract: Large language models (LLMs) are foundational explorations toward artificial general intelligence, yet their alignment with human values via instruction tuning and preference learning achieves only superficial compliance. Here, we demonstrate that harmful knowledge embedded during pretraining persists as indelible "dark patterns" in LLMs' parametric memory, evading alignment safeguards and resurfacing under adversarial inducement at distributional shifts. In this study, we first theoretically analyze the intrinsic ethical vulnerability of aligned LLMs by proving that current alignment methods yield only local "safety regions" in the knowledge manifold. In contrast, pretrained knowledge remains globally connected to harmful concepts via high-likelihood adversarial trajectories. Building on this theoretical insight, we empirically validate our findings by employing semantic coherence inducement under distributional shifts, a method that systematically bypasses alignment constraints through optimized adversarial prompts. This combined theoretical and empirical approach achieves a 100% attack success rate across 19 out of 23 state-of-the-art aligned LLMs, including DeepSeek-R1 and LLaMA-3, revealing their universal vulnerabilities.

Dual Consistent Constraint via Disentangled Consistency and Complementarity for Multi-view Clustering

Bo Li,Jing Yun

Task: Propose a novel multi-view clustering framework that uses a disentangled variational autoencoder to separate shared and private information, so that both consistency and complementarity information can be exploited.

Motivation: Existing methods focus only on consistency in representation learning and neglect the contribution of complementary information across views, which limits multi-view representation learning.

Details

Method: A disentangled variational autoencoder is used; mutual information across views is maximized through contrastive learning, and consistency inference constraints explicitly exploit complementary information. Result: Experiments show that the method outperforms baseline methods and improves the quality of data representations. Conclusion: The framework is presented as possibly the first to introduce a dual consistency constraint into a unified multi-view clustering theoretical framework, and it extends to complex multi-view scenarios. Abstract: Multi-view clustering can explore common semantics from multiple views and has received increasing attention in recent years. However, current methods focus on learning consistency in representation, neglecting the contribution of each view's complementarity aspect in representation learning. This limitation poses a significant challenge in multi-view representation learning. This paper proposes a novel multi-view clustering framework that introduces a disentangled variational autoencoder that separates multi-view data into shared and private information, i.e., consistency and complementarity information. We first learn informative and consistent representations by maximizing mutual information across different views through contrastive learning. This process will ignore complementary information. Then, we employ consistency inference constraints to explicitly utilize complementary information when attempting to seek the consistency of shared information across all views. Specifically, we perform a within-reconstruction using the private and shared information of each view and a cross-reconstruction using the shared information of all views. The dual consistency constraints are not only effective in improving the representation quality of data but also easy to extend to other scenarios, especially in complex multi-view scenes. This could be the first attempt to employ a dual consistency constraint in a unified MVC theoretical framework. During the training procedure, the consistency and complementarity features are jointly optimized. Extensive experiments show that our method outperforms baseline methods.

Not All Data Are Unlearned Equally

Aravind Krishnan,Siva Reddy,Marius Mosbach

Task: Study the effectiveness of unlearning in large language models (LLMs), in particular how the frequency of a piece of knowledge in the pre-training data affects unlearning success.

Motivation: Existing methods treat all data points to be unlearned equally, but in practice different pieces of knowledge may differ in how hard they are to unlearn, especially privacy-related data.

Details

Method: The study analyzes how the frequency of knowledge in the pre-training data affects unlearning and compares probability-based and generation-based evaluations. Result: More frequent knowledge is harder to unlearn, and the disagreement between evaluation methods worsens as models grow larger. Conclusion: Better evaluation practices are needed, along with new unlearning methods that take the training data into account. Abstract: Machine unlearning is concerned with the task of removing knowledge learned from particular data points from a trained model. In the context of large language models (LLMs), unlearning has recently received increased attention, particularly for removing knowledge about named entities from models for privacy purposes. While various approaches have been proposed to address the unlearning problem, most existing approaches treat all data points to be unlearned equally, i.e., unlearning that Montreal is a city in Canada is treated exactly the same as unlearning the phone number of the first author of this paper. In this work, we show that this "all data is equal" assumption does not hold for LLM unlearning. We study how the success of unlearning depends on the frequency of the knowledge we want to unlearn in the pre-training data of a model and find that frequency strongly affects unlearning, i.e., more frequent knowledge is harder to unlearn. Additionally, we uncover a misalignment between probability-based and generation-based evaluations of unlearning and show that this problem worsens as models become larger. Overall, our experiments highlight the need for better evaluation practices and novel methods for LLM unlearning that take the training data of models into account.

DeclutterNeRF: Generative-Free 3D Scene Recovery for Occlusion Removal

Wanzhou Liu,Zhexiao Xiong,Xinyu Li,Nathan Jacobs

Task: Propose DeclutterNeRF, an occlusion removal method that needs no generative priors, and build DeclutterSet, a dataset of scenes with complex occlusions.

Motivation: Existing occlusion removal methods rely on generative priors, which introduce new artifacts and blurriness, and existing evaluation datasets lack realistic complexity and viewpoint variation.

Details

Method: DeclutterNeRF achieves high-quality reconstruction through joint multi-view optimization of learnable camera parameters, occlusion annealing regularization, and an explainable stochastic structural similarity loss. Result: DeclutterNeRF significantly outperforms existing methods on DeclutterSet and establishes a baseline for future research. Conclusion: DeclutterNeRF and DeclutterSet address the limitations of existing occlusion removal methods and yield high-quality, artifact-free reconstructions. Abstract: Recent novel view synthesis (NVS) techniques, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have greatly advanced 3D scene reconstruction with high-quality rendering and realistic detail recovery. Effectively removing occlusions while preserving scene details can further enhance the robustness and applicability of these techniques. However, existing approaches for object and occlusion removal predominantly rely on generative priors, which, despite filling the resulting holes, introduce new artifacts and blurriness. Moreover, existing benchmark datasets for evaluating occlusion removal methods lack realistic complexity and viewpoint variations. To address these issues, we introduce DeclutterSet, a novel dataset featuring diverse scenes with pronounced occlusions distributed across foreground, midground, and background, exhibiting substantial relative motion across viewpoints. We further introduce DeclutterNeRF, an occlusion removal method free from generative priors. DeclutterNeRF combines joint multi-view optimization of learnable camera parameters, occlusion annealing regularization, and an explainable stochastic structural similarity loss, ensuring high-quality, artifact-free reconstructions from incomplete images. Experiments demonstrate that DeclutterNeRF significantly outperforms state-of-the-art methods on our proposed DeclutterSet, establishing a strong baseline for future research.

On the Performance of an Explainable Language Model on PubMedQA

Venkat Srinivasan,Vishaal Jatav,Anushka Chandrababu,Geetika Sharma

Task: Develop Gyan, an explainable language model, and achieve high performance with it on a medical question-answering dataset.

Motivation: Existing large language models (LLMs) perform well at medical knowledge retrieval and question answering, but they are not interpretable, hallucinate, are difficult to maintain, and require large compute resources.

Details

Method: Gyan is a compositional language model built on an alternative architecture that decouples the model from knowledge, making it transparent and trustable without requiring significant compute resources. Result: Gyan-4.3 reaches 87.1% accuracy on PubMedQA, ahead of MedPrompt based on GPT-4 (82%) and Med-PaLM 2 (81.8%). Conclusion: Gyan is presented as an efficient, explainable language model that transfers easily across domains and offers a new approach to medical question answering. Abstract: Large language models (LLMs) have shown significant abilities in retrieving medical knowledge, reasoning over it and answering medical questions comparably to physicians. However, these models are not interpretable, hallucinate, are difficult to maintain and require enormous compute resources for training and inference. In this paper, we report results from Gyan, an explainable language model based on an alternative architecture, on the PubMedQA data set. The Gyan LLM is a compositional language model and the model is decoupled from knowledge. Gyan is trustable, transparent, does not hallucinate and does not require significant training or compute resources. Gyan is easily transferable across domains. Gyan-4.3 achieves SOTA results on PubMedQA with 87.1% accuracy compared to 82% by MedPrompt based on GPT-4 and 81.8% by Med-PaLM 2 (Google and DeepMind). We will be reporting results for other medical data sets (MedQA, MedMCQA, MMLU - Medicine) in the future.

Bridging Knowledge Gap Between Image Inpainting and Large-Area Visible Watermark Removal

Yicheng Leng,Chaowei Fang,Junye Chen,Yixiang Fang,Sheng Li,Guanbin Li

Task: Propose a new feature adapting framework for visible watermark removal, covering both watermark cleaning and background content restoration.

Motivation: Existing deep neural network models handle large-area watermarks poorly and depend too heavily on high-quality watermark mask prediction.

Details

Method: The approach leverages the representation modeling capacity of a pre-trained image inpainting model; a dual-branch system captures and embeds features of the residual background content, which gated feature fusion modules merge into the inpainting backbone. Result: Experiments on a synthesized dataset and a real-world dataset show that the method significantly outperforms existing state-of-the-art methods. Conclusion: The method reduces the dependence on high-quality watermark masks and improves watermark removal quality. Abstract: Visible watermark removal, which involves watermark cleaning and background content restoration, is pivotal to evaluating the resilience of watermarks. Existing deep neural network (DNN)-based models still struggle with large-area watermarks and are overly dependent on the quality of watermark mask prediction. To overcome these challenges, we introduce a novel feature adapting framework that leverages the representation modeling capacity of a pre-trained image inpainting model. Our approach bridges the knowledge gap between image inpainting and watermark removal by fusing information of the residual background content beneath watermarks into the inpainting backbone model. We establish a dual-branch system to capture and embed features from the residual background content, which are merged into intermediate features of the inpainting backbone model via gated feature fusion modules. Moreover, to relieve the dependence on high-quality watermark masks, we introduce a new training paradigm that uses coarse watermark masks to guide the inference process. This contributes to a visible image removal model which is insensitive to the quality of the watermark mask during testing. Extensive experiments on both a large-scale synthesized dataset and a real-world dataset demonstrate that our approach significantly outperforms existing state-of-the-art methods. The source code is available in the supplementary materials.

The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning

Tianshi Zheng,Yixiang Chen,Chengxi Li,Chunyang Li,Qing Zong,Haochen Shi,Baixuan Xu,Yangqiu Song,Ginny Y. Wong,Simon See

Task: Study how Chain-of-Thought (CoT) prompting performs in large language models (LLMs) and where its limitations lie.

Motivation: CoT and its reasoning variants are found to underperform in pattern-based in-context learning (ICL), which challenges the assumption that CoT is universally effective.

Details

Method: Experiments with 16 state-of-the-art LLMs and nine diverse pattern-based ICL datasets test several hypothesized explanations for CoT's underperformance. Result: The analysis reveals an explicit-implicit duality in pattern-based ICL: explicit reasoning fails because LLMs struggle to infer the underlying patterns, while implicit reasoning partially compensates but cannot fully overcome the limitation. Conclusion: CoT is not universally effective, and the findings offer a new perspective for designing more effective LLM reasoning methods. Abstract: Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models (LLMs) through the generation of explicit explanatory rationales. However, our study reveals a surprising contradiction to this prevailing perspective. Through extensive experiments involving 16 state-of-the-art LLMs and nine diverse pattern-based in-context learning (ICL) datasets, we demonstrate that CoT and its reasoning variants consistently underperform direct answering across varying model scales and benchmark complexities. To systematically investigate this unexpected phenomenon, we designed extensive experiments to validate several hypothetical explanations. Our analysis uncovers a fundamental explicit-implicit duality driving CoT's performance in pattern-based ICL: while explicit reasoning falters due to LLMs' struggles to infer underlying patterns from demonstrations, implicit reasoning, although disrupted by the increased contextual distance of CoT rationales, often compensates, delivering correct answers despite flawed rationales. This duality explains CoT's relative underperformance, as noise from weak explicit inference undermines the process, even as implicit mechanisms partially salvage outcomes. Notably, even long-CoT reasoning models, which excel in abstract and symbolic reasoning, fail to fully overcome these limitations despite higher computational costs. Our findings challenge existing assumptions regarding the universal efficacy of CoT, yielding novel insights into its limitations and guiding future research toward more nuanced and effective reasoning methodologies for LLMs.

DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation

Bo-Wen Yin,Jiao-Long Cao,Ming-Ming Cheng,Qibin Hou

Task: Explore a new way to learn RGBD feature representations and present DFormerv2, which uses depth maps as geometry priors instead of encoding them with neural networks.

Motivation: Depth maps provide 3D geometry information, but existing methods usually encode them alongside RGB images. Is it really necessary to encode depth explicitly in the same way as RGB images?

Details

Method: DFormerv2 extracts geometry clues from depth and spatial distances among image patch tokens and uses them as geometry priors to allocate attention weights in self-attention. Result: DFormerv2 performs strongly on multiple RGBD semantic segmentation benchmarks. Conclusion: By treating depth maps as geometry priors, DFormerv2 offers a more efficient way to learn RGBD feature representations. Abstract: Recent advances in scene understanding benefit a lot from depth maps because of the 3D geometry information, especially in complex conditions (e.g., low light and overexposure). Existing approaches encode depth maps along with RGB images and perform feature fusion between them to enable more robust predictions. Taking into account that depth can be regarded as a geometry supplement for RGB images, a straightforward question arises: Do we really need to explicitly encode depth information with neural networks as done for RGB images? Based on this insight, in this paper, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. Our goal is to extract the geometry clues from the depth and spatial distances among all the image patch tokens, which will then be used as geometry priors to allocate attention weights in self-attention. Extensive experiments demonstrate that DFormerv2 exhibits exceptional performance in various RGBD semantic segmentation benchmarks. Code is available at: https://github.com/VCIP-RGBD/DFormer.
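
To make the phrase "geometry priors to allocate attention weights" concrete, here is a minimal sketch of one way depth and spatial distances could modulate self-attention. The abstract does not give the formula, so the specifics below are assumptions: the prior is an exponential decay over Manhattan grid distance plus a weighted depth gap, and it is multiplied into the softmax weights before renormalization. The names `geometry_prior`, `attention_with_prior`, `decay`, and `depth_weight` are hypothetical.

```python
import math


def geometry_prior(depths, positions, decay=0.9, depth_weight=1.0):
    """Pairwise prior in (0, 1]: tokens that are close in the image grid and
    at similar depth get a prior near 1; distant tokens are down-weighted.
    """
    n = len(depths)
    prior = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            spatial = (abs(positions[i][0] - positions[j][0])
                       + abs(positions[i][1] - positions[j][1]))
            depth_gap = abs(depths[i] - depths[j])
            prior[i][j] = decay ** (spatial + depth_weight * depth_gap)
    return prior


def attention_with_prior(scores, prior):
    """Row-wise softmax of the scores, reweighted by the prior and renormalized."""
    weights = []
    for row_scores, row_prior in zip(scores, prior):
        peak = max(row_scores)  # subtract the max for numerical stability
        row = [math.exp(s - peak) * p for s, p in zip(row_scores, row_prior)]
        total = sum(row)
        weights.append([w / total for w in row])
    return weights


# Three patch tokens in a row; the third lies on a much more distant surface.
positions = [(0, 0), (0, 1), (0, 2)]
depths = [1.0, 1.0, 5.0]
scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # equal content similarity
attn = attention_with_prior(scores, geometry_prior(depths, positions))
```

With equal content scores, the middle token attends more to its same-depth neighbour than to the equally near but much deeper one, which is the intuition behind using depth as a prior instead of a second encoded stream.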

State Tuning: State-based Test-Time Scaling on RWKV-7

Liu Xiao,Li Zhiyuan,Lin Yueyu

Task: Propose a test-time scaling method based on state tuning for the RNN-based RWKV-7 model.

Motivation: Exploit the particular strengths of RWKV-7 to improve performance without changing the pre-trained weights.

Details

Method: The approach develops an observer framework, uses a kernel method to dynamically upscale the state size, and integrates Decorrelated Backpropagation (DBP) to optimize the state matrix. Result: A smaller model outperforms larger models on the target task while retaining the efficiency of the RWKV-7 architecture. Conclusion: State tuning is an effective strategy for improving model performance in resource-constrained settings. Abstract: Test-time scaling has emerged as a prominent research direction in machine learning, enabling models to enhance their expressive capabilities during inference. Transformers, renowned for striking a delicate balance between efficiency and expressiveness, have benefited from test-time scaling techniques that leverage an expanding key-value (KV) cache to significantly improve performance. In this paper, we introduce a novel state-based approach to test-time scaling, which we term state tuning, tailored to the RNN-based RWKV-7 model. By exploiting the unique strengths of RWKV-7, our method achieves state-of-the-art performance on the target task without altering the model's pre-trained weights. Our approach centers on three key innovations. First, we develop an observer framework that allows a smaller model to replicate and learn the state dynamics of the RWKV-7 model. Second, we employ a kernel method to dynamically upscale the state size, enhancing the model's capacity to capture intricate patterns. Third, we integrate Decorrelated Backpropagation (DBP) to optimize the upscaled state matrix, thereby improving convergence and expressivity. By tuning only the state matrix, we demonstrate that a smaller model can outperform larger models on the given task. This method preserves the efficiency of the original RWKV-7 architecture while harnessing the power of test-time scaling to deliver superior results. Our findings underscore the potential of state tuning as an effective strategy for advancing model performance in resource-constrained settings. Our code is available at https://github.com/TorchRWKV/flash-linear-attention.

SapiensID: Foundation for Human Recognition

Minchul Kim,Dingqiang Ye,Yiyang Su,Feng Liu,Xiaoming Liu

Task: Propose SapiensID, a unified human recognition model, to address the limited effectiveness of existing systems that rely on separate models across diverse scenarios.

Motivation: Existing face and body analysis systems usually rely on separate, specialized models, which limits their effectiveness in real-world scenarios where pose, visibility, and context vary widely.

Details

Method: SapiensID introduces Retina Patch (RP), a dynamic patch generation scheme, a masked recognition model (MRM), and a Semantic Attention Head (SAH), and is trained on the large-scale WebBody4M dataset. Result: SapiensID achieves state-of-the-art results on various body ReID benchmarks, outperforms specialized models, and sets a strong baseline on the new Cross Pose-Scale ReID challenge. Conclusion: SapiensID offers a strong solution for unified recognition under complex real-world conditions. Abstract: Existing human recognition systems often rely on separate, specialized models for face and body analysis, limiting their effectiveness in real-world scenarios where pose, visibility, and context vary widely. This paper introduces SapiensID, a unified model that bridges this gap, achieving robust performance across diverse settings. SapiensID introduces (i) Retina Patch (RP), a dynamic patch generation scheme that adapts to subject scale and ensures consistent tokenization of regions of interest, (ii) a masked recognition model (MRM) that learns from variable token length, and (iii) Semantic Attention Head (SAH), a module that learns pose-invariant representations by pooling features around key body parts. To facilitate training, we introduce WebBody4M, a large-scale dataset capturing diverse poses and scale variations. Extensive experiments demonstrate that SapiensID achieves state-of-the-art results on various body ReID benchmarks, outperforming specialized models in both short-term and long-term scenarios while remaining competitive with dedicated face recognition systems. Furthermore, SapiensID establishes a strong baseline for the newly introduced challenge of Cross Pose-Scale ReID, demonstrating its ability to generalize to complex, real-world conditions.

AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments

Saeid Ario Vaghefi,Aymane Hachcham,Veronica Grasso,Jiska Manicus,Nakiete Msemo,Chiara Colesanti Senni,Markus Leippold

Task: Develop an LLM-based AI system for tracking and classifying climate adaptation investments, particularly in Early Warning Systems (EWS).

Motivation: Multilateral development banks (MDBs) and funds lack standardized financial reporting for climate adaptation investments, so tracking them is complex and requires expert knowledge.

Details

Method: An LLM-based agentic AI system combines contextual retrieval, fine-tuning, and multi-step reasoning; the evaluated methods include zero-shot and few-shot learning, fine-tuned transformer-based classifiers, chain-of-thought prompting, and agent-based retrieval-augmented generation (RAG). Result: The agent-based RAG approach significantly outperforms the other methods, reaching 87% accuracy, 89% precision, and 83% recall. Conclusion: Beyond an efficient AI-driven approach to financial tracking, the study contributes a benchmark dataset and an expert-annotated corpus as resources for future research. Abstract: Tracking financial investments in climate adaptation is a complex and expertise-intensive task, particularly for Early Warning Systems (EWS), which lack standardized financial reporting across multilateral development banks (MDBs) and funds. To address this challenge, we introduce an LLM-based agentic AI system that integrates contextual retrieval, fine-tuning, and multi-step reasoning to extract relevant financial data, classify investments, and ensure compliance with funding guidelines. Our study focuses on a real-world application: tracking EWS investments in the Climate Risk and Early Warning Systems (CREWS) Fund. We analyze 25 MDB project documents and evaluate multiple AI-driven classification methods, including zero-shot and few-shot learning, fine-tuned transformer-based classifiers, chain-of-thought (CoT) prompting, and an agent-based retrieval-augmented generation (RAG) approach. Our results show that the agent-based RAG approach significantly outperforms other methods, achieving 87% accuracy, 89% precision, and 83% recall. Additionally, we contribute a benchmark dataset and expert-annotated corpus, providing a valuable resource for future research in AI-driven financial tracking and climate finance transparency.

On the Robustness of GUI Grounding Models Against Image Attacks

Haoren Zhao,Tianyi Chen,Zhen Wang

Task: Systematically evaluate the robustness of state-of-the-art GUI grounding models under three conditions: natural noise, untargeted adversarial attacks, and targeted adversarial attacks.

Motivation: GUI grounding models face robustness challenges from natural noise and adversarial perturbations in real-world scenarios, and their robustness remains underexplored.

Details

Method: Models such as UGround are tested across a range of GUI environments (mobile, desktop, and web interfaces). Result: The experiments show that GUI grounding models are highly sensitive to adversarial perturbations and low-resolution conditions. Conclusion: The study exposes the vulnerabilities of GUI grounding models and provides a benchmark for future work on improving their robustness. Abstract: Graphical User Interface (GUI) grounding models are crucial for enabling intelligent agents to understand and interact with complex visual interfaces. However, these models face significant robustness challenges in real-world scenarios due to natural noise and adversarial perturbations, and their robustness remains underexplored. In this study, we systematically evaluate the robustness of state-of-the-art GUI grounding models, such as UGround, under three conditions: natural noise, untargeted adversarial attacks, and targeted adversarial attacks. Our experiments, conducted across a wide range of GUI environments, including mobile, desktop, and web interfaces, demonstrate that GUI grounding models are highly sensitive to adversarial perturbations and low-resolution conditions. These findings provide valuable insights into the vulnerabilities of GUI grounding models and establish a strong benchmark for future research aimed at enhancing their robustness in practical applications. Our code is available at https://github.com/ZZZhr-1/Robust_GUI_Grounding.

DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation

Xinglin Lyu,Wei Tang,Yuang Li,Xiaofeng Zhao,Ming Zhu,Junhui Li,Yunfei Lu,Min Zhang,Daimeng Wei,Hao Yang,Min Zhang

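Task: Develop an online framework (DoCIA) that improves speech translation (ST) performance by incorporating document-level context.

Motivation: Document-level context is crucial for handling discourse challenges in text-to-text document-level machine translation (MT), but in ST, where automatic speech recognition (ASR) introduces noise, the integration of document-level context remains underexplored.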

Details

Method: DoCIA decomposes the ST pipeline into four stages and uses auxiliary large language model (LLM)-based modules to bring document-level context into the ASR refinement, MT, and MT refinement stages, exploiting document-level information at multiple levels while minimizing computational overhead. A simple yet effective determination mechanism guards against hallucinations caused by excessive refinement. Result: Experiments show that DoCIA significantly outperforms traditional ST baselines across four LLMs on both sentence-level and discourse metrics. Conclusion: By integrating document-level context effectively, DoCIA significantly improves the performance and reliability of speech translation. Abstract: Document-level context is crucial for handling discourse challenges in text-to-text document-level machine translation (MT). Despite the increased discourse challenges introduced by noise from automatic speech recognition (ASR), the integration of document-level context in speech translation (ST) remains insufficiently explored. In this paper, we develop DoCIA, an online framework that enhances ST performance by incorporating document-level context. DoCIA decomposes the ST pipeline into four stages. Document-level context is integrated into the ASR refinement, MT, and MT refinement stages through auxiliary LLM (large language model)-based modules. Furthermore, DoCIA leverages document-level information in a multi-level manner while minimizing computational overhead. Additionally, a simple yet effective determination mechanism is introduced to prevent hallucinations from excessive refinement, ensuring the reliability of the final results. Experimental results show that DoCIA significantly outperforms traditional ST baselines in both sentence and discourse metrics across four LLMs, demonstrating its effectiveness in improving ST performance.

TactileNet: Bridging the Accessibility Gap with AI-Generated Tactile Graphics for Individuals with Vision Impairment

Adnan Khan,Alireza Choubineh,Mai A. Shaaban,Abbas Akkasi,Majid Komeili

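Task: Generate tactile graphics that comply with tactile standards using text-to-image Stable Diffusion models.

Motivation: Traditional methods for producing tactile graphics are labor-intensive and cannot keep up with demand, so an efficient, scalable solution is needed.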

Details

Method: Fine-tune Stable Diffusion models with Low-Rank Adaptation (LoRA) and DreamBooth to generate high-quality tactile graphics. Result: The generated graphics achieve 92.86% adherence to tactile standards and 100% alignment with natural images in posture and features, and the framework demonstrates scalability. Conclusion: The framework significantly speeds up the production of tactile graphics and offers a scalable solution for accessibility needs in education and other fields. Abstract: Tactile graphics are essential for providing access to visual information for the 43 million people globally living with vision loss, as estimated by global prevalence data. However, traditional methods for creating these tactile graphics are labor-intensive and struggle to meet demand. We introduce TactileNet, the first comprehensive dataset and AI-driven framework for generating tactile graphics using text-to-image Stable Diffusion (SD) models. By integrating Low-Rank Adaptation (LoRA) and DreamBooth, our method fine-tunes SD models to produce high-fidelity, guideline-compliant tactile graphics while reducing computational costs. Evaluations involving tactile experts show that generated graphics achieve 92.86% adherence to tactile standards and 100% alignment with natural images in posture and features. Our framework also demonstrates scalability, generating 32,000 images (7,050 filtered for quality) across 66 classes, with prompt editing enabling customizable outputs (e.g., adding/removing details). Our work empowers designers to focus on refinement, significantly accelerating accessibility efforts. It underscores the transformative potential of AI for social good, offering a scalable solution to bridge the accessibility gap in education and beyond.

CARE: Aligning Language Models for Regional Cultural Awareness

Geyang Guo,Tarek Naous,Hiromi Wakaki,Yukiko Nishimura,Yuki Mitsufuji,Alan Ritter,Wei Xu

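Task: Study whether a small amount of human-written, multilingual cultural preference data can improve how language models handle different cultures.

Motivation: Existing language models exhibit a Western-centric bias and struggle to represent diverse cultural knowledge; earlier approaches rely on synthetic data and express cultural knowledge only in English.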

Details

Method: Introduce CARE, a resource of 24.1k responses with human preferences covering 2,580 questions about Chinese and Arab cultures, and evaluate how much it improves language models. Result: CARE improves the cultural alignment of language models without harming general capabilities, and reveals regional disparities in the cultural awareness of language models. Conclusion: A small amount of high-quality multilingual cultural data can markedly improve the cultural performance of language models, and CARE provides a foundation for future research. Abstract: Existing language models (LMs) often exhibit a Western-centric bias and struggle to represent diverse cultural knowledge. Previous attempts to address this rely on synthetic data and express cultural knowledge only in English. In this work, we study whether a small amount of human-written, multilingual cultural preference data can improve LMs across various model families and sizes. We first introduce CARE, a multilingual resource of 24.1k responses with human preferences on 2,580 questions about Chinese and Arab cultures, all carefully annotated by native speakers and offering more balanced coverage. Using CARE, we demonstrate that cultural alignment improves existing LMs beyond generic resources without compromising general capabilities. Moreover, we evaluate the cultural awareness of LMs, native speakers, and retrieved web content when queried in different languages. Our experiment reveals regional disparities among LMs, which may also be reflected in the documentation gap: native speakers often take everyday cultural commonsense and social norms for granted, while non-natives are more likely to actively seek out and document them. CARE is publicly available at https://github.com/Guochry/CARE (we plan to add Japanese data in the near future).

Exploring Kernel Transformations for Implicit Neural Representations

Sheng Zheng,Chaoning Zhang,Dongshen Han,Fachrina Dewi Puspitasari,Xinhong Hao,Yang Yang,Heng Tao Shen

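Task: Explore how kernel transformations of the input/output affect the performance of implicit neural representations (INRs).

Motivation: Existing work mainly studies the effect of components inside the model (such as the activation function) and overlooks the role of kernel transformations applied to the input/output of INRs.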

Details

Method: Keep the model unchanged, study how kernel transformations of the input/output (such as scale and shift) affect INRs, and derive a simple, effective method. Result: Scale and shift transformations significantly boost INR performance with negligible computational overhead; the gains are interpreted from the perspectives of depth and normalization. Conclusion: The work offers a new avenue for understanding and improving INRs through kernel transformation. Abstract: Implicit neural representations (INRs), which leverage neural networks to represent signals by mapping coordinates to their corresponding attributes, have garnered significant attention. They are extensively utilized for image representation, with pixel coordinates as input and pixel values as output. In contrast to prior works that investigate the effect of the model's internal components (the activation function, for instance), this work pioneers the exploration of the effect of kernel transformation of input/output while keeping the model itself unchanged. A byproduct of our findings is a simple yet effective method that combines scale and shift to significantly boost INR with negligible computation overhead. Moreover, we present two perspectives, depth and normalization, to interpret the performance benefits caused by scale and shift transformation. Overall, our work provides a new avenue for future works to understand and improve INR through the lens of kernel transformation.
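
The abstract does not give the exact form of the scale and shift transformation, so the following is only a sketch under the assumption that it is an affine map applied elementwise to the INR's input coordinates and target values, and inverted on the model output. The function names and the constants 2.0, 0.5, and -1.0 are hypothetical.

```python
def scale_shift(values, scale, shift):
    """Apply the affine map v -> scale * v + shift to every value."""
    return [scale * v + shift for v in values]


def inverse_scale_shift(values, scale, shift):
    """Undo scale_shift, e.g. to map model outputs back to pixel range."""
    if scale == 0:
        raise ValueError("scale must be non-zero to be invertible")
    return [(v - shift) / scale for v in values]


def normalized_coords(n):
    """Return n evenly spaced pixel coordinates in [-1, 1]."""
    if n == 1:
        return [0.0]
    return [2 * i / (n - 1) - 1 for i in range(n)]


coords = normalized_coords(5)                        # INR inputs before transformation
inputs = scale_shift(coords, scale=2.0, shift=0.5)   # transformed inputs fed to the unchanged model
pixels = [0.0, 0.25, 0.5, 0.75, 1.0]                 # toy pixel values in [0, 1]
targets = scale_shift(pixels, scale=2.0, shift=-1.0) # transformed training targets
recovered = inverse_scale_shift(targets, scale=2.0, shift=-1.0)
```

Because the transformation sits outside the network, it adds only a multiply and an add per value, which is consistent with the negligible overhead the abstract reports.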

Concise Reasoning via Reinforcement Learning

Mehdi Fatemi,Banafsheh Rafiee,Mingjie Tang,Kartik Talamadupula

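Task: Study how to reduce the number of tokens large language models (LLMs) use during reasoning while maintaining or improving accuracy.

Motivation: Reasoning models trained with reinforcement learning (RL) tend to generate long responses, which raises computational cost, resource requirements, and response time, yet longer responses do not necessarily improve accuracy.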

Details

Method: Show through mathematical analysis that RL training induces a tendency toward long responses, and introduce a second phase of RL post-training to reduce token usage. Result: Experiments show that the method significantly shortens the model's chain of thought while maintaining or improving accuracy. Conclusion: Conciseness and accuracy are naturally correlated, and a second RL phase can cut token usage efficiently without sacrificing performance. Abstract: Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding questions the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. Moreover, we show that introducing a secondary phase of RL post-training, using a small set of problems and limited resources, can significantly reduce a model's chain of thought while maintaining or even enhancing accuracy. Finally, we validate our conclusions through extensive experimental results.

Inverse++: Vision-Centric 3D Semantic Occupancy Prediction Assisted with 3D Object Detection

Zhenxing Ming,Julie Stephany Berrio,Mao Shan,Stewart Worrall

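Task: Explore multitask learning to improve 3D semantic occupancy prediction by adding an auxiliary 3D object detection branch.

Motivation: Existing methods focus mainly on intricate internal module designs, whereas this work adds an extra 3D supervision signal that strengthens the model's ability to capture small dynamic objects in the scene, especially vulnerable road users (such as bicycles, motorcycles, and pedestrians), whose detection is crucial for autonomous driving safety.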

Details

Method: A multitask learning approach that adds an auxiliary 3D object detection branch, providing an extra 3D supervision signal that strengthens the ability of intermediate features to capture small dynamic objects. Result: On the nuScenes dataset the method reaches state-of-the-art performance, with an IoU of 31.73% and an mIoU of 20.91%, and excels at detecting vulnerable road users. Conclusion: By combining an extra 3D supervision signal with multitask learning, the method significantly improves 3D semantic occupancy prediction, particularly for small dynamic objects, which matters for autonomous driving safety. Abstract: 3D semantic occupancy prediction aims to forecast detailed geometric and semantic information of the surrounding environment for autonomous vehicles (AVs) using onboard surround-view cameras. Existing methods primarily focus on intricate inner structure module designs to improve model performance, such as efficient feature sampling and aggregation processes or intermediate feature representation formats. In this paper, we explore multitask learning by introducing an additional 3D supervision signal through an auxiliary 3D object detection branch. This extra 3D supervision signal enhances the model's overall performance by strengthening the capability of the intermediate features to capture small dynamic objects in the scene. These small dynamic objects often include vulnerable road users, namely bicycles, motorcycles, and pedestrians, whose detection is crucial for ensuring driving safety in autonomous vehicles. Extensive experiments conducted on the nuScenes dataset, including challenging rainy and nighttime scenarios, show that our approach attains state-of-the-art results, achieving an IoU score of 31.73% and an mIoU score of 20.91%, and excels at detecting vulnerable road users (VRU). The code will be made available at https://github.com/DanielMing123/Inverse++
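
The two reported scores can be read as follows. This sketch assumes the convention common in occupancy benchmarks, where IoU treats every non-free voxel as "occupied" (geometry only) and mIoU averages the per-class IoU over semantic classes; the abstract does not spell out its exact protocol, and the voxel labels below are hypothetical.

```python
def occupancy_iou(pred, target, free_label=0):
    """Class-agnostic IoU: a voxel counts as occupied if its label is not free."""
    pairs = list(zip(pred, target))
    inter = sum(1 for p, t in pairs if p != free_label and t != free_label)
    union = sum(1 for p, t in pairs if p != free_label or t != free_label)
    return inter / union if union else 0.0


def mean_iou(pred, target, classes):
    """Average the per-class IoU over the classes that appear in pred or target."""
    pairs = list(zip(pred, target))
    ious = []
    for c in classes:
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical flattened voxel labels: 0 = free, 1 = pedestrian, 2 = car.
pred = [0, 1, 1, 2, 0, 2]
target = [0, 1, 2, 2, 1, 0]
iou = occupancy_iou(pred, target)
miou = mean_iou(pred, target, classes=[1, 2])
```

Under this convention mIoU is typically lower than IoU, as in the reported 20.91% versus 31.73%, because a voxel must carry the right class as well as the right occupancy.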

Exploiting individual differences to bootstrap communication

Richard A. Blythe,Casimir Fisch

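Task: Study how a communication system can be bootstrapped from non-communicative behaviors when no pre-existing means of communication is available.

Motivation: Existing theories rely on feedback to explain how communication systems form, but feedback itself requires a pre-existing ability to communicate, so these theories cannot explain the initial bootstrapping.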

Details

Method: A model showing how, in a large population, a communication system capable of expressing an unbounded number of meanings emerges spontaneously from individual behavioral differences and shared intentionality. Result: The model shows that predictable individual behavior and shared intentionality suffice to bootstrap a communication system without any pre-existing means of communication. Conclusion: The formation of communication systems depends on general capacities for social cognition, which supports the view that large, flexible communication systems such as language are products of social cognition. Abstract: Establishing a communication system is hard because the intended meaning of a signal is unknown to its receiver when first produced, and the signaller also has no idea how that signal will be interpreted. Most theoretical accounts of the emergence of communication systems rely on feedback to reinforce behaviours that have led to successful communication in the past. However, providing such feedback requires already being able to communicate the meaning that was intended or interpreted. Therefore these accounts cannot explain how communication can be bootstrapped from non-communicative behaviours. Here we present a model that shows how a communication system, capable of expressing an unbounded number of meanings, can emerge as a result of individual behavioural differences in a large population without any pre-existing means to determine communicative success. The two key cognitive capabilities responsible for this outcome are behaving predictably in a given situation, and an alignment of psychological states ahead of signal production that derives from shared intentionality. Since both capabilities can exist independently of communication, our results are compatible with theories in which large flexible socially-learned communication systems like language are the product of a general but well-developed capacity for social cognition.

Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data

Samarth Mishra,Kate Saenko,Venkatesh Saligrama

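Task: Improve the compositional reasoning of multimodal large language models (MLLMs) by elucidating visual concepts through data.

Motivation: Current MLLMs still fall short of human performance on compositional reasoning (for example, distinguishing "dog chasing cat" from "cat chasing dog") and need improvement.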

Details

Method: SCRAMBLe, which preference-tunes MLLMs on synthetic preference data to improve compositional reasoning. Result: SCRAMBLe significantly improves MLLMs on compositionality benchmarks and yields small gains on general visual question answering. Conclusion: SCRAMBLe is an effective way to improve the compositional reasoning of MLLMs, and the code and models are open source. Abstract: Compositionality, or correctly recognizing scenes as compositions of atomic visual concepts, remains difficult for multimodal large language models (MLLMs). Even state-of-the-art MLLMs such as GPT-4o can make mistakes in distinguishing compositions like "dog chasing cat" vs "cat chasing dog". While MLLMs have made significant progress on Winoground, a benchmark for measuring such reasoning, they are still far from human performance. We show that compositional reasoning in these models can be improved by elucidating such concepts via data, where a model is trained to prefer the correct caption for an image over a close but incorrect one. We introduce SCRAMBLe: Synthetic Compositional Reasoning Augmentation of MLLMs with Binary preference Learning, an approach for preference tuning open-weight MLLMs on synthetic preference data generated in a fully automated manner from existing image-caption data. SCRAMBLe holistically improves these MLLMs' compositional reasoning capabilities, as shown by significant improvements across multiple vision-language compositionality benchmarks, as well as smaller but significant improvements on general question answering tasks. As a sneak peek, the SCRAMBLe-tuned Molmo-7B model improves on Winoground from 49.5% to 54.8% (best reported to date), while improving by ~1% on more general visual question answering tasks. Code for SCRAMBLe along with tuned models and our synthetic training dataset is available at https://github.com/samarth4149/SCRAMBLe.
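
To make the idea of a binary preference pair concrete, here is a toy sketch that builds a "close but incorrect" caption by swapping two words. This swap rule is an illustrative assumption, not the paper's generation pipeline, which the abstract describes only as fully automated; the function names and the chosen/rejected dictionary keys are hypothetical.

```python
def swap_words(caption, first, second):
    """Return the caption with every occurrence of the two words exchanged."""
    swapped = []
    for word in caption.split():
        if word == first:
            swapped.append(second)
        elif word == second:
            swapped.append(first)
        else:
            swapped.append(word)
    return " ".join(swapped)


def build_preference_pair(caption, first, second):
    """Pair the original caption (chosen) with its role-swapped variant (rejected)."""
    rejected = swap_words(caption, first, second)
    if rejected == caption:
        raise ValueError("swap did not change the caption; no hard negative produced")
    return {"chosen": caption, "rejected": rejected}


pair = build_preference_pair("dog chasing cat", "dog", "cat")
```

A preference-tuning objective would then train the model to score the chosen caption above the rejected one for the same image.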

Post-Training Language Models for Continual Relation Extraction

Sefika Efeoglu,Adrian Paschke,Sonja Schimmler

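Task: Study how to apply pre-trained language models (PLMs) to continual relation extraction (CRE) in order to address knowledge forgetting on dynamic data.

Motivation: Real-world data is dynamic and non-stationary; traditional relation extraction models rely on static datasets and adapt poorly to changing data, so continual learning methods are needed.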

Details

Method: Experiments with decoder-only models (such as Mistral-7B and Llama2-7B) and encoder-decoder models (such as Flan-T5 Base), using task-incremental fine-tuning and memory replay. Result: On TACRED, the decoder-only and encoder-decoder models outperform earlier encoder-only approaches, with Mistral and Flan-T5 standing out; on FewRel they rank second in whole and average accuracy. Conclusion: Pre-trained language models combined with memory replay effectively improve continual relation extraction and open a path toward dynamic, real-time relation extraction. Abstract: Real-world data, such as news articles, social media posts, and chatbot conversations, is inherently dynamic and non-stationary, presenting significant challenges for constructing real-time structured representations through knowledge graphs (KGs). Relation Extraction (RE), a fundamental component of KG creation, often struggles to adapt to evolving data when traditional models rely on static, outdated datasets. Continual Relation Extraction (CRE) methods tackle this issue by incrementally learning new relations while preserving previously acquired knowledge. This study investigates the application of pre-trained language models (PLMs), specifically large language models (LLMs), to CRE, with a focus on leveraging memory replay to address catastrophic forgetting. We evaluate decoder-only models (e.g., Mistral-7B and Llama2-7B) and encoder-decoder models (e.g., Flan-T5 Base) on the TACRED and FewRel datasets. Task-incremental fine-tuning of LLMs demonstrates superior performance over earlier approaches using encoder-only models like BERT on TACRED, excelling in seen-task accuracy and overall performance (measured by whole and average accuracy), particularly with the Mistral and Flan-T5 models. Results on FewRel are similarly promising, achieving second place in whole and average accuracy metrics. This work underscores critical factors in knowledge transfer, language model architecture, and KG completeness, advancing CRE with LLMs and memory replay for dynamic, real-time relation extraction.

AnyArtisticGlyph: Multilingual Controllable Artistic Glyph Generation

Xiongbo Lu,Yaxiong Chen,Shengwu Xiong

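Task: Propose AnyArtisticGlyph, a diffusion-based method for multilingual, controllable artistic glyph generation.

Motivation: Existing artistic glyph generation methods produce blurred or incorrect textures in fine details, so finer control and stronger generation capability are needed.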

Details

Method: A font fusion and embedding module generates latent features, the CLIP model encodes reference images and fuses them with text embeddings, and a coarse-grained feature-level loss improves generation accuracy. Result: Experiments show that the method generates natural, richly detailed artistic glyph images with state-of-the-art performance. Conclusion: By improving detail quality and controllability, AnyArtisticGlyph advances artistic glyph generation, and it will be open-sourced to promote progress in text generation technology. Abstract: Artistic Glyph Image Generation (AGIG) differs from current creativity-focused generation models by offering finely controllable deterministic generation. It transfers the style of a reference image to a source while preserving its content. Although advanced and promising, current methods reveal flaws when synthesized image details are scrutinized, often producing blurred or incorrect textures, which poses a significant challenge. Hence, we introduce AnyArtisticGlyph, a diffusion-based, multilingual controllable artistic glyph generation model. It includes a font fusion and embedding module, which generates latent features for detailed structure creation, and a vision-text fusion and embedding module that uses the CLIP model to encode references and blends them with transformation caption embeddings for seamless global image generation. Moreover, we incorporate a coarse-grained feature-level loss to enhance generation accuracy. Experiments show that it produces natural, detailed artistic glyph images with state-of-the-art performance. Our project will be open-sourced on https://github.com/jiean001/AnyArtisticGlyph to advance text generation technology.

Proposing TAGbank as a Corpus of Tree-Adjoining Grammar Derivations

Jungyeul Park

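Task: Build TAGbank, a large-scale corpus based on Tree-Adjoining Grammar (TAG), to support syntactic and semantic analysis in natural language processing.

Motivation: Existing large-scale corpora (such as the Penn Treebank and Universal Dependencies) are based mainly on phrase structure and dependency grammar; corpora grounded in lexicalized grammar formalisms such as TAG are lacking.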

Details

Method: A methodology for automatically mapping phrase-structure annotations to TAG derivations, using the generative power of TAG to support parsing, grammar induction, and semantic analysis. Result: The paper introduces TAGbank and discusses the challenges of the extraction process, such as ensuring consistency across treebank schemes and handling language-specific syntactic idiosyncrasies. Conclusion: TAGbank offers a robust resource for computational tasks and contributes to the theoretical understanding of TAG's generative capacity; a future extension to multilingual corpora is planned. Abstract: The development of lexicalized grammars, particularly Tree-Adjoining Grammar (TAG), has significantly advanced our understanding of syntax and semantics in natural language processing (NLP). While existing syntactic resources like the Penn Treebank and Universal Dependencies offer extensive annotations for phrase-structure and dependency parsing, there is a lack of large-scale corpora grounded in lexicalized grammar formalisms. To address this gap, we introduce TAGbank, a corpus of TAG derivations automatically extracted from existing syntactic treebanks. This paper outlines a methodology for mapping phrase-structure annotations to TAG derivations, leveraging the generative power of TAG to support parsing, grammar induction, and semantic analysis. Our approach builds on the work of CCGbank, extending it to incorporate the unique structural properties of TAG, including its transparent derivation trees and its ability to capture long-distance dependencies. We also discuss the challenges involved in the extraction process, including ensuring consistency across treebank schemes and dealing with language-specific syntactic idiosyncrasies. Finally, we propose the future extension of TAGbank to include multilingual corpora, focusing on the Penn Korean and Penn Chinese Treebanks, to explore the cross-linguistic application of TAG's formalism. By providing a robust, derivation-based resource, TAGbank aims to support a wide range of computational tasks and contribute to the theoretical understanding of TAG's generative capacity.

Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions

He Zhu,Quyu Kong,Kechun Xu,Xunlong Xia,Bing Deng,Jieping Ye,Rong Xiong,Yue Wang

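Task: Ground 3D object affordance, that is, locate where objects in 3D space can be manipulated, based on language instructions, visual observations, and interactions.

Motivation: Affordance grounding links perception and action for embodied agents; for example, an intelligent robot must accurately identify an object's affordance and grasp it according to human instructions.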

Details

Method: LMAffordance3D, a multi-modal, language-guided 3D affordance grounding network that fuses 2D and 3D spatial features with semantic features. Result: Experiments on the AGPIL dataset confirm the effectiveness and superiority of the method, including in unseen experimental settings. Conclusion: The proposed task and method offer a new approach to 3D object affordance grounding and perform well in experiments. Abstract: Grounding 3D object affordance is a task that locates objects in 3D space where they can be manipulated, which links perception and action for embodied intelligence. For example, for an intelligent robot, it is necessary to accurately ground the affordance of an object and grasp it according to human instructions. In this paper, we introduce a novel task that grounds 3D object affordance based on language instructions, visual observations and interactions, which is inspired by cognitive science. We collect an Affordance Grounding dataset with Points, Images and Language instructions (AGPIL) to support the proposed task. In the 3D physical world, due to observation orientation, object rotation, or spatial occlusion, we can only get a partial observation of the object. The dataset therefore includes affordance estimations of objects from full-view, partial-view, and rotation-view perspectives. To accomplish this task, we propose LMAffordance3D, the first multi-modal, language-guided 3D affordance grounding network, which applies a vision-language model to fuse 2D and 3D spatial features with semantic features. Comprehensive experiments on AGPIL demonstrate the effectiveness and superiority of our method on this task, even in unseen experimental settings. Our project is available at https://sites.google.com/view/lmaffordance3d.

NoveltyBench: Evaluating Creativity and Diversity in Language Models

Yiming Zhang,Harshita Diddee,Susan Holm,Hanchen Liu,Xinyue Liu,Vinay Samuel,Barry Wang,Daphne Ippolito

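Task: Evaluate the ability of language models to generate diverse and novel outputs.

Motivation: Current language models perform well on standard benchmarks but suffer from mode collapse when asked for diverse, novel outputs, which limits their practical use.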

Details

Method: Introduce the NoveltyBench benchmark, which uses curated prompts and real user queries to evaluate the diversity of 20 leading language models. Result: Current state-of-the-art models generate significantly less diversity than human writers, and larger models within a family are often less diverse than smaller ones. Conclusion: New training and evaluation paradigms are needed to improve model creativity while preserving quality. Abstract: Language models have demonstrated remarkable capabilities on standard benchmarks, yet they increasingly struggle with mode collapse, the inability to generate diverse and novel outputs. Our work introduces NoveltyBench, a benchmark specifically designed to evaluate the ability of language models to produce multiple distinct and high-quality outputs. NoveltyBench utilizes prompts curated to elicit diverse answers and filtered real-world user queries. Evaluating 20 leading language models, we find that current state-of-the-art systems generate significantly less diversity than human writers. Notably, larger models within a family often exhibit less diversity than their smaller counterparts, challenging the notion that capability on standard benchmarks translates directly to generative utility. While prompting strategies like in-context regeneration can elicit diversity, our findings highlight a fundamental lack of distributional diversity in current models, reducing their utility for users seeking varied responses and suggesting the need for new training and evaluation paradigms that prioritize creativity alongside quality.

Two is Better than One: Efficient Ensemble Defense for Robust and Compact Models

Yoojin Jung,Byung Cheol Song

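Task: Propose Efficient Ensemble Defense (EED), a technique that improves the adversarial robustness and efficiency of deep learning models in resource-constrained environments.

Motivation: Deep learning models deployed on resource-constrained devices are vulnerable to adversarial attacks, and this vulnerability needs to be addressed.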

Details

Method: Diversify the compression of a single base model using different pruning importance scores, enhance ensemble diversity, and dynamically determine the number of sub-models needed at inference time. Result: On CIFAR-10 and SVHN, EED is more robust than existing adversarial pruning techniques and speeds up inference by up to 1.86 times. Conclusion: EED is an efficient defense solution for resource-constrained environments. Abstract: Deep learning-based computer vision systems adopt complex and large architectures to improve performance, yet they face challenges in deployment on resource-constrained mobile and edge devices. To address this issue, model compression techniques such as pruning, quantization, and matrix factorization have been proposed; however, these compressed models are often highly vulnerable to adversarial attacks. We introduce the Efficient Ensemble Defense (EED) technique, which diversifies the compression of a single base model based on different pruning importance scores and enhances ensemble diversity to achieve high adversarial robustness and resource efficiency. EED dynamically determines the number of necessary sub-models during the inference stage, minimizing unnecessary computations while maintaining high robustness. On the CIFAR-10 and SVHN datasets, EED demonstrated state-of-the-art robustness performance compared to existing adversarial pruning techniques, along with an inference speed improvement of up to 1.86 times. This shows that EED is a powerful defense solution in resource-constrained environments.
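
One way to picture "dynamically determining the number of sub-models" is an early-exit loop. The abstract does not state EED's actual stopping rule, so the confidence threshold below is purely an assumption for illustration, and the sub-models are hypothetical callables that return class probabilities.

```python
def dynamic_ensemble_predict(sub_models, x, confidence_threshold=0.9):
    """Query sub-models one by one; stop once the averaged top probability is high enough.

    Returns (predicted_label, number_of_sub_models_used).
    """
    if not sub_models:
        raise ValueError("at least one sub-model is required")
    totals = None
    averaged = []
    used = 0
    for model in sub_models:
        probs = list(model(x))
        totals = probs if totals is None else [a + b for a, b in zip(totals, probs)]
        used += 1
        averaged = [t / used for t in totals]
        if max(averaged) >= confidence_threshold:
            break  # confident enough: skip the remaining sub-models
    return averaged.index(max(averaged)), used


# Hypothetical sub-models standing in for differently pruned copies of one base model.
confident = [lambda x: [0.95, 0.05], lambda x: [0.90, 0.10]]
uncertain = [lambda x: [0.6, 0.4], lambda x: [0.8, 0.2], lambda x: [0.7, 0.3]]

easy_label, easy_used = dynamic_ensemble_predict(confident, x=None)
hard_label, hard_used = dynamic_ensemble_predict(uncertain, x=None)
```

Easy inputs exit after one sub-model while ambiguous ones consult the whole ensemble, which is how such a scheme could save computation on average.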

LLM-based Automated Grading with Human-in-the-Loop

Hang Li,Yucheng Chu,Kaiqi Yang,Yasemin Copur-Gencturk,Jiliang Tang

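Task: Explore using the interactive capabilities of large language models (LLMs) in a human-in-the-loop (HITL) approach to improve automatic short answer grading (ASAG).

Motivation: Existing LLM-based ASAG methods struggle to reach human-level grading in rubric-based assessment, so a more effective approach is needed.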

Details

Method: The GradeHITL framework, which uses the generative ability of LLMs to pose questions to human experts and refines grading rubrics dynamically. Result: GradeHITL significantly improves grading accuracy, outperforms existing methods, and approaches human-level grading. Conclusion: A human-in-the-loop approach effectively improves ASAG and gives education a more reliable automatic grading tool. Abstract: The rise of artificial intelligence (AI) technologies, particularly large language models (LLMs), has brought significant advancements to the field of education. Among various applications, automatic short answer grading (ASAG), which focuses on evaluating open-ended textual responses, has seen remarkable progress with the introduction of LLMs. These models not only enhance grading performance compared to traditional ASAG approaches but also move beyond simple comparisons with predefined "golden" answers, enabling more sophisticated grading scenarios, such as rubric-based evaluation. However, existing LLM-powered methods still face challenges in achieving human-level grading performance in rubric-based assessments due to their reliance on fully automated approaches. In this work, we explore the potential of LLMs in ASAG tasks by leveraging their interactive capabilities through a human-in-the-loop (HITL) approach. Our proposed framework, GradeHITL, utilizes the generative properties of LLMs to pose questions to human experts, incorporating their insights to refine grading rubrics dynamically. This adaptive process significantly improves grading accuracy, outperforming existing methods and bringing ASAG closer to human-level evaluation.

CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images

Cheng Chen,Jiacheng Wei,Tianrun Chen,Chi Zhang,Xiaofeng Yang,Shangzhan Zhang,Bingchen Yang,Chuan-Sheng Foo,Guosheng Lin,Qixing Huang,Fayao Liu

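Task: Reverse-engineer parametric CAD models from unconstrained real-world CAD images.

Motivation: Current methods rely on costly, labor-intensive 3D scanning and post-processing, so a user-friendly method that generates CAD models from easily captured images is needed.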

Details

Method: The CADCrafter framework, trained on synthetic textureless CAD data, uses a geometry encoder to capture geometric features and applies direct preference optimization (DPO) to enforce geometric validity. Result: Experiments show that the method robustly handles real unconstrained CAD images and generalizes to unseen general objects. Conclusion: CADCrafter offers an efficient solution with strong generalization for generating CAD models from real images. Abstract: Creating CAD digital twins from the physical world is crucial for manufacturing, design, and simulation. However, current methods typically rely on costly 3D scanning with labor-intensive post-processing. To provide a user-friendly design process, we explore the problem of reverse engineering from unconstrained real-world CAD images that can be easily captured by users of all experience levels. However, the scarcity of real-world CAD data poses challenges in directly training such models. To tackle these challenges, we propose CADCrafter, an image-to-parametric CAD model generation framework that trains solely on synthetic textureless CAD data while testing on real-world images. To bridge the significant representation disparity between images and parametric CAD models, we introduce a geometry encoder to accurately capture diverse geometric features. Moreover, the texture-invariant properties of the geometric features can also facilitate the generalization to real-world scenarios. Since compiling CAD parameter sequences into explicit CAD models is a non-differentiable process, the network training inherently lacks explicit geometric supervision. To impose geometric validity constraints, we employ direct preference optimization (DPO) to fine-tune our model with automatic code checker feedback on CAD sequence quality. Furthermore, we collected a real-world dataset, comprising multi-view images and corresponding CAD command sequence pairs, to evaluate our method. Experimental results demonstrate that our approach can robustly handle real unconstrained CAD images, and even generalize to unseen general objects.

Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models

Yang Yan,Yu Lu,Renjun Xu,Zhenzhong Lan

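Task: Investigate whether large language models (LLMs) genuinely learn mathematical principles or merely memorize patterns.

Motivation: LLMs score highly on benchmarks yet fail on simple problems, which calls into question what they have actually learned.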

Details

Method: Test commutativity and compositional generalization on elementary two-integer addition (operands from 0 to 2^64), using isomorphic symbolic mappings. Result: LLMs do well on numerical addition (73.8-99.8% accuracy), but accuracy collapses to at most 7.5% under symbolic mapping, and commutativity is violated in a large number of cases. Conclusion: Current LLMs rely on memorized patterns rather than genuinely learned rules, and new approaches are needed for true mathematical reasoning. Abstract: Despite high benchmark scores, Large Language Models (LLMs) often fail simple problems, raising a critical question: Do LLMs learn mathematical principles or merely memorize patterns? Rather than designing increasingly complex benchmarks like recent works, we investigate this using elementary two-integer addition ($0$ to $2^{64}$), probing two core properties: commutativity ($A+B=B+A$) and compositional generalization (via isomorphic symbolic mappings, e.g., $7 \rightarrow y$). While state-of-the-art LLMs achieve 73.8-99.8% accuracy on numerical addition, performance collapses to $\leq$7.5% under symbolic mapping, indicating failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of $A+B \neq B+A$) further support this. Explicitly providing addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, suggesting LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate current LLMs rely on pattern memorization rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
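
The two probes can be sketched in a few lines. The mapping below replaces each decimal digit with a distinct symbol (in the spirit of the abstract's 7 -> y example), and the checker counts operand pairs for which an adder disagrees with itself when the operands are swapped. The symbol alphabet, the seed, and the deliberately faulty adder are hypothetical stand-ins; in the study the adder would be an LLM.

```python
import random

DIGITS = "0123456789"


def make_symbol_mapping(symbols, seed=0):
    """Build a bijection from decimal digits to ten distinct symbols."""
    if len(symbols) != 10 or len(set(symbols)) != 10:
        raise ValueError("exactly ten distinct symbols are required")
    shuffled = list(symbols)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(DIGITS, shuffled))


def encode(n, mapping):
    """Rewrite a non-negative integer digit by digit in the symbol alphabet."""
    return "".join(mapping[d] for d in str(n))


def decode(text, mapping):
    """Invert encode()."""
    inverse = {symbol: digit for digit, symbol in mapping.items()}
    return int("".join(inverse[ch] for ch in text))


def commutativity_violations(add_fn, pairs):
    """Return the operand pairs for which add_fn(a, b) != add_fn(b, a)."""
    return [(a, b) for a, b in pairs if add_fn(a, b) != add_fn(b, a)]


mapping = make_symbol_mapping("abcdefghij")
big = 2**64
round_trip = decode(encode(big, mapping), mapping)

pairs = [(3, 3), (7, 12), (2**64, 5)]
exact = commutativity_violations(lambda a, b: a + b, pairs)
# A deliberately order-sensitive adder, standing in for a model that memorized patterns.
faulty = commutativity_violations(lambda a, b: a + b if a <= b else a + b + 1, pairs)
```

Because the symbolic task is isomorphic to ordinary addition, a system that had learned the rule itself should score the same on both, which is why the reported collapse is informative.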

Continuous Locomotive Crowd Behavior Generation

Inhwan Bae,Junoh Lee,Hae-Gon Jeon

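Task: Propose a new method for automatically generating continuous, realistic crowd trajectories that model heterogeneous behaviors and interactions among individuals.

Motivation: Conventional methods struggle to reproduce the continuous nature of real-world crowds, yet simulating crowd behavior matters in psychology, robotics, transport engineering, and virtual environments.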

Details

Method: A crowd emitter model and a crowd simulator that use diffusion models and a Markov chain to generate heterogeneous behaviors and long-term locomotion. Result: The method effectively models diverse crowd behavior patterns and generalizes well across different geographical environments. Conclusion: The proposed framework generates high-quality crowd behaviors, all of its components are user-controllable, and the code is publicly available. Abstract: Modeling and reproducing crowd behaviors are important in various domains including psychology, robotics, transport engineering and virtual environments. Conventional methods have focused on synthesizing momentary scenes, which have difficulty in replicating the continuous nature of real-world crowds. In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. We first design a crowd emitter model. To do this, we obtain spatial layouts from single input images, including a segmentation map, appearance map, population density map and population probability, prior to crowd generation. The emitter then continually places individuals on the timeline by assigning independent behavior characteristics such as agents' type, pace, and start/end positions using diffusion models. Next, our crowd simulator produces their long-term locomotions. To simulate diverse actions, it can augment their behaviors based on a Markov chain. As a result, our overall framework populates the scenes with heterogeneous crowd behaviors by alternating between the proposed emitter and simulator. Note that all the components in the proposed framework are user-controllable. Lastly, we propose a benchmark protocol to evaluate the realism and quality of the generated crowds in terms of the scene-level population dynamics and the individual-level trajectory accuracy. We demonstrate that our approach effectively models diverse crowd behavior patterns and generalizes well across different geographical environments. Code is publicly available at https://github.com/InhwanBae/CrowdES.
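
As a rough picture of behavior augmentation "based on a Markov chain", the sketch below samples a sequence of behavior states from a transition table. The state names (walk, stop, run) and all probabilities are hypothetical; the abstract does not list the actual states or how the transitions are obtained.

```python
import random


def sample_behaviors(transitions, start, steps, seed=0):
    """Sample a behavior sequence where each state depends only on the previous one."""
    rng = random.Random(seed)  # seeded so a given configuration is reproducible
    state = start
    sequence = [state]
    for _ in range(steps):
        next_states = list(transitions[state].keys())
        weights = list(transitions[state].values())
        state = rng.choices(next_states, weights=weights, k=1)[0]
        sequence.append(state)
    return sequence


# Hypothetical transition probabilities; each row sums to 1.
TRANSITIONS = {
    "walk": {"walk": 0.80, "stop": 0.15, "run": 0.05},
    "stop": {"walk": 0.60, "stop": 0.40},
    "run": {"walk": 0.50, "run": 0.50},
}

behaviors = sample_behaviors(TRANSITIONS, start="walk", steps=20, seed=42)
```

Exposing the transition table as plain data is one simple way to keep such a component user-controllable, in line with the abstract's remark about controllability.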

Enhancing LLM-Based Short Answer Grading with Retrieval-Augmented Generation

Yucheng Chu,Peng He,Hang Li,Haoyu Han,Kaiqi Yang,Yu Xue,Tingting Li,Joseph Krajcik,Jiliang Tang

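Task: Propose an adaptive retrieval-augmented generation (RAG) framework for automated grading that dynamically retrieves and incorporates domain-specific knowledge.

Motivation: The limited domain knowledge of large language models (LLMs) restricts their understanding of task-specific requirements and keeps them from achieving satisfactory grading performance.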

Details

Method: Combine semantic search with curated educational sources to dynamically retrieve valuable reference materials and incorporate them into the grading process. Result: Experiments on a science education dataset show that the system improves grading accuracy over baseline LLM approaches. Conclusion: RAG-enhanced grading systems can serve as reliable support tools and deliver efficient performance gains. Abstract: Short answer assessment is a vital component of science education, allowing evaluation of students' complex three-dimensional understanding. Large language models (LLMs) that possess human-like ability in linguistic tasks are increasingly popular in assisting human graders to reduce their workload. However, LLMs' limitations in domain knowledge restrict their understanding of task-specific requirements and hinder their ability to achieve satisfactory performance. Retrieval-augmented generation (RAG) emerges as a promising solution by enabling LLMs to access relevant domain-specific knowledge during assessment. In this work, we propose an adaptive RAG framework for automated grading that dynamically retrieves and incorporates domain-specific knowledge based on the question and student answer context. Our approach combines semantic search and curated educational sources to retrieve valuable reference materials. Experimental results on a science education dataset demonstrate that our system achieves an improvement in grading accuracy compared to baseline LLM approaches. The findings suggest that RAG-enhanced grading systems can serve as reliable support with efficient performance gains.
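
The retrieval step of such a pipeline can be sketched as a cosine-similarity search followed by prompt assembly. Everything here is an assumption made for illustration: the three-dimensional vectors stand in for embeddings from a real encoder, the reference snippets are invented, and the prompt layout is hypothetical rather than the paper's.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, corpus, k=2):
    """Return the texts of the k corpus entries most similar to the query vector."""
    ranked = sorted(corpus, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_grading_prompt(question, answer, references):
    """Assemble a grading prompt that places retrieved references before the answer."""
    ref_block = "\n".join(f"- {ref}" for ref in references)
    return f"References:\n{ref_block}\n\nQuestion: {question}\nStudent answer: {answer}\nGrade:"


# Hypothetical reference snippets with toy embedding vectors.
corpus = [
    ("photosynthesis note", [0.9, 0.1, 0.0]),
    ("plate tectonics note", [0.0, 1.0, 0.0]),
    ("chlorophyll note", [0.7, 0.3, 0.1]),
]
top = retrieve([1.0, 0.0, 0.0], corpus, k=2)
prompt = build_grading_prompt("How do plants make food?", "They use sunlight.", top)
```

Embedding the question together with the student answer, as the abstract describes, would let the retrieved references reflect what the student actually wrote.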

Enhancing Leaf Disease Classification Using GAT-GCN Hybrid Model

Shyam Sundhar,Riya Sharma,Priyansh Maheshwari,Suvidha Rupesh Kumar,T. Sunil Kumar

Task: Propose a hybrid model combining Graph Attention Networks (GATs) and Graph Convolution Networks (GCNs) for leaf disease classification.

Motivation: The rising risk of crop diseases creates an urgent need for efficient, low-intervention disease identification methods that support sustainable agriculture.

Details

Method: Uses superpixel segmentation for feature extraction, combines GCNs and GATs, and applies edge augmentation and weight initialization techniques to optimize training. Result: The model performs well on apple, potato, and sugarcane leaf disease classification, with F1-scores of 0.9818, 0.9743, and 0.8799, respectively. Conclusion: The model is robust and high-performing, and could support sustainable agriculture goals through precise disease detection. Abstract: Agriculture plays a critical role in the global economy, providing livelihoods and ensuring food security for billions. As innovative agricultural practices become more widespread, the risk of crop diseases has increased, highlighting the urgent need for efficient, low-intervention disease identification methods. This research presents a hybrid model combining Graph Attention Networks (GATs) and Graph Convolution Networks (GCNs) for leaf disease classification. GCNs have been widely used for learning from graph-structured data, and GATs enhance this by incorporating attention mechanisms to focus on the most important neighbors. The methodology integrates superpixel segmentation for efficient feature extraction, partitioning images into meaningful, homogeneous regions that better capture localized features. The authors have employed an edge augmentation technique to enhance the robustness of the model. The edge augmentation technique has introduced a significant degree of generalization in the detection capabilities of the model. To further optimize training, weight initialization techniques are applied. The hybrid model is evaluated against the individual GCN and GAT models. It achieved a precision of 0.9822, recall of 0.9818, and F1-score of 0.9818 in apple leaf disease classification, a precision of 0.9746, recall of 0.9744, and F1-score of 0.9743 in potato leaf disease classification, and a precision of 0.8801, recall of 0.8801, and F1-score of 0.8799 in sugarcane leaf disease classification. These results demonstrate the robustness and performance of the model, suggesting its potential to support sustainable agricultural practices through precise and effective disease detection. This work is a small step towards reducing the loss of crops and hence supporting sustainable goals of zero hunger and life on land.

Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations

Pedro Ferreira,Wilker Aziz,Ivan Titov

Task: Study how preference optimization affects the faithfulness of explanations generated by large language models (LLMs).

Motivation: Preference optimization can lead LLMs to produce unfaithful explanations, undermining trust in collaboration between models and humans.

Details

Method: Enriches the reward model's (RM's) input with a causal attribution of the prediction so that it can detect inconsistencies between the explanation and the decision process. Result: In controlled settings, the approach reduces the LLM's tendency to generate misleading explanations. Conclusion: Augmenting the RM's input effectively improves the faithfulness of LLM-generated explanations. Abstract: Chain-of-thought explanations are widely used to inspect the decision process of large language models (LLMs) and to evaluate the trustworthiness of model outputs, making them important for effective collaboration between LLMs and humans. We demonstrate that preference optimization, a key step in the alignment phase, can inadvertently reduce the faithfulness of these explanations. This occurs because the reward model (RM), which guides alignment, is tasked with optimizing both the expected quality of the response and the appropriateness of the explanations (e.g., minimizing bias or adhering to safety standards), creating potential conflicts. The RM lacks a mechanism to assess the consistency between the model's internal decision process and the generated explanation. Consequently, the LLM may engage in "reward hacking" by producing a final response that scores highly while giving an explanation tailored to maximize reward rather than accurately reflecting its reasoning. To address this issue, we propose enriching the RM's input with a causal attribution of the prediction, allowing the RM to detect discrepancies between the generated self-explanation and the model's decision process. In controlled settings, we show that this approach reduces the tendency of the LLM to generate misleading explanations.

Bottom-Up Scattering Information Perception Network for SAR target recognition

Chenxi Zhao,Daochang Wang,Siqian Zhang,Gangyao Kuang

Task: Propose a novel bottom-up scattering information perception network for more interpretable target recognition in SAR images.

Motivation: Existing deep learning methods do not sufficiently perceive and mine the scattering information in SAR images, which causes performance bottlenecks and poor robustness.

Details

Method: Builds a dedicated interpretation network for SAR images, comprising a localized scattering perceptron that replaces the CNN-based backbone feature extractor and an unsupervised scattering part feature extraction model. Result: The method's performance is validated on the FAST-Vehicle and SAR-ACD datasets. Conclusion: Aggregating knowledge of target parts into a complete target description improves the model's interpretability and discriminative ability. Abstract: Deep learning-based methods for synthetic aperture radar (SAR) image target recognition have been widely studied. The existing deep methods are insufficient to perceive and mine the scattering information of SAR images, resulting in performance bottlenecks and poor robustness of the algorithms. To this end, this paper proposes a novel bottom-up scattering information perception network for more interpretable target recognition by constructing a dedicated interpretation network for SAR images. Firstly, the localized scattering perceptron is proposed to replace the backbone feature extractor based on CNN networks to deeply mine the underlying scattering information of the target. Then, an unsupervised scattering part feature extraction model is proposed to robustly characterize the target scattering part information and provide fine-grained target representation. Finally, by aggregating the knowledge of target parts to form the complete target description, the interpretability and discriminative ability of the model are improved. We perform experiments on the FAST-Vehicle dataset and the SAR-ACD dataset to validate the performance of the proposed method.

Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models

Runpeng Dai,Run Yang,Fan Zhou,Hongtu Zhu

Task: Propose a novel stability measure for evaluating the stability of large language models (LLMs) under various perturbations.

Motivation: Although LLMs and VLMs excel at task understanding and problem-solving, their real-world reliability depends on their stability, which remains underexplored.

Details

Method: Inspired by statistical methods rooted in information geometry, proposes a stability measure with invariance properties for analyzing model sensitivity to parameter and input perturbations. Result: Experiments confirm the measure's effectiveness: it identifies salient parameters and vulnerable regions in inputs, and it improves robustness during model merging. Conclusion: The proposed stability framework can both assess the stability of LLMs and improve performance through model merging. Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have become essential to general artificial intelligence, exhibiting remarkable capabilities in task understanding and problem-solving. However, the real-world reliability of these models critically depends on their stability, which remains an underexplored area. Despite their widespread use, rigorous studies examining the stability of LLMs under various perturbations are still lacking. In this paper, we address this gap by proposing a novel stability measure for LLMs, inspired by statistical methods rooted in information geometry. Our measure possesses desirable invariance properties, making it well-suited for analyzing model sensitivity to both parameter and input perturbations. To assess the effectiveness of our approach, we conduct extensive experiments on models ranging in size from 1.5B to 13B parameters. Our results demonstrate the utility of our measure in identifying salient parameters and detecting vulnerable regions in input images or critical dimensions in token embeddings. Furthermore, leveraging our stability framework, we enhance model robustness during model merging, leading to improved performance.

OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance

Chaoyi Wang,Baoqing Li,Xinhan Di

Task: Propose OCC-MLLM-CoT-Alpha, a multi-modal large vision-language framework for understanding occluded objects.

Motivation: Existing large-scale visual-language multi-modal models perform poorly at understanding occluded objects and need improvement.

Details

Method: Combines 3D-aware supervision with Chain-of-Thoughts guidance to build a multi-modal large vision-language model framework, trained with supervised and reinforcement learning strategies. Result: In the evaluation, the method improves decision scores by 14.62% to 16.98% in one setting and by 3.63% to 10.70% in the other, across a variety of state-of-the-art models. Conclusion: The proposed framework significantly improves occluded object recognition. Abstract: Comprehending occluded objects is not well studied in existing large-scale visual-language multi-modal models. Current state-of-the-art multi-modal large models struggle to provide satisfactory results in understanding occluded objects through universal visual encoders and supervised learning strategies. Therefore, we propose OCC-MLLM-CoT-Alpha, a multi-modal large vision language framework that integrates 3D-aware supervision and Chain-of-Thoughts guidance. Particularly, (1) we build a multi-modal large vision-language model framework which consists of a large multi-modal vision-language model and a 3D reconstruction expert model. (2) The corresponding multi-modal Chain-of-Thoughts is learned through a combination of supervised and reinforcement training strategies, allowing the multi-modal vision-language model to enhance the recognition ability with learned multi-modal chain-of-thoughts guidance. (3) A large-scale multi-modal chain-of-thoughts reasoning dataset, consisting of 110k samples of occluded objects held in hand, is built. In the evaluation, the proposed methods demonstrate decision score improvements of 15.75%, 15.30%, 16.98%, 14.62%, and 4.42%, 3.63%, 6.94%, 10.70% for two settings of a variety of state-of-the-art models.

CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward

Zhiqiang Wang,Pengbin Feng,Yanbin Lin,Shuzhang Cai,Zongao Bian,Jinghua Yan,Xingquan Zhu

Task: Propose Fuzzy Group Relative Policy Reward (FGRPR), a new framework that combines Group Relative Policy Optimization (GRPO) with a fuzzy reward function to improve learning efficiency.

Motivation: The conventional binary 0/1 accuracy reward gives no incentive for more precise outputs, whereas a fuzzy reward model provides finer-grained incentives.

Details

Method: Combines GRPO with a fuzzy reward function to form the FGRPR framework and applies it to Qwen2.5-VL models. Result: FGRPR surpasses all baselines (including GPT4o, LLaMA2, and SFT) on five in-domain datasets; on an out-of-domain dataset it performs comparably to SFT but does better when target values are larger. Conclusion: FGRPR suits tasks where answer precision is critical, and its fuzzy reward function effectively improves model performance. Abstract: We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms compared to supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL (3B and 7B), surpasses all baseline models, including GPT4o, LLaMA2 (90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
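
To make the contrast between a binary reward and a fuzzy reward concrete, here is a small sketch. The abstract does not give the exact reward formula, so `fuzzy_reward` below is an assumed stand-in (a linear decay with relative counting error), not the FGRPR function itself. The group-relative normalization follows the usual GRPO recipe of standardizing rewards within a group of sampled responses.

```python
import statistics

def binary_reward(pred, target):
    """Conventional 0/1 accuracy reward: credit only for an exact match."""
    return 1.0 if pred == target else 0.0

def fuzzy_reward(pred, target):
    """Assumed fuzzy reward: 1 for an exact count, decaying linearly with relative error."""
    if target <= 0:
        return 1.0 if pred == target else 0.0
    return max(0.0, 1.0 - abs(pred - target) / target)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

target = 100                        # ground-truth crowd count (made-up value)
predictions = [100, 95, 80, 300]    # four sampled answers for the same image
binary = [binary_reward(p, target) for p in predictions]
fuzzy = [fuzzy_reward(p, target) for p in predictions]
advantages = group_relative_advantages(fuzzy)
```

Under the binary reward, the answers 95 and 300 are indistinguishable; under the fuzzy reward, the closer estimate earns more, so the group-relative advantages carry a usable ordering even when no sample is exactly right.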

Disentangling Instruction Influence in Diffusion Transformers for Parallel Multi-Instruction-Guided Image Editing

Hui Liu,Bin Zou,Suiyun Zhang,Kecheng Chen,Rui Liu,Haoliang Li

Task: Propose Instruction Influence Disentanglement (IID), a new framework for executing multiple image-editing instructions in parallel within a single denoising process.

Motivation: Existing approaches to multi-instruction editing suffer from accumulated errors and degraded quality, or from incomplete edits caused by instruction conflicts.

Details

Method: Analyzes self-attention in DiTs to identify attention patterns in multi-instruction settings, then derives instruction-specific attention masks that disentangle each instruction's influence. Result: Experiments show that IID reduces diffusion steps while improving fidelity and instruction completion. Conclusion: IID performs well on multi-instruction image editing; the code will be released upon acceptance of the paper. Abstract: Instruction-guided image editing enables users to specify modifications using natural language, offering more flexibility and control. Among existing frameworks, Diffusion Transformers (DiTs) outperform U-Net-based diffusion models in scalability and performance. However, while real-world scenarios often require concurrent execution of multiple instructions, step-by-step editing suffers from accumulated errors and degraded quality, and integrating multiple instructions with a single prompt usually results in incomplete edits due to instruction conflicts. We propose Instruction Influence Disentanglement (IID), a novel framework enabling parallel execution of multiple instructions in a single denoising process, designed for DiT-based models. By analyzing self-attention mechanisms in DiTs, we identify distinctive attention patterns in multi-instruction settings and derive instruction-specific attention masks to disentangle each instruction's influence. These masks guide the editing process to ensure localized modifications while preserving consistency in non-edited regions. Extensive experiments on open-source and custom datasets demonstrate that IID reduces diffusion steps while improving fidelity and instruction completion compared to existing baselines. The code will be publicly released upon the acceptance of the paper.

Erfan Shayegani,G M Shahariar,Sara Abdali,Lei Yu,Nael Abu-Ghazaleh,Yue Dong

Task: Study Role-Modality Attacks (RMA) on Multimodal Language Models (MMLMs) and how to defend against them.

Motivation: Existing alignment methods focus mainly on the assistant role and neglect the user role, and they assume a fixed input prompt structure, which leaves models vulnerable.

Details

Method: Introduces Role-Modality Attacks (RMA), which elicit harmful outputs through role confusion and changes to the image token position, and designs an adversarial training defense. Result: RMAs are effective across multiple vision-language models, and adversarial training substantially reduces the Attack Success Rate (ASR) while preserving general utility. Conclusion: RMAs expose a vulnerability of MMLMs, and adversarial training is an effective defense. Abstract: Multimodal Language Models (MMLMs) typically undergo post-training alignment to prevent harmful content generation. However, these alignment stages focus primarily on the assistant role, leaving the user role unaligned, and stick to a fixed input prompt structure of special tokens, leaving the model vulnerable when inputs deviate from these expectations. We introduce Role-Modality Attacks (RMA), a novel class of adversarial attacks that exploit role confusion between the user and assistant and alter the position of the image token to elicit harmful outputs. Unlike existing attacks that modify query content, RMAs manipulate the input structure without altering the query itself. We systematically evaluate these attacks across multiple Vision Language Models (VLMs) on eight distinct settings, showing that they can be composed to create stronger adversarial prompts, as also evidenced by their increased projection in the negative refusal direction in the residual stream, a property observed in prior successful attacks. Finally, for mitigation, we propose an adversarial training approach that makes the model robust against input prompt perturbations. By training the model on a range of harmful and benign prompts all perturbed with different RMA settings, it loses its sensitivity to Role Confusion and Modality Manipulation attacks and is trained to only pay attention to the content of the query in the input prompt structure, effectively reducing Attack Success Rate (ASR) while preserving the model's general utility.

Dynamic Vision Mamba

Mengxuan Wu,Zekai Li,Zhiyuan Liang,Moyang Li,Xuanlei Zhao,Samir Khaki,Zheng Zhu,Xiaojiang Peng,Konstantinos N. Plataniotis,Kai Wang,Wangbo Zhao,Yang You

Task: Propose Dynamic Vision Mamba (DyVM), a method for reducing spatial redundancy (both token and block redundancy) in Mamba-based vision models.

Motivation: Mamba-based vision models are computationally efficient but still contain spatial redundancy, which wastes computation.

Details

Method: Reduces redundancy through customized token pruning and dynamic selection of SSM blocks. Result: On Vim-S, FLOPs are reduced by 35.2% with only a 1.7% loss in accuracy. Conclusion: DyVM effectively reduces computational overhead and applies to a range of Mamba vision models and tasks. Abstract: Mamba-based vision models have gained extensive attention as a result of being computationally more efficient than attention-based models. However, spatial redundancy still exists in these models, represented by token and block redundancy. For token redundancy, we analytically find that early token pruning methods will result in inconsistency between training and inference or introduce extra computation for inference. Therefore, we customize token pruning to fit the Mamba structure by rearranging the pruned sequence before feeding it into the next Mamba block. For block redundancy, we allow each image to select SSM blocks dynamically based on an empirical observation that the inference speed of Mamba-based vision models is largely affected by the number of SSM blocks. Our proposed method, Dynamic Vision Mamba (DyVM), effectively reduces FLOPs with minor performance drops. We achieve a reduction of 35.2% in FLOPs with only a loss of accuracy of 1.7% on Vim-S. It also generalizes well across different Mamba vision model architectures and different vision tasks. Our code will be made public.
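
The abstract says only that the pruned sequence is rearranged before the next Mamba block, without giving details. The sketch below shows one plausible reading, stated here as an assumption: retained tokens are moved to the front in their original scan order and pruned tokens are placed after them, so that a sequential model processes the retained prefix identically whether the tail is masked (training) or dropped (inference). The scores, the keep ratio, and the function name are all hypothetical.

```python
def rearrange_for_pruning(tokens, scores, keep_ratio=0.5):
    """Split tokens into (kept, dropped), each preserving the original scan order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Indices of the highest-scoring tokens, then restored to scan order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    keep_idx = sorted(top)
    keep_set = set(keep_idx)
    kept = [tokens[i] for i in keep_idx]
    dropped = [tokens[i] for i in range(len(tokens)) if i not in keep_set]
    return kept, dropped

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
scores = [0.9, 0.1, 0.7, 0.2, 0.8, 0.3]   # made-up importance scores
kept, dropped = rearrange_for_pruning(tokens, scores, keep_ratio=0.5)

training_sequence = kept + dropped   # pruned tokens trail the retained ones
inference_sequence = kept            # the trailing block is simply not computed
```

Because the retained tokens form a prefix, nothing placed after them can influence their states in a left-to-right scan, which is one way to avoid the training/inference inconsistency the abstract mentions.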

TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images

Kaiyuan Hou,Minghui Zhao,Lilin Xu,Yuang Fan,Xiaofan Jiang

Task: Introduce TDBench, a benchmark for evaluating Vision-Language Models (VLMs) in top-down image understanding.

Motivation: Address the gap in evaluating VLMs for top-down images, which are valuable for tasks like autonomous navigation and spatial planning, but lack diverse datasets.

Details

Method: Construct TDBench using public top-down datasets and simulated images, including diverse real-world and synthetic scenarios, with visual question-answer pairs across ten evaluation dimensions. Result: TDBench reveals the strengths and limitations of existing VLMs through evaluation results and case studies. Conclusion: TDBench provides insights to motivate future research in top-down image understanding for VLMs. Abstract: The rapid emergence of Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling applications in scene comprehension and visual reasoning. While these models have been primarily evaluated and developed for front-view image understanding, their capabilities in interpreting top-down images have received limited attention, partly due to the scarcity of diverse top-down datasets and the challenges in collecting such data. Yet top-down vision provides explicit spatial overviews and improved contextual understanding of scenes, making it particularly valuable for tasks like autonomous navigation, aerial imaging, and spatial planning. In this work, we address this gap by introducing TDBench, a comprehensive benchmark for VLMs in top-down image understanding. TDBench is constructed from public top-down view datasets and high-quality simulated images, including diverse real-world and synthetic scenarios. TDBench consists of visual question-answer pairs across ten evaluation dimensions of image understanding. Moreover, we conduct four case studies that commonly happen in real-world scenarios but are less explored. By revealing the strengths and limitations of existing VLMs through evaluation results, we hope TDBench provides insights for motivating future research. Project homepage: https://github.com/Columbia-ICSL/TDBench

OrderChain: A General Prompting Paradigm to Improve Ordinal Understanding Ability of MLLM

Jinhong Wang,Shuo Tong,Jian liu,Dongqi Tang,Weiqiang Wang,Wentong Li,Hongxia Xu,Danny Chen,Jintai Chen,Jian Wu

Task: Improve the performance of multimodal large language models (MLLMs) on ordinal regression (OR) tasks with OrderChain.

Motivation: Despite remarkable progress on multimodal tasks, MLLMs still underperform on ordinal regression.

Details

Method: Proposes OrderChain, which comprises task-aware prompts and a range optimization Chain-of-Thought (RO-CoT), and uses category recursive division (CRD) to generate instruction candidate category prompts. Result: Performance improves significantly on multiple OR datasets, e.g., accuracy rises from 47.5% to 93.2% on Adience and from 30.0% to 85.7% on Diabetic Retinopathy. Conclusion: OrderChain is, to the authors' knowledge, the first work to augment MLLMs for OR tasks, and its effectiveness is demonstrated across multiple datasets. Abstract: Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that improves the ordinal understanding ability of MLLMs by specificity and commonality modeling. Specifically, our OrderChain consists of a set of task-aware prompts to facilitate the specificity modeling of diverse OR tasks and a new range optimization Chain-of-Thought (RO-CoT), which learns a common way of thinking about OR tasks by uniformly decomposing them into multiple small-range optimization subtasks. Further, we propose a category recursive division (CRD) method to generate instruction candidate category prompts to support RO-CoT automatic optimization. Comprehensive experiments show that a Large Language and Vision Assistant (LLaVA) model with our OrderChain improves baseline LLaVA significantly on diverse OR datasets, e.g., from 47.5% to 93.2% accuracy on the Adience dataset for age estimation, and from 30.0% to 85.7% accuracy on the Diabetic Retinopathy dataset. Notably, LLaVA with our OrderChain also remarkably outperforms state-of-the-art methods by 27% on accuracy and 0.24 on MAE on the Adience dataset. To the best of our knowledge, our OrderChain is the first work that augments MLLMs for OR tasks, and its effectiveness is demonstrated across a spectrum of OR datasets.
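
The idea of decomposing an ordinal task into small-range subtasks can be sketched as a recursive halving of the ordered label set. This is only an illustration under stated assumptions: the abstract does not specify how CRD divides categories, so the midpoint split, the prompt wording, and the `pick_left` callback (a stand-in for the model's choice at each step) are all hypothetical.

```python
def recursive_division(categories):
    """Enumerate every (left, right) split produced by halving an ordered label list."""
    if len(categories) <= 1:
        return []
    mid = len(categories) // 2
    left, right = categories[:mid], categories[mid:]
    return [(left, right)] + recursive_division(left) + recursive_division(right)

def narrow_down(categories, pick_left):
    """Follow one chain of range decisions until a single category remains."""
    steps = []
    while len(categories) > 1:
        mid = len(categories) // 2
        left, right = categories[:mid], categories[mid:]
        steps.append(f"Is the answer in [{left[0]} .. {left[-1]}] or [{right[0]} .. {right[-1]}]?")
        categories = left if pick_left(left, right) else right
    return categories[0], steps

levels = ["level 0", "level 1", "level 2", "level 3", "level 4"]  # generic ordinal labels
candidate_splits = recursive_division(levels)

target = "level 3"
# An oracle stands in for the model here; a real system would query the MLLM.
prediction, steps = narrow_down(levels, lambda left, right: target in left)
```

Each step asks a coarse range question before a finer one, so a five-way decision becomes at most three two-way decisions in this toy setup.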

FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling

Weiqing Li,Guochao Jiang,Xiangyong Ding,Zhangcheng Tao,Chuzhan Hao,Chenfeng Xu,Yuewei Zhang,Hao Wang

Task: Propose FlowKV, a novel disaggregated inference framework that optimizes KV cache transfer to reduce latency and increase throughput.

Motivation: In existing disaggregated inference frameworks, KV cache transfer latency is high and fixed node roles lead to computational imbalance.

Details

Method: Optimizes KV cache transfer and introduces a load-aware scheduler with flexible PD node allocation. Result: Average KV cache transfer latency drops by 96% and inference is accelerated by 15.2%-48.9%. Conclusion: FlowKV significantly improves the performance and resource utilization of disaggregated inference. Abstract: Disaggregated inference has become an essential framework that separates the prefill (P) and decode (D) stages in large language model inference to improve throughput. However, the KV cache transfer faces significant delays between prefill and decode nodes. The block-wise calling method and discontinuous KV cache memory allocation increase the number of calls to the transmission kernel. Additionally, existing frameworks often fix the roles of P and D nodes, leading to computational imbalances. In this paper, we propose FlowKV, a novel disaggregated inference framework, which reduces the average transmission latency of KV cache by 96%, from 0.944s to 0.053s, almost eliminating the transfer time relative to the total request latency by optimizing the KV cache transfer. FlowKV introduces the Load-Aware Scheduler for balanced request scheduling and flexible PD node allocation. This design maximizes hardware resource utilization, achieving peak system throughput across various scenarios, including normal, computational imbalance, and extreme overload conditions. Experimental results demonstrate that FlowKV significantly accelerates inference by 15.2%-48.9% on the LongBench dataset compared to the baseline and supports applications with heterogeneous GPUs.
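
Why discontinuous allocation inflates the number of transfer calls can be seen with a short sketch. It assumes, for illustration only, that each contiguous run of KV cache blocks can be sent with one call; the block ids, node names, and load values are made up, and `pick_node` is a deliberately simple stand-in for a load-aware scheduling decision rather than FlowKV's scheduler.

```python
def coalesce_blocks(block_ids):
    """Merge block ids into (start, length) runs; each run is assumed to need one transfer call."""
    runs = []
    for b in sorted(block_ids):
        if runs and b == runs[-1][0] + runs[-1][1]:
            # Block directly follows the current run, so extend it.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs

def pick_node(loads):
    """Simplest load-aware choice: send the request to the least-loaded node."""
    return min(loads, key=loads.get)

scattered = [3, 17, 8, 42, 25, 11]       # discontinuous allocation
contiguous = [10, 11, 12, 13, 14, 15]    # contiguous allocation

calls_scattered = len(coalesce_blocks(scattered))
calls_contiguous = len(coalesce_blocks(contiguous))

chosen = pick_node({"node-a": 0.9, "node-b": 0.2, "node-c": 0.5})
```

In this toy case the same six blocks need six calls when scattered and one call when contiguous, which is the kind of per-call overhead the abstract attributes to block-wise transfer.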

DebGCD: Debiased Learning with Distribution Guidance for Generalized Category Discovery

Yuanpei Liu,Kai Han

Task: Address Generalized Category Discovery (GCD), which aims to categorize all images in the unlabelled subset regardless of whether they come from known or unknown classes.

Motivation: Existing GCD methods suffer from label bias and overlook both differences in certainty among unlabelled samples and semantic distribution shifts, which leads to suboptimal learning.

Details

Method: Proposes the DebGCD framework, which co-trains a debiased classifier, introduces a semantic distribution detector, and uses a certainty-based curriculum learning strategy. Result: Achieves state-of-the-art performance on GCD benchmarks. Conclusion: Through debiased learning and distribution guidance, DebGCD significantly improves GCD performance and robustness. Abstract: In this paper, we tackle the problem of Generalized Category Discovery (GCD). Given a dataset containing both labelled and unlabelled images, the objective is to categorize all images in the unlabelled subset, irrespective of whether they are from known or unknown classes. In GCD, an inherent label bias exists between known and unknown classes due to the lack of ground-truth labels for the latter. State-of-the-art methods in GCD leverage parametric classifiers trained through self-distillation with soft labels, leaving the bias issue unattended. Besides, they treat all unlabelled samples uniformly, neglecting variations in certainty levels and resulting in suboptimal learning. Moreover, the explicit identification of semantic distribution shifts between known and unknown classes, a vital aspect for effective GCD, has been neglected. To address these challenges, we introduce DebGCD, a Debiased learning with distribution guidance framework for GCD. Initially, DebGCD co-trains an auxiliary debiased classifier in the same feature space as the GCD classifier, progressively enhancing the GCD features. Moreover, we introduce a semantic distribution detector in a separate feature space to implicitly boost the learning efficacy of GCD. Additionally, we employ a curriculum learning strategy based on semantic distribution certainty to steer the debiased learning at an optimized pace. Thorough evaluations on GCD benchmarks demonstrate the consistent state-of-the-art performance of our framework, highlighting its superiority. Project page: https://visual-ai.github.io/debgcd/

Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?

Grgur Kovač,Jérémy Perez,Rémy Portelas,Peter Ford Dominey,Pierre-Yves Oudeyer

Task: Study how properties of human data affect distribution shift dynamics in iterated training loops.

Motivation: LLM-generated content increasingly appears on the Internet, creating a feedback loop that can cause distribution shift and degrade model quality.

Details

Method: Compares four datasets (two based on Twitter, two on Reddit), tests the effect of data quality on the rate of shift, and further evaluates a range of properties on a Reddit dataset. Result: Data quality affects the rate of shift on the Twitter datasets but not on the Reddit datasets; lexical diversity is associated with larger detrimental shifts and semantic diversity with smaller ones; the evolution of political bias depends on the political lean of the human data. Conclusion: The effects of recursive fine-tuning depend heavily on properties of the human data, so different Internet platforms may undergo different types of shift. Abstract: Large language models (LLMs) are increasingly contributing to the creation of content on the Internet. This creates a feedback loop as subsequent generations of models will be trained on this generated, synthetic data. This phenomenon is receiving increasing interest, in particular because previous studies have shown that it may lead to distribution shift: models misrepresent and forget the true underlying distributions of human data they are expected to approximate (e.g. resulting in a drastic loss of quality). In this work, we study the impact of human data properties on distribution shift dynamics in iterated training loops. We first confirm that the distribution shift dynamics greatly vary depending on the human data by comparing four datasets (two based on Twitter and two on Reddit). We then test whether data quality may influence the rate of this shift. We find that it does on the Twitter datasets, but not on the Reddit datasets. We then focus on a Reddit dataset and conduct a more exhaustive evaluation of a large set of dataset properties. This experiment associated lexical diversity with larger, and semantic diversity with smaller detrimental shifts, suggesting that incorporating text with high lexical (but limited semantic) diversity could exacerbate the degradation of generated text. We then focus on the evolution of political bias, and find that the type of shift observed (bias reduction, amplification or inversion) depends on the political lean of the human (true) distribution. Overall, our work extends the existing literature on the consequences of recursive fine-tuning by showing that this phenomenon is highly dependent on features of the human data on which training occurs. This suggests that different parts of the Internet (e.g. GitHub, Reddit) may undergo different types of shift depending on their properties.

SUEDE: Shared Unified Experts for Physical-Digital Face Attack Detection Enhancement

Zuying Xie,Changtao Miao,Ajian Liu,Jiabao Guo,Feng Li,Dan Guo,Yunfeng Diao

Task: Propose SUEDE, a unified framework for detecting both physical and digital face attacks.

Motivation: Existing work treats physical attacks (e.g., printed photos) and digital attacks (e.g., DeepFake) as independent tasks; without a unified feature space, it is hard to detect both at once.

Details

Method: Combines a shared expert (capturing common features) with routed experts (capturing attack-specific features), using CLIP as the base network. Result: SUEDE outperforms state-of-the-art unified detection methods. Conclusion: By combining shared and routed experts, SUEDE effectively addresses unified detection of physical and digital attacks. Abstract: Face recognition systems are vulnerable to physical attacks (e.g., printed photos) and digital threats (e.g., DeepFake), which are currently being studied as independent visual tasks, such as Face Anti-Spoofing and Forgery Detection. The inherent differences among various attack types present significant challenges in identifying a common feature space, making it difficult to develop a unified framework for detecting data from both attack modalities simultaneously. Inspired by the efficacy of Mixture-of-Experts (MoE) in learning across diverse domains, we explore utilizing multiple experts to learn the distinct features of various attack types. However, the feature distributions of physical and digital attacks overlap and differ. This suggests that relying solely on distinct experts to learn the unique features of each attack type may overlook shared knowledge between them. To address these issues, we propose SUEDE, the Shared Unified Experts for Physical-Digital Face Attack Detection Enhancement. SUEDE combines a shared expert (always activated) to capture common features for both attack types and multiple routed experts (selectively activated) for specific attack types. Further, we integrate CLIP as the base network to ensure the shared expert benefits from prior visual knowledge and align visual-text representations in a unified space. Extensive results demonstrate SUEDE achieves superior performance compared to state-of-the-art unified detection methods.
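
The combination of an always-active shared expert with selectively activated routed experts can be sketched in a few lines. This is a generic Mixture-of-Experts illustration on scalar features, not SUEDE's architecture: the experts are toy functions, the router logits are made up, and renormalizing the gate weights over the selected top-k experts is an assumed design choice.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, shared_expert, routed_experts, router_logits, top_k=1):
    """Shared expert always contributes; only the top-k routed experts are evaluated."""
    gates = softmax(router_logits)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in top)   # renormalize over selected experts (assumption)
    out = shared_expert(x)
    for i in top:
        out += gates[i] / norm * routed_experts[i](x)
    return out, top

def shared(x):
    return 2.0 * x                      # toy stand-in for common features

routed = [
    lambda x: x + 1.0,                  # hypothetical "physical attack" expert
    lambda x: x * 10.0,                 # hypothetical "digital attack" expert
]

output, selected = moe_forward(1.0, shared, routed, router_logits=[0.1, 2.0], top_k=1)
```

With these logits the router selects the second expert, so the output is the shared contribution plus that expert's output; the unselected expert is never called, which is what keeps routed experts cheap.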

Chris Samarinas,Hamed Zamani

Task: Train small language models for reasoning-intensive document ranking by combining knowledge distillation with reinforcement learning optimization.

Motivation: Existing methods rely on expensive human annotations or large black-box language models, whereas this approach uses web data and a teacher LLM to automatically generate high-quality training examples with relevance explanations.

Details

Method: Frames document ranking as a reinforcement learning problem, incentivizes explicit reasoning, and trains a compact 3B-parameter language model. Result: Achieves state-of-the-art performance on the BRIGHT benchmark, ranking third on the leaderboard with far fewer parameters than other approaches and outperforming models over 20 times larger. Conclusion: Generating explanations at inference time, rather than directly predicting relevance scores, lets small language models reason more effectively; the method is scalable and interpretable. Abstract: We present a novel approach for training small language models for reasoning-intensive document ranking that combines knowledge distillation with reinforcement learning optimization. While existing methods often rely on expensive human annotations or large black-box language models, our methodology leverages web data and a teacher LLM to automatically generate high-quality training examples with relevance explanations. By framing document ranking as a reinforcement learning problem and incentivizing explicit reasoning capabilities, we train a compact 3B parameter language model that achieves state-of-the-art performance on the BRIGHT benchmark. Our model ranks third on the leaderboard while using substantially fewer parameters than other approaches, outperforming models that are over 20 times larger. Through extensive experiments, we demonstrate that generating explanations during inference, rather than directly predicting relevance scores, enables more effective reasoning with smaller language models. The self-supervised nature of our method offers a scalable and interpretable solution for modern information retrieval systems.

From Specificity to Generality: Revisiting Generalizable Artifacts in Detecting Face Deepfakes

Long Ma,Zhiyuan Yan,Yize Chen,Jin Xu,Qinglang Guo,Hu Huang,Yong Liao,Hui Lin

Task: Build a universal face deepfake detection framework that is effective for most facial deepfakes.

Motivation: Deepfake generation techniques are advancing rapidly, making detection increasingly important, yet existing methods struggle to cover the forgery artifacts of every generator.

Details

Method: Categorizes deepfake artifacts into two types (FIA and USA) and proposes a data-level pseudo-fake creation framework that simulates USA with super-resolution and creates FIA with an image-level self-blending module. Result: A standard image classifier trained only on the pseudo-fake data generalizes well to unseen deepfakes. Conclusion: Focusing on general forgery artifacts (FIA and USA) makes it possible to build an efficient universal deepfake detection framework. Abstract: Detecting deepfakes has been an increasingly important topic, especially given the rapid development of AI generation techniques. In this paper, we ask: How can we build a universal detection framework that is effective for most facial deepfakes? One significant challenge is the wide variety of deepfake generators available, resulting in varying forgery artifacts (e.g., lighting inconsistency, color mismatch). But should we "teach" the detector to learn all these artifacts separately? It is impossible and impractical to elaborate on them all. So the core idea is to pinpoint the more common and general artifacts across different deepfakes. Accordingly, we categorize deepfake artifacts into two distinct yet complementary types: Face Inconsistency Artifacts (FIA) and Up-Sampling Artifacts (USA). FIA arise from the challenge of generating all intricate details, inevitably causing inconsistencies between the complex facial features and relatively uniform surrounding areas. USA, on the other hand, are the inevitable traces left by the generator's decoder during the up-sampling process. This categorization stems from the observation that all existing deepfakes typically exhibit one or both of these artifacts. To achieve this, we propose a new data-level pseudo-fake creation framework that constructs fake samples with only the FIA and USA, without introducing extra less-general artifacts. Specifically, we employ super-resolution to simulate the USA, and design a Blender module that uses image-level self-blending on diverse facial regions to create the FIA. Surprisingly, we found that, with this intuitive design, a standard image classifier trained only with our pseudo-fake data can non-trivially generalize well to unseen deepfakes.

VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Dahun Kim,AJ Piergiovanni,Ganesh Mallya,Anelia Angelova

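Task: Introduce VideoComp, a benchmark and learning framework for video-text compositionality understanding, aimed at improving vision-language models in fine-grained temporal alignment.

Motivation: Existing benchmarks focus on static image-text compositionality or isolated single-event videos and do not address alignment in continuous multi-event videos.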

Details

Method: Using video-text datasets with temporally localized event captions (e.g., ActivityNet-Captions, YouCook2), the authors construct two compositional benchmarks (ActivityNet-Comp and YouCook2-Comp) and propose a hierarchical pairwise preference loss and a pretraining strategy. Result: Video-text foundation models and large multimodal models are evaluated, revealing their strengths and weaknesses in compositionality. Conclusion: The work provides a comprehensive framework for evaluating and improving fine-grained, temporally coherent video-text alignment. Abstract: We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark targets alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g., ActivityNet-Captions, YouCook2), we construct two compositional benchmarks, ActivityNet-Comp and YouCook2-Comp. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To improve model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences. We evaluate video-text foundational models and large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in compositionality. Overall, our work provides a comprehensive framework for evaluating and enhancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
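The abstract does not give the exact form of the hierarchical pairwise preference loss, so the following is only a sketch of one plausible reading: similarity scores are ordered from the correct caption to the most disrupted negative, and each adjacent pair contributes a logistic (softplus) ranking term. The helper names `reorder_events`, `partial_caption`, and `hierarchical_pairwise_loss`, the `margin` parameter, and the choice of softplus are assumptions for illustration, not the paper's definitions.

```python
import math
import random


def reorder_events(captions, rng):
    """Negative sample: swap two event captions to break temporal order."""
    if len(captions) < 2:
        return list(captions)
    i, j = rng.sample(range(len(captions)), 2)
    out = list(captions)
    out[i], out[j] = out[j], out[i]
    return out


def partial_caption(captions, keep):
    """Negative sample: keep only the first `keep` event captions."""
    return list(captions[:keep])


def hierarchical_pairwise_loss(scores, margin=0.0):
    """Ranking loss over an ordered list of video-text similarity scores.

    scores[0] belongs to the correct caption; scores[1:] belong to
    negatives ordered from least to most disrupted (assumed ordering).
    Each adjacent pair adds softplus(margin - (better - worse)), so the
    loss is small when every score exceeds the next, more disrupted one.
    """
    total = 0.0
    for better, worse in zip(scores, scores[1:]):
        total += math.log1p(math.exp(margin - (better - worse)))
    return total / max(len(scores) - 1, 1)
```

With this formulation, a model that scores the correct caption highest and the most disrupted caption lowest incurs a lower loss than one that ranks them in the opposite order, which matches the "gradually penalizes increasingly disrupted ones" behavior described in the abstract.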

Learning Affine Correspondences by Integrating Geometric Constraints

Pengju Sun,Banglei Guan,Zhenbao Yu,Yang Shang,Qifeng Yu,Daniel Barath

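Task: Propose a new pipeline that extracts accurate affine correspondences by integrating dense matching and geometric constraints.

Motivation: Existing methods for extracting affine correspondences have many performance limitations, so exploring a new paradigm is crucial.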

Details

Method: A new extraction framework is introduced that combines dense matching with a novel keypoint scale and orientation estimator, together with loss functions based on geometric constraints. Result: Experiments show that the method outperforms existing ones in accuracy and robustness on image matching tasks and yields better results in relative pose estimation. Conclusion: The proposed method achieves higher accuracy and robustness in both affine correspondence extraction and pose estimation. Abstract: Affine correspondences have received significant attention due to their benefits in tasks like image matching and pose estimation. Existing methods for extracting affine correspondences still have many limitations in terms of performance; thus, exploring a new paradigm is crucial. In this paper, we present a new pipeline designed for extracting accurate affine correspondences by integrating dense matching and geometric constraints. Specifically, a novel extraction framework is introduced, with the aid of dense matching and a novel keypoint scale and orientation estimator. For this purpose, we propose loss functions based on geometric constraints, which can effectively improve accuracy by supervising neural networks to learn feature geometry. Experiments show that our method outperforms existing ones in accuracy and robustness on image matching tasks. To further demonstrate the effectiveness of the proposed method, we applied it to relative pose estimation. Affine correspondences extracted by our method lead to more accurate poses than the baselines on a range of real-world datasets. The code is available at https://github.com/stilcrad/DenseAffine.
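To make the role of keypoint scale and orientation concrete, the sketch below builds a local 2x2 transformation between two matched keypoints from their scales and orientations alone. This covers only the similarity part (relative scale and rotation) of an affine correspondence; how the paper recovers the remaining degrees of freedom is not stated in the abstract. The function names and argument layout are hypothetical and do not come from the DenseAffine code.

```python
import math


def local_affine_from_scale_orientation(s1, theta1, s2, theta2):
    """Return the 2x2 matrix (s2 / s1) * R(theta2 - theta1).

    s1, theta1: scale and orientation (radians) of the keypoint in image 1.
    s2, theta2: the same quantities for the matched keypoint in image 2.
    """
    s = s2 / s1
    d = theta2 - theta1
    c, sn = math.cos(d), math.sin(d)
    return [[s * c, -s * sn],
            [s * sn, s * c]]


def apply_affine(A, x1, x2, p):
    """Map a point `p` near keypoint `x1` into image 2 as x2 + A (p - x1)."""
    dx, dy = p[0] - x1[0], p[1] - x1[1]
    return (x2[0] + A[0][0] * dx + A[0][1] * dy,
            x2[1] + A[1][0] * dx + A[1][1] * dy)
```

For example, a keypoint that doubles in scale and rotates by 90 degrees between the two images gives a matrix with determinant 4, and a point one pixel to the right of the first keypoint maps to a point two pixels along the second image's vertical axis from the matched keypoint.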

OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs

Wasi Uddin Ahmad,Aleksander Ficek,Mehrzad Samadi,Jocelyn Huang,Vahid Noroozi,Somshubra Majumdar,Boris Ginsburg

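Task: Introduce and evaluate the OpenCodeInstruct dataset for improving the performance of large language models on coding tasks.

Motivation: High-quality, publicly available supervised fine-tuning datasets are scarce, which limits further progress of LLMs on coding tasks.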

Details

Method: The OpenCodeInstruct dataset (5 million samples) is constructed, with each sample containing a programming question, solution, test cases, and related annotations, and several base models are fine-tuned on it. Result: The fine-tuned models show substantial improvements on multiple benchmarks (e.g., HumanEval, MBPP). Conclusion: OpenCodeInstruct fills the gap in high-quality coding datasets and substantially improves LLM performance on coding tasks. Abstract: Large Language Models (LLMs) have transformed software development by enabling code generation, automated debugging, and complex reasoning. However, their continued advancement is constrained by the scarcity of high-quality, publicly available supervised fine-tuning (SFT) datasets tailored for coding tasks. To bridge this gap, we introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset. Comprehensive evaluations on popular benchmarks (HumanEval, MBPP, LiveCodeBench, and BigCodeBench) demonstrate substantial performance improvements achieved by SFT with OpenCodeInstruct. We also present a detailed methodology encompassing seed data curation, synthetic instruction and solution generation, and filtering.

Inland Waterway Object Detection in Multi-environment: Dataset and Approach

Shanshan Wang,Haixiang Xu,Hui Feng,Xiaoqian Wang,Pei Song,Sijie Liu,Jianhua He

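Task: Introduce the Multi-environment Inland Waterway Vessel Dataset (MEIWVD) and develop a scene-guided image enhancement module and a parameter-limited dilated convolution to improve vessel detection in complex environments.

Motivation: Inland waterway datasets are scarce, and existing datasets do not meet the needs of vessel detection in complex environments such as narrow channels, variable weather, and urban interference.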

Details

Method: The MEIWVD dataset is introduced, along with a scene-guided image enhancement module and a parameter-limited dilated convolution, combined with a multi-scale dilated residual fusion technique. Result: MEIWVD provides a more rigorous benchmark for vessel detection, and the proposed methods significantly improve detector performance in complex multi-environment scenarios. Conclusion: MEIWVD and the proposed methods address the challenges of inland waterway vessel detection in complex environments and improve detection performance. Abstract: The success of deep learning in intelligent ship visual perception relies heavily on rich image data. However, dedicated datasets for inland waterway vessels remain scarce, limiting the adaptability of visual perception systems in complex environments. Inland waterways, characterized by narrow channels, variable weather, and urban interference, pose significant challenges to object detection systems based on existing datasets. To address these issues, this paper introduces the Multi-environment Inland Waterway Vessel Dataset (MEIWVD), comprising 32,478 high-quality images from diverse scenarios, including sunny, rainy, foggy, and artificial lighting conditions. MEIWVD covers common vessel types in the Yangtze River Basin, emphasizing diversity, sample independence, environmental complexity, and multi-scale characteristics, making it a robust benchmark for vessel detection. Leveraging MEIWVD, this paper proposes a scene-guided image enhancement module that adaptively enhances water surface images according to environmental conditions. Additionally, a parameter-limited dilated convolution enhances the representation of vessel features, while a multi-scale dilated residual fusion method integrates multi-scale features for better detection. Experiments show that MEIWVD provides a more rigorous benchmark for object detection algorithms, and the proposed methods significantly improve detector performance, especially in complex multi-environment scenarios.
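As background for the dilated convolution and multi-scale fusion mentioned above, the following sketch shows a plain 1D dilated convolution and one simple way to fuse several dilation rates with a residual connection. The abstract does not define "parameter-limited" or the exact fusion scheme, so this illustrates only the generic technique: the function names, the averaging of branches, the shared kernel across dilations, and the 1D setting are assumptions.

```python
def dilated_conv1d(x, kernel, dilation=1):
    """Same-length 1D dilated convolution with zero padding.

    Assumes an odd-length kernel; taps are spaced `dilation` samples apart.
    """
    center = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation
            if 0 <= idx < len(x):  # positions outside the signal count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilation):
    """Span of input samples covered by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def multi_scale_residual_fusion(x, kernel, dilations=(1, 2, 3)):
    """Average the outputs of several dilation rates and add the input back.

    Reusing one kernel for every branch keeps the parameter count fixed
    while the receptive field grows (an illustrative choice, not the paper's).
    """
    branches = [dilated_conv1d(x, kernel, d) for d in dilations]
    fused = [sum(vals) / len(branches) for vals in zip(*branches)]
    return [xi + fi for xi, fi in zip(x, fused)]
```

A 3-tap kernel with dilation 2 covers 5 input samples while still using only 3 weights, which is why dilation is a common way to capture vessels at different scales without adding parameters.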