Motivation: Large language models perform well in general domains, but their performance in specialized fields such as law is limited by the lack of domain-specific pretraining and by the difficulty of processing long, complex legal texts. Method: Zero-shot experiments on three Indian legal judgment prediction datasets analyze model behavior by reorganizing documents according to rhetorical roles, defining legal terms, and emulating the courts' step-by-step reasoning. Result: Organizing the data or explaining key legal terms significantly improved model performance, with F1 gains over the baseline ranging from about 1.5% to 4.36%. Conclusion: Introducing structured information and explanations of domain terminology effectively strengthens LLMs' long-text processing and reasoning on legal tasks. Abstract: Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.
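As a rough illustration of experiments (i) and (ii) in the abstract above, the following Python sketch groups labeled sentences by rhetorical role and optionally attaches a definition to each section before asking for a prediction. The role names, their definitions, the question wording, and the functions `reorganize` and `build_prompt` are all hypothetical; the paper's actual role inventory and prompts are not reproduced here.

```python
# Hypothetical role inventory with one-line definitions (assumptions, not from the paper).
ROLE_DEFINITIONS = {
    "FACTS": "Events that led to the case being filed.",
    "ARGUMENTS": "Claims made by the petitioner and the respondent.",
    "REASONING": "The court's analysis of the arguments.",
}
ROLE_ORDER = list(ROLE_DEFINITIONS)


def reorganize(sentences):
    """Group (role, text) pairs by role, keeping document order within each role."""
    grouped = {role: [] for role in ROLE_ORDER}
    for role, text in sentences:
        if role in grouped:  # sentences with unknown roles are skipped
            grouped[role].append(text)
    return {role: texts for role, texts in grouped.items() if texts}


def build_prompt(sentences, with_definitions=True):
    """Build a zero-shot prompt with one section per rhetorical role."""
    sections = []
    for role, texts in reorganize(sentences).items():
        header = role
        if with_definitions:
            header += f" ({ROLE_DEFINITIONS[role]})"
        sections.append(header + "\n" + " ".join(texts))
    # Hypothetical task question.
    sections.append("Question: Will the appeal be accepted or rejected?")
    return "\n\n".join(sections)
```

Calling `build_prompt(doc, with_definitions=False)` gives the reorganization-only variant, so the two interventions can be compared separately against an unstructured baseline.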

[58] Intriguing Properties of Dynamic Sampling Networks

Dario Morle, Reid Zaffino

Main category: cs.CV

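TL;DR: This paper proposes a new operator called "warping" that unifies the various dynamic sampling methods in deep learning, provides a statistical analysis of it, reveals an asymmetry between the forward and backward passes, shows that it differs fundamentally from traditional convolution operators, and examines the conditions for stable training of dynamic sampling networks as well as discretization effects.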

Details

Motivation: Existing dynamic sampling mechanisms perform well in many vision models but lack a unified framework for theoretical analysis, so a general form is needed to connect and understand these methods. Method: The "warping" operator is proposed as a general form of dynamic sampling; it is analyzed statistically by modeling the inputs as IID variables and as homogeneous random fields, and a loss landscape visualization based on gradient updates is introduced. Result: The operator reconstructs architectures such as deformable convolutions, active convolutional units, and spatial transformer networks; a distinctive asymmetry between the forward and backward passes is found; dynamic sampling operators are shown to form a class of operators orthogonal to traditional convolutions; the conditions for stable training are identified, and the statistical effects of discretization are analyzed. Conclusion: Warping provides a unified theoretical framework for dynamic sampling, reveals how its mechanism differs fundamentally from traditional convolution, and offers theoretical grounding and practical guidance for designing more stable and efficient dynamic networks. Abstract: Dynamic sampling mechanisms in deep learning architectures have demonstrated utility across many computer vision models, though the theoretical analysis of these structures has not yet been unified. In this paper we connect the various dynamic sampling methods by developing and analyzing a novel operator which generalizes existing methods, which we term "warping". Warping provides a minimal implementation of dynamic sampling which is amenable to analysis, and can be used to reconstruct existing architectures including deformable convolutions, active convolutional units, and spatial transformer networks. Using our formalism, we provide statistical analysis of the operator by modeling the inputs as both IID variables and homogeneous random fields. Extending this analysis, we discover a unique asymmetry between the forward and backward pass of the model training. We demonstrate that these mechanisms represent an entirely different class of orthogonal operators to the traditional translationally invariant operators defined by convolutions. With a combination of theoretical analysis and empirical investigation, we find the conditions necessary to ensure stable training of dynamic sampling networks. In addition, the statistical effects of discretization are studied. Finally, we introduce a novel loss landscape visualization which utilizes gradient update information directly, to better understand learning behavior.
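To make the idea of dynamic sampling concrete, here is a minimal one-dimensional Python sketch in which each output position reads the input at a data-dependent, possibly fractional location. It is an assumption about what a simple warping-style operator could look like, not the paper's definition: the function name `warp_1d`, the use of linear interpolation, and the clamping at the borders are all illustrative choices.

```python
import math


def warp_1d(signal, offsets):
    """Sample `signal` at positions i + offsets[i] with linear interpolation.

    Sampling positions are clamped to the valid index range (an assumed
    border rule; other choices such as zero padding are equally plausible).
    """
    n = len(signal)
    out = []
    for i, off in enumerate(offsets):
        pos = min(max(i + off, 0.0), n - 1)   # clamp to [0, n - 1]
        lo = math.floor(pos)                  # left neighbour
        hi = min(lo + 1, n - 1)               # right neighbour
        frac = pos - lo                       # interpolation weight
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out
```

With all offsets set to zero the operator is the identity, and rounding the offsets to integers before the call gives a discretized variant of the same sampling.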

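[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

Kriti Ghosh, Devjyoti Chakraborty, Lakshmish Ramaswamy, Suchendra M. Bhandarkar, In Kee Kim, Nancy O'Hare, Deepak Mishra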

Main category: cs.CV

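TL;DR: Proposes Δ-NeRF, a modular residual framework for incremental NeRF refinement that keeps improving the model without access to past data, suited to sequential-observation settings such as satellite remote sensing.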

Details

Motivation: Existing NeRFs must be retrained to incorporate new views, which makes them a poor fit for settings where data arrives as a stream (such as repeated satellite observations) and leaves them prone to catastrophic forgetting. Method: A residual controller injects per-layer corrections into a frozen base NeRF; an uncertainty-aware gating mechanism adaptively combines base and corrected predictions; a view selection strategy reduces the training data; knowledge distillation compresses the model. Result: On satellite imagery, performance is comparable to joint training with 30-42% less training time; PSNR improves by up to 43.5% over fine-tuning, and some metrics exceed joint training; the enhanced model can be compressed to 20% of its original size. Conclusion: Δ-NeRF enables efficient, continual incremental NeRF updates, addresses catastrophic forgetting, substantially lowers computational cost, and suits practical deployment in applications such as time-series remote sensing analysis. Abstract: Neural Radiance Fields (NeRFs) have demonstrated remarkable capabilities in 3D reconstruction and novel view synthesis. However, most existing NeRF frameworks require complete retraining when new views are introduced incrementally, limiting their applicability in domains where data arrives sequentially. This limitation is particularly problematic in satellite-based terrain analysis, where regions are repeatedly observed over time. Incremental refinement of NeRFs remains underexplored, and naive approaches suffer from catastrophic forgetting when past data is unavailable. We propose $Δ$-NeRF, a unique modular residual framework for incremental NeRF refinement. $Δ$-NeRF introduces several novel techniques including: (1) a residual controller that injects per-layer corrections into a frozen base NeRF, enabling refinement without access to past data; (2) an uncertainty-aware gating mechanism that prevents overcorrection by adaptively combining base and refined predictions; and (3) a view selection strategy that reduces training data by up to 47% while maintaining performance. Additionally, we employ knowledge distillation to compress the enhanced model into a compact student network (20% of original size). Experiments on satellite imagery demonstrate that $Δ$-NeRF achieves performance comparable to joint training while reducing training time by 30-42%. $Δ$-NeRF consistently outperforms existing baselines, achieving an improvement of up to 43.5% in PSNR over naive fine-tuning and surpassing joint training on some metrics.
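The gating idea in the abstract above can be pictured with a very small Python sketch: a frozen base prediction receives a learned correction that is scaled down when uncertainty is high. The exponential gate, the scalar inputs, and the name `gated_refinement` are assumptions made purely for illustration; the abstract does not specify how the actual gate is computed.

```python
import math


def gated_refinement(base, residual, uncertainty):
    """Blend a frozen base prediction with a learned correction.

    gate = exp(-uncertainty) is a hypothetical choice: it equals 1 when the
    correction is fully trusted and decays toward 0 as uncertainty grows,
    which limits overcorrection.
    """
    gate = math.exp(-uncertainty)
    return base + gate * residual
```

Because the base prediction is never modified, a gate near zero simply falls back to the original model, which is one way to read the claim that refinement proceeds without access to past data.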

[60] Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara, Yujia Chen, Ming-Hsuan Yang, James M. Rehg, Wen-Sheng Chu, Du Tran

Main category: cs.CV

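TL;DR: Proposes the Split-then-Merge (StM) framework, which improves control and data efficiency in generative video composition by decomposing unlabeled videos and recomposing them.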

Details

Motivation: To address generative video composition's reliance on annotated data or handcrafted rules, along with its data scarcity problem. Method: A large corpus of unlabeled videos is split into dynamic foreground and background layers that are then self-composed; a transformation-aware training pipeline, multi-layer fusion and augmentation, and an identity-preservation loss provide controllable composition and foreground fidelity. Result: StM outperforms existing state-of-the-art methods on quantitative benchmarks and in human- and VLLM-based qualitative evaluations. Conclusion: StM effectively learns complex compositional dynamics and markedly improves the quality and controllability of generated video. Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes a multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and in human- and VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io
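The merge half of the split-then-merge idea can be sketched with plain alpha compositing, shown below in Python. This is only a stand-in under stated assumptions: frames are small 2D lists of grayscale values, the foreground masks are assumed to be given, and `composite` and `self_compose` are hypothetical names. The paper's learned fusion, augmentation, and identity-preservation loss are not modeled.

```python
def composite(foreground, background, mask):
    """Alpha-blend one frame: mask 1.0 keeps foreground, 0.0 keeps background."""
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]


def self_compose(fg_frames, bg_frames, masks):
    """Paste a foreground layer onto a background layer, frame by frame."""
    return [composite(f, b, m) for f, b, m in zip(fg_frames, bg_frames, masks)]
```

Pairing a foreground layer from one video with a background layer from another yields new composed clips without manual annotation, which is the data-efficiency argument the abstract makes.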

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam, Saksham Aggarwal, Justin Yang Chae, Nidhi Rastogi

Main category: cs.CV

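TL;DR: Sphinx is a synthetic environment for visual perception and reasoning that generates verifiable puzzles spanning many task types; evaluation shows that current large models perform poorly, while reinforcement learning with verifiable rewards substantially improves performance.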

Details

Motivation: Advancing visual perception and multimodal reasoning requires a benchmark environment that supports precise evaluation and covers core cognitive abilities. Method: The Sphinx environment procedurally generates puzzles of 25 types from elements such as motifs, tiles, and charts, and reinforcement learning with verifiable rewards (RLVR) is used to improve model performance. Result: Even the state-of-the-art GPT-5 reaches only 51.1% accuracy on the benchmark, well below human performance, while RLVR substantially improves model performance on Sphinx and on external visual reasoning tasks. Conclusion: Sphinx provides a scalable, verifiable evaluation platform for visual reasoning, and RLVR is a promising way to improve the reasoning ability of multimodal models. Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
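A toy version of a procedurally generated puzzle with a verifiable answer, in the spirit of the symmetry-detection task mentioned above, might look like the following Python sketch. The grid encoding, the generator `make_symmetry_puzzle`, and the binary `reward` are assumptions for illustration and are not taken from the Sphinx code base.

```python
import random


def make_symmetry_puzzle(seed, size=4):
    """Return (grid, truth): a binary grid and whether it is mirror symmetric.

    `size` is assumed to be even so that each row splits into two halves.
    """
    rng = random.Random(seed)
    half = [[rng.randint(0, 1) for _ in range(size // 2)] for _ in range(size)]
    symmetric = rng.random() < 0.5
    grid = [row + row[::-1] for row in half]  # mirror each row left to right
    if not symmetric:
        # Flip one cell in the left half to break the mirror symmetry.
        r, c = rng.randrange(size), rng.randrange(size // 2)
        grid[r][c] ^= 1
    return grid, symmetric


def is_mirror_symmetric(grid):
    """Independent checker used to verify the generated ground truth."""
    return all(row == row[::-1] for row in grid)


def reward(answer, truth):
    """Verifiable reward: 1.0 for a correct answer, 0.0 otherwise."""
    return 1.0 if answer == truth else 0.0
```

Because the label is produced by the generator and can be rechecked programmatically, the same function serves both as an evaluation metric and as a reward signal for RLVR-style training.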

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell'Erba, Andrew D. Bagdanov

Main category: cs.CV

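TL;DR: This paper proposes OVI, a training-free and data-free optimization method that replaces the expensive diffusion prior network in text-to-image generation, and introduces two new constraints that improve generation quality; experiments show it is competitive with existing state-of-the-art priors.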

Details

Motivation: Existing text-to-image diffusion models rely on prior networks that are computationally expensive and need large amounts of training data; this paper asks whether such trained priors can be avoided entirely. Method: Optimization-based Visual Inversion (OVI) initializes a latent representation from random pseudo-tokens and iteratively optimizes it to maximize cosine similarity with the text embedding; a Mahalanobis-distance loss and a nearest-neighbor loss are added as regularization constraints. Result: Experiments on Kandinsky 2.2 show that OVI can replace a traditional prior; the analysis finds a flaw in current evaluation benchmarks (such as T2I-CompBench++), where using only the text embedding as the prior also scores highly; the proposed constraints, the nearest-neighbor one in particular, perform well in visual fidelity and on quantitative metrics. Conclusion: OVI offers a promising training-free alternative to a learned prior, exposes problems with current evaluation standards, and indicates that the direction merits further study. Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
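The core loop described above, optimizing a randomly initialized latent to maximize cosine similarity with a text embedding, can be sketched in plain Python as follows. This is a simplified illustration under several assumptions: embeddings are short lists of floats, the update is vanilla gradient ascent, and the step count, learning rate, and function names are hypothetical. The Mahalanobis and nearest-neighbor regularizers are omitted.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def cosine_grad(v, t):
    """Gradient of cosine(v, t) with respect to v."""
    nv, nt, c = norm(v), norm(t), cosine(v, t)
    return [(ti / nt - c * vi / nv) / nv for vi, ti in zip(v, t)]


def invert(text_emb, steps=2000, lr=0.5, seed=0):
    """Gradient ascent on cosine similarity from a random starting latent."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in text_emb]  # random initialization
    for _ in range(steps):
        grad = cosine_grad(latent, text_emb)
        latent = [v + lr * g for v, g in zip(latent, grad)]  # ascent step
    return latent
```

Maximizing cosine similarity alone only aligns the latent's direction with the text embedding, which is consistent with the abstract's observation that extra constraints are needed to pull the result toward the distribution of realistic images.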

Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang

Main category: cs.CV

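TL;DR: Canvas-to-Image is a unified framework that consolidates multiple control signals into a single canvas interface, enabling high-fidelity, multimodal control of image generation.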

Details

Motivation: Existing diffusion models struggle to keep generated images faithful and compositional when they must handle text prompts, subject references, spatial layouts, and other controls at the same time. Method: Multiple control signals are encoded into a single composite canvas image, and a Multi-Task Canvas Training strategy jointly optimizes the model's understanding and integration of heterogeneous controls within a unified paradigm. Result: Experiments on multi-task datasets show that the method significantly outperforms existing state-of-the-art methods in identity preservation and control adherence, and applies to complex scenarios such as multi-person composition, pose control, layout constraints, and multi-control generation. Conclusion: Canvas-to-Image models multiple control signals in a unified way, improving the generation fidelity and generalization of diffusion models under complex user intent. Abstract: While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.

Table of Contents

cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

[3] A centroid based framework for text classification in itsm environments

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

[5] Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data

[6] MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

[7] Semantics Meet Signals: Dual Codebook Representationl Learning for Generative Recommendation

[8] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

[9] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

[10] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

[11] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

[12] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

[13] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

[14] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

[15] Length-MAX Tokenizer for Language Models

[16] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

[17] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

[18] Emergence and Localisation of Semantic Role Circuits in LLMs

[19] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

[20] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

[21] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

[22] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

[23] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

[24] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

[25] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

[26] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

[27] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

[28] Can Finetuing LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

[29] Developing an Open Conversational Speech Corpus for the Isan Language

[30] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

[31] Emergent Lexical Semantics in Neural Language Models: Testing Martin's Law on LLM-Generated Text

[32] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

[33] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

[34] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

[35] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

[36] A Systematic Study of Model Merging Techniques in Large Language Models

[37] Hierarchical Ranking Neural Network for Long Document Readability Assessment

[38] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

[39] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

[40] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

[41] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

[42] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

[43] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

[44] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

[45] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

[46] Revisiting Generalization Across Difficulty Levels: It's Not So Easy

cs.CV

[47] Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

[48] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

[49] Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?

[50] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

[51] Foundry: Distilling 3D Foundation Models for the Edge

[52] DinoLizer: Learning from the Best for Generative Inpainting Localization

[53] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

[54] Text-Guided Semantic Image Encoder

[55] One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues

[56] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

[57] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models

[58] Intriguing Properties of Dynamic Sampling Networks

[59] $Δ$-NeRF: Incremental Refinement of Neural Radiance Fields through Residual Control and Knowledge Transfer

[60] Layer-Aware Video Composition via Split-then-Merge

[61] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

[62] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

[63] RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerline Graphs

[64] MODEST: Multi-Optics Depth-of-Field Stereo Dataset

[65] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

[66] Estimating Fog Parameters from a Sequence of Stereo Images

[67] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

[68] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation

[69] GaINeR: Geometry-Aware Implicit Network Representation

[70] A deep learning model to reduce agent dose for contrast-enhanced MRI of the cerebellopontine angle cistern

[71] Smooth regularization for efficient video recognition

[72] Open Vocabulary Compositional Explanations for Neuron Alignment

[73] UruDendro4: A Benchmark Dataset for Automatic Tree-Ring Detection in Cross-Section Images of Pinus taeda L

[74] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model

[75] Beyond Realism: Learning the Art of Expressive Composition with StickerNet

[76] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

[77] Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI