Keywords:
Multimodal object detection
Cross-modal
Feature refinement
Abstract:
Traditional convolutional feature fusion methods (e.g., CNNs) are limited by local receptive fields, which makes it difficult to capture long-range relationships between modalities and leaves them sensitive to image misalignment. While Transformers possess global modeling capability, stacking them directly causes computational complexity and parameter counts to surge. ICAFusion partially addresses these issues through an iterative cross-modal attention mechanism, but it still has two limitations: 1) the Cross-modal Feature Enhancement (CFE) module lacks dynamic weight adjustment, making it poorly adaptive to quality differences between modalities; 2) the Iterative Cross-modal Feature Enhancement (ICFE) module has limited capability for local feature optimization and fine-grained refinement. To address these shortcomings, this paper proposes an improved multimodal feature fusion framework. In the CFE module, a dynamic gating mechanism and an attention masking strategy are introduced to adaptively balance the contribution of each modality's features and filter out irrelevant information. In the ICFE module, a Fine-grained Feature Refinement Module (FRFM) is incorporated, which combines local convolution, linear transformation, and gating mechanisms to refine features, enhancing modality complementarity and feature representation. Experimental results show that the improved model significantly increases object detection accuracy and robustness on the KAIST and FLIR datasets; on FLIR, the high-threshold metrics mAP75 and mAP50-95 improve by 2.7% and 2.4%, respectively.
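To make the gating-and-masking idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation) of how a dynamic gate could blend two modality features while an attention mask suppresses invalid positions. The function and parameter names (`gated_fusion`, `w_gate`, `mask`) are illustrative assumptions; a real CFE module would operate on learned tensors inside a network rather than Python lists.

```python
import math

def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(feat_rgb, feat_ir, w_gate, mask=None):
    """Illustrative sketch of dynamic gated fusion with attention masking.

    feat_rgb, feat_ir : per-position features from the two modalities
    w_gate            : (w_a, w_b) hypothetical learned gate weights
    mask              : optional booleans; False positions are filtered out,
                        mimicking an attention-masking strategy
    """
    fused = []
    for i, (a, b) in enumerate(zip(feat_rgb, feat_ir)):
        if mask is not None and not mask[i]:
            fused.append(0.0)  # masked position contributes nothing
            continue
        # Gate is computed from both modalities, so the balance adapts
        # to their relative quality at each position.
        g = sigmoid(w_gate[0] * a + w_gate[1] * b)
        fused.append(g * a + (1.0 - g) * b)  # convex combination
    return fused
```

The key property the sketch shows is that the mixing ratio `g` is data-dependent rather than fixed, which is what "dynamic weight adjustment" buys over a static fusion rule.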