A Multimodal Fusion Architecture and Dataset: Advancing Camera and Geophone Integration for Smarter Infrastructure

MATTHEW Y. TAKARA, KATHERINE A. FLANIGAN

Abstract


Computer vision has become integral to the operation and management of complex systems, improving functionality by reducing the reliance on physical contact sensors. Multimodal data fusion techniques further enhance these systems by integrating diverse data sources to uncover complex patterns in the environment. While RGB images are often fused with other imagery types (e.g., LiDAR, infrared) to enhance system performance under challenging conditions such as poor weather or occlusions, civil infrastructure applications present distinct challenges. These challenges stem from limited robustness under non-ideal conditions, privacy concerns, and resource constraints—factors that demand the integration of heterogeneous data from both vision-based and non-vision-based sensors. Low-dimensional data, such as time-series data collected in situ from physical sensors, can provide valuable information without compromising privacy or overburdening computational infrastructure. However, such data generally contain less information, making it more difficult to train robust models and underscoring the need for a deeper understanding of the tradeoffs among information value, scalability, and the level of privacy offered by different data types. In this paper, we develop a time-synchronized dataset that includes RGB images and geophone vibration data collected in situ. We then evaluate baseline models, comparing the vehicle classification performance of unimodal models to that of a spatial-attention fusion model on a noisy urban road. Our preliminary fusion model improves classification accuracy over the image- or vibration-based models alone, laying the foundation for broader integration of diverse vision and non-vision modalities.
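The abstract does not detail the fusion architecture, but the general idea of spatial-attention fusion between an image feature map and a time-series embedding can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: all shapes, the dot-product attention scoring, and the random placeholder weights are assumptions chosen only to show the mechanism of weighting spatial cells of the RGB features by their relevance to the geophone embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a CNN feature map from the RGB branch and a
# 1-D embedding from the geophone (vibration) branch.
H, W, C = 8, 8, 16        # spatial grid and channels of the image features
D = 16                    # vibration embedding size (chosen to match C)
n_classes = 3             # e.g., car / truck / bus (illustrative labels)

img_feat = rng.standard_normal((H, W, C))   # stand-in for a CNN output
vib_emb = rng.standard_normal(D)            # stand-in for a 1-D encoder output

def spatial_attention_fusion(img_feat, vib_emb):
    """Weight each spatial cell of the image feature map by its
    similarity to the vibration embedding, then pool to a fused vector."""
    # Attention logits: dot product of every (h, w) cell with vib_emb.
    logits = img_feat @ vib_emb                      # shape (H, W)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                         # softmax over all H*W cells
    # Attention-weighted sum of spatial cells -> (C,) fused descriptor.
    fused = (img_feat * weights[..., None]).sum(axis=(0, 1))
    return fused, weights

fused, attn = spatial_attention_fusion(img_feat, vib_emb)

# A linear classifier head on the fused descriptor (random placeholder
# weights here; in a real model these are learned end to end).
W_cls = rng.standard_normal((C, n_classes))
pred = int(np.argmax(fused @ W_cls))
```

In a trained system, both branch encoders and the classifier head would be optimized jointly, so the attention map learns to focus on image regions consistent with the vibration signature of the passing vehicle.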


DOI
10.12783/shm2025/37371
