MMDOD - version1

Description

The Multi-Modal and Multi-View Object Detection Dataset (MMDOD) is a comprehensive benchmark designed to advance research in detection-driven image fusion under strong modality-view dependencies. MMDOD contains over 10,000 high-resolution images of transparent glass containers captured under four complementary imaging modalities (visible light, near-infrared (NIR), low-contrast, and polarization shift) across six distinct viewpoints. Each image is annotated with detailed object-level bounding boxes and class labels, enabling rigorous evaluation of multi-modal and multi-view fusion methods for object detection tasks. The dataset addresses a critical gap in existing benchmarks by providing synchronized multi-modal multi-view observations of transparent materials, which exhibit complex view-dependent optical effects such as specular reflections, refraction, and low contrast. MMDOD is particularly suited for evaluating end-to-end detection-driven fusion architectures, task-driven learning strategies, and cross-sensor alignment mechanisms in challenging industrial inspection scenarios.

Download instructions

The MMDOD dataset is distributed as a compressed .zip archive (approximately 4.8 GB) and is organized in a hierarchical directory structure to facilitate data access and processing. The archive contains two main directories, train and test, following an 80% / 20% split. Within each directory, data are further organized by scene; each scene consists of 27 images and their corresponding 27 annotation files.
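As a quick sanity check after extraction, a short script along the following lines can confirm that every scene holds the expected 27 image/annotation pairs (a minimal Python sketch; the MMDOD root path, the scene directory layout, and the .txt annotation extension are assumptions, not part of the official specification):

    import os

    # Placeholder for wherever the archive was extracted (assumption).
    ROOT = "MMDOD"

    for split in ("train", "test"):
        split_dir = os.path.join(ROOT, split)
        for scene in sorted(os.listdir(split_dir)):
            scene_dir = os.path.join(split_dir, scene)
            if not os.path.isdir(scene_dir):
                continue
            files = os.listdir(scene_dir)
            # Assumes annotation files are plain .txt stored alongside the images.
            images = [f for f in files if f.endswith(".jpg")]
            labels = [f for f in files if f.endswith(".txt")]
            # Each scene should contain 27 images and 27 annotation files.
            assert len(images) == 27 and len(labels) == 27, (
                f"{split}/{scene}: {len(images)} images, {len(labels)} annotations"
            )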

All images are provided in JPEG format and follow a standardized naming convention to ensure consistency across scenes and annotations. The image file naming format is:

timestamp_modality_view_camera.jpg
For example: 20250721112718414_26_10_E.jpg

The modality codes are defined as follows:

  • 26: Visible and Infrared (views 1–6 correspond to Near-Infrared, views 7–12 to Visible)

  • 28: Low-Contrast

  • 66: Polarization Shift (Stress)

Camera viewpoints are indicated by a single character:

  • C: Front-facing camera

  • E: Top-down camera
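Putting the convention together, a file name can be decoded along these lines (a minimal Python sketch; the parse_filename helper is hypothetical, and the NIR/Visible disambiguation for code 26 follows the view ranges listed above):

    MODALITIES = {"26": "Visible/NIR", "28": "Low-Contrast", "66": "Polarization Shift"}
    CAMERAS = {"C": "Front-facing", "E": "Top-down"}

    def parse_filename(name: str) -> dict:
        """Decode timestamp_modality_view_camera.jpg into its components."""
        stem = name.rsplit(".", 1)[0]
        timestamp, modality, view, camera = stem.split("_")
        label = MODALITIES[modality]
        # For code 26, views 1-6 are Near-Infrared and views 7-12 are Visible.
        if modality == "26":
            label = "Near-Infrared" if int(view) <= 6 else "Visible"
        return {
            "timestamp": timestamp,
            "modality": label,
            "view": int(view),
            "camera": CAMERAS[camera],
        }

    print(parse_filename("20250721112718414_26_10_E.jpg"))
    # {'timestamp': '20250721112718414', 'modality': 'Visible', 'view': 10, 'camera': 'Top-down'}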

Ground-truth annotations are provided as text files, with each line representing a labeled object using the format:
class;x_center;y_center;width;height
where the bounding box is defined by its center coordinates and dimensions. For example:
Sticker;357.6754;382.9195;43.797;28.2856
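Since most detection frameworks expect corner coordinates, the center-based format can be converted as in the following sketch (assuming coordinates are expressed in pixels, which the example magnitudes suggest; the helper name is hypothetical):

    def parse_annotation_line(line: str):
        """Parse 'class;x_center;y_center;width;height' into a label and an
        (x_min, y_min, x_max, y_max) corner box."""
        cls, xc, yc, w, h = line.strip().split(";")
        xc, yc, w, h = map(float, (xc, yc, w, h))
        return cls, (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)

    label, box = parse_annotation_line("Sticker;357.6754;382.9195;43.797;28.2856")
    print(label, box)  # Sticker (335.7769, 368.7767, 379.5739, 397.0623)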

Download from
Licence
Publication date
03/02/2026
Author(s)
Gwendal Bernardi, Godefroy Brisebarre, Sébastien Roman, Mohsen Ardabilian, Emmanuel Dellandrea
Version
version1
Package
Dataset size
4.88 GB