MMDOD - version1

Description

The Multi-Modal and Multi-View Object Detection Dataset (MMDOD) is a comprehensive benchmark designed to advance research in detection-driven image fusion under strong modality-view dependencies. MMDOD contains over 10,000 high-resolution images of transparent glass containers captured under four complementary imaging modalities (visible light, near-infrared (NIR), low-contrast, and polarization shift) across six distinct viewpoints. Each image is annotated with detailed object-level bounding boxes and class labels, enabling rigorous evaluation of multi-modal and multi-view fusion methods for object detection tasks. The dataset addresses a critical gap in existing benchmarks by providing synchronized multi-modal multi-view observations of transparent materials, which exhibit complex view-dependent optical effects such as specular reflections, refraction, and low contrast. MMDOD is particularly suited for evaluating end-to-end detection-driven fusion architectures, task-driven learning strategies, and cross-sensor alignment mechanisms in challenging industrial inspection scenarios.

Download instructions

The MMDOD dataset is distributed as a compressed .zip archive (approximately 4.8 GB) and is organized in a hierarchical directory structure to facilitate data access and processing. The archive contains two main directories, train and test, following an 80% / 20% split. Within each directory, data are further organized by scene; each scene consists of 27 images and their corresponding 27 annotation files.
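As a quick sanity check after extraction, a short script along the following lines can confirm that every scene holds the expected 27 image/annotation pairs (a minimal Python sketch; the MMDOD root path, the scene directory layout, and the .txt annotation extension are assumptions, not part of the official specification):

    import os

    # Placeholder for wherever the archive was extracted (assumption).
    ROOT = "MMDOD"

    for split in ("train", "test"):
        split_dir = os.path.join(ROOT, split)
        for scene in sorted(os.listdir(split_dir)):
            scene_dir = os.path.join(split_dir, scene)
            if not os.path.isdir(scene_dir):
                continue
            files = os.listdir(scene_dir)
            # Assumes annotation files are plain .txt stored alongside the images.
            images = [f for f in files if f.endswith(".jpg")]
            labels = [f for f in files if f.endswith(".txt")]
            # Each scene should contain 27 images and 27 annotation files.
            assert len(images) == 27 and len(labels) == 27, (
                f"{split}/{scene}: {len(images)} images, {len(labels)} annotations"
            )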

All images are provided in JPEG format and follow a standardized naming convention to ensure consistency across scenes and annotations. The image file naming format is:

timestamp_modality_view_camera.jpg
For example: 20250721112718414_26_10_E.jpg

The modality codes are defined as follows:

  • 26: Visible and Infrared (views 1–6 correspond to Near-Infrared, views 7–12 to Visible)

  • 28: Low-Contrast

  • 66: Polarization Shift (Stress)

Camera viewpoints are indicated by a single character:

  • C: Front-facing camera

  • E: Top-down camera
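Putting the convention together, a file name can be decoded along these lines (a minimal Python sketch; the parse_filename helper is hypothetical, and the NIR/Visible disambiguation for code 26 follows the view ranges listed above):

    MODALITIES = {"26": "Visible/NIR", "28": "Low-Contrast", "66": "Polarization Shift"}
    CAMERAS = {"C": "Front-facing", "E": "Top-down"}

    def parse_filename(name: str) -> dict:
        """Decode timestamp_modality_view_camera.jpg into its components."""
        stem = name.rsplit(".", 1)[0]
        timestamp, modality, view, camera = stem.split("_")
        label = MODALITIES[modality]
        # For code 26, views 1-6 are Near-Infrared and views 7-12 are Visible.
        if modality == "26":
            label = "Near-Infrared" if int(view) <= 6 else "Visible"
        return {
            "timestamp": timestamp,
            "modality": label,
            "view": int(view),
            "camera": CAMERAS[camera],
        }

    print(parse_filename("20250721112718414_26_10_E.jpg"))
    # {'timestamp': '20250721112718414', 'modality': 'Visible', 'view': 10, 'camera': 'Top-down'}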

Ground-truth annotations are provided as text files, with each line representing a labeled object using the format:
class;x_center;y_center;width;height
where the bounding box is defined by its center coordinates and dimensions. For example:
Sticker;357.6754;382.9195;43.797;28.2856
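Since most detection frameworks expect corner coordinates, the center-based format can be converted as in the following sketch (assuming coordinates are expressed in pixels, which the example magnitudes suggest; the helper name is hypothetical):

    def parse_annotation_line(line: str):
        """Parse 'class;x_center;y_center;width;height' into a label and an
        (x_min, y_min, x_max, y_max) corner box."""
        cls, xc, yc, w, h = line.strip().split(";")
        xc, yc, w, h = map(float, (xc, yc, w, h))
        return cls, (xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2)

    label, box = parse_annotation_line("Sticker;357.6754;382.9195;43.797;28.2856")
    print(label, box)  # Sticker (335.7769, 368.7767, 379.5739, 397.0623)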

Download from
Licence
Publication date
03/02/2026
Author(s)
Gwendal Bernardi, Godefroy Brisebarre, Sébastien Roman, Mohsen Ardabilian, Emmanuel Dellandrea
Version
version1
Package
Dataset size
4.88 GB