Version
version1

The Multi-View Evaluation Protocol for Glass Container Inspection (MVEP) is a specialized benchmark dataset designed to evaluate multi-view fusion methods for industrial quality control of transparent materials. MVEP addresses the critical challenge of automated defect detection and severity assessment on glass container surfaces, where view-dependent optical phenomena (including specular reflections, refractions, and transparency effects) significantly limit the reliability of single-view inspection. The dataset comprises 16,000 synchronized multi-view images of glass containers captured from six calibrated viewpoints under controlled industrial lighting conditions. Each container is annotated with object-level bounding boxes and ordinal severity labels for surface degradation defects, ranging from minimal visual alteration to critical damage requiring rejection. The defect taxonomy focuses on erasure-type surface defects, chosen to reflect realistic industrial degradation patterns. MVEP provides a realistic distribution of quality grades encountered in production environments and includes inherent annotation uncertainty due to the subjective nature of visual quality assessment. The dataset is particularly well suited for evaluating ordinal classification methods, multi-view fusion strategies, cross-view consistency constraints, and robustness to annotation noise in industrial inspection scenarios involving transparent materials.
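Given the per-container structure (six synchronized views, object-level boxes, one ordinal severity grade), a loading routine is straightforward. A minimal sketch, assuming a hypothetical layout in which each container directory holds view_0.png through view_5.png and an annotations.json file (the names and fields are illustrative, not the published layout):

    import json
    from pathlib import Path
    from PIL import Image

    def load_container(sample_dir: Path):
        """Load the six synchronized views and annotations for one container."""
        views = [Image.open(sample_dir / f"view_{i}.png") for i in range(6)]
        with open(sample_dir / "annotations.json") as f:
            ann = json.load(f)  # assumed fields: {"boxes": [...], "severity": int}
        return views, ann["boxes"], ann["severity"]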

Publication date
04/02/2026
Author(s)
Gwendal Bernardi, Godefroy Brisebarre, Sébastien Roman, Mohsen Ardabilian, Emmanuel Dellandrea
Total size and number of files
12.9 GB
Version
version1

The Multi-Modal and Multi-View Object Detection Dataset (MMDOD) is a comprehensive benchmark designed to advance research in detection-driven image fusion under strong modality-view dependencies. MMDOD contains over 10,000 high-resolution images of transparent glass containers captured under four complementary imaging modalities (visible light, near-infrared (NIR), low-contrast, and polarization shift) across six distinct viewpoints. Each image is annotated with detailed object-level bounding boxes and class labels, enabling rigorous evaluation of multi-modal and multi-view fusion methods for object detection tasks. The dataset addresses a critical gap in existing benchmarks by providing synchronized multi-modal multi-view observations of transparent materials, which exhibit complex view-dependent optical effects such as specular reflections, refraction, and low contrast. MMDOD is particularly suited for evaluating end-to-end detection-driven fusion architectures, task-driven learning strategies, and cross-sensor alignment mechanisms in challenging industrial inspection scenarios.
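Since each sample is a modality-by-view grid (4 modalities × 6 viewpoints), it is convenient to index images by (modality, view) pairs. A minimal sketch with hypothetical directory and file naming (the actual layout may differ):

    from itertools import product
    from pathlib import Path

    MODALITIES = ["visible", "nir", "low_contrast", "polarization"]  # assumed folder names
    VIEWS = range(6)

    def sample_paths(root: Path, container_id: str):
        """Map each (modality, view) pair to its hypothetical image path."""
        return {(m, v): root / container_id / m / f"view_{v}.png"
                for m, v in product(MODALITIES, VIEWS)}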

Publication date
03/02/2026
Author(s)
Gwendal Bernardi, Godefroy Brisebarre, Sébastien Roman, Mohsen Ardabilian, Emmanuel Dellandrea
Total size and number of files
4.88 GB
Version
version2

Database instances generated and polluted using the Perfect Pet open-source software (https://github.com/mathildemarcy/perfect_pet).

These databases contain instances of the Perfect Pet database of different sizes, polluted with various artificial unicity factors.

  • file perfect_pet_5000.sql includes schemas clean_db and polluted_db_100 (polluted with a factor of 100%).
  • file perfect_pet_10000.sql includes schemas clean_db, polluted_db_au_25, polluted_db_au_50, polluted_db_au_75, and polluted_db_au_100 (polluted with factors of 25%, 50%, 75%, and 100%).
  • file perfect_pet_25000.sql includes schemas clean_db and polluted_db_100 (polluted with a factor of 100%).
  • file perfect_pet_50000.sql includes schemas clean_db and polluted_db_100 (polluted with a factor of 100%).
  • file perfect_pet_100000.sql includes schemas clean_db, polluted_db_au_25, polluted_db_au_50, polluted_db_au_75, and polluted_db_au_100 (polluted with factors of 25%, 50%, 75%, and 100%).
  • file perfect_pet_250000.sql includes schemas clean_db, polluted_db_au_25, polluted_db_au_50, polluted_db_au_75, and polluted_db_au_100 (polluted with factors of 25%, 50%, 75%, and 100%).

The clean schemas contain 13 relations: animal, animal_owner, animal_weight, appointment, appointment_service, appointment_slot, doctor, doctor_historization, microchip, microchip_code, owner, service, slot.

The polluted schemas contain 9 relations: animal, appointment, appointment_slot, doctor, microchip, microchip_code, owner, service, slot.
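A minimal sketch of how one might compare a clean schema against a polluted one after restoring a dump, assuming the dump was loaded into a local PostgreSQL database (the connection parameters and the psycopg2 dependency are assumptions, not part of the dataset):

    import psycopg2

    # Assumes `psql -d perfect_pet -f perfect_pet_10000.sql` restored the
    # dump into a local database named perfect_pet.
    conn = psycopg2.connect(dbname="perfect_pet")
    with conn.cursor() as cur:
        for schema in ("clean_db", "polluted_db_au_100"):
            # Pollution injects duplicate rows, so the row counts of the
            # clean and polluted schemas should differ.
            cur.execute(f"SELECT count(*) FROM {schema}.animal")
            print(schema, cur.fetchone()[0])
    conn.close()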

More information on these databases and their generation and pollution is available at https://github.com/mathildemarcy/perfect_pet.

Publication date
18/06/2025
Author(s)
Mathilde MARCY, Jean-Marc PETIT
Total size and number of files
9 GB (6 files)
Version
version1

Our current research project focuses on cleaning, for analytical use, relational databases implemented with surrogate keys and surrogate foreign keys but without natural keys. Despite their numerous benefits, surrogate keys can not only induce data quality issues within a database but also act as a major obstacle to any standard data cleaning technique, due to the artificial unicity they carry and propagate. We developed RED2Hunt, a framework dedicated to cleaning such databases, described at https://arxiv.org/abs/2503.20593.

Because of their private nature, none of the operational databases our team members worked on could be made available to the research/academic community. We therefore generated Perfect Pet, a synthetic relational database, 1) to facilitate the dissemination of our work on artificial unicity (the phenomenon, commonly found in operational databases, whose resolution motivated this research project), 2) to test and demonstrate RED2Hunt and enable the reproducibility of our experiments and results, and 3) to make it available to the community for their own use.
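To make artificial unicity concrete: the same real-world entity can appear under several surrogate keys, so every row is unique by key even though the data is redundant. A minimal illustrative sketch in Python, with hypothetical rows and column names (not Perfect Pet's actual schema):

    import pandas as pd

    # Hypothetical owner rows: two surrogate ids for the same real-world person.
    owner = pd.DataFrame([
        {"id": 17, "first_name": "Ana", "last_name": "Diaz", "phone": "555-0101"},
        {"id": 42, "first_name": "Ana", "last_name": "Diaz", "phone": "555-0101"},
    ])

    # Every row is unique on the surrogate key...
    assert owner["id"].is_unique
    # ...yet ignoring the key reveals the duplicate group (artificial unicity).
    duplicates = owner.duplicated(subset=["first_name", "last_name", "phone"], keep=False)
    print(owner[duplicates])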

For these purposes, the database had to satisfy the following requirements:

  • Resemble real operational data: mimic, in nature, structure, and distributions, data that would be collected in a real operational setting.
  • Have a very simple, straightforward schema that is easily understandable by any audience.
  • Include the following data quality issues, as observed in real operational databases:
    • High levels of artificial unicity (above 60% in each relation suffering from the phenomenon), due to:
      • exclusive reliance on surrogate keys, with no natural key encoded,
      • schema denormalization,
      • non-respect of the cardinalities observed in real life (separation between entities' description and their historization).
    • Data inconsistencies within duplicate groups, due to:
      • denormalization,
      • data entry errors,
      • evolution of data collection practices.
    • Format inconsistencies, due to a modification of a data type after the deployment of the data collection system and the start of data collection.

Perfect Pet data should suffer from the following data quality issues:

  • redundancy,
  • inconsistencies,
  • inaccuracy,
  • incompleteness.

The database was designed as a relational database supporting the appointment management application of the fictitious veterinary clinic Perfect Pet. It includes information describing the pets visiting the clinic, their owners (the clinic’s clients), the medical appointments, and the doctors working at the clinic.

Although no team member had previously worked on a data research project in the animal health and welfare sector, this topic was selected for two reasons: 1) it guarantees the anonymity of the original operational databases by preventing any possible connection with the synthetic data, and 2) it simplifies the generation process by leveraging one team member's domain knowledge of, and access to, an operational database in the animal welfare sector.

We thank The Jordanian Society for Animal Protection (JSAP) for allowing us to use a part of their operational database as a starting point for generating our synthetic Perfect Pet database; their database does not suffer from any of the data quality issues mentioned above.

Publication date
27/03/2025
Author(s)
Mathilde MARCY, Jean-Marc PETIT
Total size and number of files
0.4 MB (2 files)
Version
version1

This dataset includes 1,000 meshes for each of 100 categories of fruits and vegetables. A sub-sample is currently available, and the full dataset will be released in the coming weeks.

Publication date
20/01/2025
Author(s)
Guillaume Duret, Liming Chen, et al.
Total size and number of files
~200 GB (10K meshes)
Version
version1

Set of deteriorated versions of the publicly available non-commercial IMDB database, comprising different amounts of duplicates.

The datasets were extracted from PostgreSQL databases including the relations titles, name_basics, title_episode, title_ratings, and title_principals, built from the IMDB database version downloaded from https://datasets.imdbws.com/ on April 7th, 2024. These databases were deliberately deteriorated in order to experiment with the RED2Hunt method, which generates a redundancy-free database from any relational operational database comprising surrogate keys and duplicates.

Publication date
03/06/2024
Author(s)
Mathilde MARCY, Jean-Marc PETIT
Total size and number of files
70 GB (7 dump files)
Version
version1

A set of computer-generated cave and tunnel systems

  • available at various resolutions,
  • as watertight triangulations and/or point clouds,
  • in the PLY and 3DTiles file formats.
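A minimal sketch for inspecting one of the PLY meshes, assuming the trimesh package and a hypothetical file name:

    import trimesh

    # Hypothetical file name; substitute one of the dataset's PLY files.
    mesh = trimesh.load("cave_highres.ply")
    print(mesh.vertices.shape, mesh.faces.shape)
    print("watertight:", mesh.is_watertight)  # triangulations are advertised as watertight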
Publication date
16/04/2024
Author(s)
TeaTime and LIRIS (VCity team)
Total size and number of files
3 GB
Version
v1

This dataset was generated from three construction models transferred from Autodesk Revit to NVIDIA Isaac Sim. It contains 8,751 samples of RGB images with associated semantic segmentation masks and label files for 17 classes (rectangular_sheath, circular_sheath, pipe, air_vent, fan_coil, stair, wall, floor, pipe_accessory, framework, radiant_panel, climate_engineering_equipment, ceiling, handrail, roof, cable_tray, pole).
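A minimal sketch for pairing an RGB image with its segmentation mask, assuming hypothetical file names (the dataset's actual naming may differ):

    import numpy as np
    from PIL import Image

    # Hypothetical file names for one of the 8,751 samples.
    rgb = Image.open("sample_0001_rgb.png")
    mask = np.array(Image.open("sample_0001_seg.png"))
    print(rgb.size)         # image resolution
    print(np.unique(mask))  # class ids present in this mask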

Publication date
20/12/2023
Author(s)
Mathis Baubriaud
Total size and number of files
14.8 GB
Version
version2

Backends declared in the pyproject.toml files

This dataset contains CSV and SQLite files with data about project build backends, extracted from "Metadata about every file uploaded to PyPI":

  • extract-pyproject-all-versions.csv, extract-pyproject-all-versions.db: for projects having a pyproject.toml file and uploaded after 2018, get the project_name, the max project_version, the max uploaded_on, the list of distinct project_version, the list of distinct uploaded_on, the list of distinct path, ...
  • extract-pyproject-latest.csv, extract-pyproject-latest.db: for each project found in extract-pyproject-all-versions, get the data of the latest uploaded_on date (1)
  • pyproject_backends.csv, pyproject_backends.db: the build backend found in extract-pyproject-latest.db for each project, considering only the pyproject.toml file at the root of the project (2)

Source code for the data extraction.

(1) Some projects (e.g. poetry) have several pyproject.toml files, often in test folders.
(2) The test is quite basic, but few projects have several pyproject.toml files matching it.
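The SQLite files above can be explored directly; a minimal sketch that lists the tables in one of them (only the built-in sqlite_master catalog is assumed, since the exact schema is not described here):

    import sqlite3

    con = sqlite3.connect("pyproject_backends.db")
    # sqlite_master always exists; use it to discover the actual table names.
    for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
        print(name)
    con.close()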

Further analysis of the PyPI metadata

After the publication of the first charts, I wanted to know how many projects had no source package and how many had no pyproject.toml, to complete the first statistics.

This dataset contains CSV and SQLite files extracted from the same source (parquet files from "Metadata about every file uploaded to PyPI"):

  • extract-project-releases-2018-and-later.csv, extract-project-releases-2018-and-later.db: extract the metadata of the projects uploaded since 2018: get the project_name, project_version, project_release, release type (source or wheel), ...

These files weigh 1.1 GB and 1.3 GB respectively.

Source code for the second data extraction.

Publication date
30/12/2023
Author(s)
Françoise CONIL
Total size and number of files
2.8 GB (8 files)
Version
version1

This dataset holds

  • a 3D lasergrammetric survey of the so-called "Creux des Elaphes" cave, authored by EDYTEM / USMB / CNRS (in the LAS laser point cloud format),
  • its conversion to the 3DTiles format.
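A minimal sketch for reading the point cloud, assuming the laspy package with a LAZ backend installed (e.g. lazrs) and a hypothetical file name:

    import laspy

    # Hypothetical file name; substitute the dataset's LAZ/LAS file.
    las = laspy.read("creux_des_elaphes.laz")
    print(len(las.points))           # roughly 94.5 million RGB points
    print(las.x.min(), las.x.max())  # X extent of the scan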
Publication date
24/11/2023
Author(s)
EDYTEM / USMB / CNRS and LIRIS (VCity team)
Total size and number of files
781 MB for the original LAZ file (94,465,067 RGB points)