The Multi-View Evaluation Protocol for Glass Container Inspection (MVEP) is a specialized benchmark dataset designed to evaluate multi-view fusion methods for industrial quality control of transparent materials. MVEP addresses the critical challenge of automated defect detection and severity assessment on glass container surfaces, where view-dependent optical phenomena (including specular reflections, refractions, and transparency effects) significantly limit the reliability of single-view inspection. The dataset comprises 16,000 synchronized multi-view images of glass containers captured from six calibrated viewpoints under controlled industrial lighting conditions. Each container is annotated with object-level bounding boxes and ordinal severity labels corresponding to surface degradation defects, ranging from minimal visual alteration to critical damage requiring rejection. The defect taxonomy focuses on erasure-type surface defects, chosen to reflect realistic industrial degradation patterns. MVEP provides a realistic distribution of quality grades encountered in production environments and includes inherent annotation uncertainty due to the subjective nature of visual quality assessment. The dataset is particularly well suited for evaluating ordinal classification methods, multi-view fusion strategies, cross-view consistency constraints, and robustness to annotation noise in industrial inspection scenarios involving transparent materials.
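Because the severity labels are ordinal rather than nominal, evaluation should penalize predictions by how far they fall from the annotated grade. The following is a minimal sketch of such an evaluation; the label range and array values are illustrative and not part of MVEP's official tooling.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical predicted and annotated severity grades (0 = minimal, 3 = critical).
# Real MVEP label values may differ; adapt to the released annotation format.
y_true = np.array([0, 1, 2, 3, 1, 2])
y_pred = np.array([0, 1, 1, 3, 2, 2])

# Mean absolute error respects the ordinal structure: predicting grade 1 for a
# grade-3 defect is penalized more than predicting grade 2.
mae = np.abs(y_true - y_pred).mean()

# Quadratic-weighted kappa is a common agreement measure for ordinal labels
# affected by annotation noise.
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")

print(f"MAE: {mae:.3f}, quadratic-weighted kappa: {qwk:.3f}")
```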
The Multi-Modal and Multi-View Object Detection Dataset (MMDOD) is a comprehensive benchmark designed to advance research in detection-driven image fusion under strong modality-view dependencies. MMDOD contains over 10,000 high-resolution images of transparent glass containers captured under four complementary imaging modalities (visible light, near-infrared (NIR), low-contrast, and polarization shift) across six distinct viewpoints. Each image is annotated with detailed object-level bounding boxes and class labels, enabling rigorous evaluation of multi-modal and multi-view fusion methods for object detection tasks. The dataset addresses a critical gap in existing benchmarks by providing synchronized multi-modal multi-view observations of transparent materials, which exhibit complex view-dependent optical effects such as specular reflections, refraction, and low contrast. MMDOD is particularly suited for evaluating end-to-end detection-driven fusion architectures, task-driven learning strategies, and cross-sensor alignment mechanisms in challenging industrial inspection scenarios.
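A synchronized MMDOD observation can be thought of as a grid of images indexed by modality and viewpoint, with per-view box annotations for the same container. The sketch below only illustrates that structure; the field names, file layout, and directory scheme are assumptions, not the dataset's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

MODALITIES = ["visible", "nir", "low_contrast", "polarization_shift"]
VIEWS = [f"view_{i}" for i in range(6)]

@dataclass
class BoxAnnotation:
    label: str                                # object class name
    bbox: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels

@dataclass
class MultiModalMultiViewSample:
    container_id: str
    # image paths indexed by (modality, view); 4 x 6 = 24 synchronized images per container
    images: Dict[Tuple[str, str], str] = field(default_factory=dict)
    # per-view annotations (box coordinates are view-dependent)
    annotations: Dict[str, List[BoxAnnotation]] = field(default_factory=dict)

# A fusion model could consume all modalities of one view, or all views of one
# modality, depending on the fusion strategy under evaluation.
sample = MultiModalMultiViewSample(container_id="container_0001")
for m in MODALITIES:
    for v in VIEWS:
        sample.images[(m, v)] = f"images/{m}/{v}/container_0001.png"  # hypothetical layout
```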
Database instances generated and polluted using the Perfect Pet open-source software (https://github.com/mathildemarcy/perfect_pet).
These databases contain instances of the Perfect Pet database of different sizes, polluted with various artificial unicity factors.
The clean schemas contain 13 relations: animal, animal_owner, animal_weight, appointment, appointment_service, appointment_slot, doctor, doctor_historization, microchip, microchip_code, owner, service, slot.
The polluted schemas contain 9 relations: animal, appointment, appointment_slot, doctor, microchip, microchip_code, owner, service, slot.
More information on these databases and their generation and pollution is available at https://github.com/mathildemarcy/perfect_pet.
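Assuming a generated instance has been loaded into PostgreSQL (the exact loading procedure is documented in the repository), the difference between the clean and polluted schemas can be checked directly from the catalog. A minimal sketch with a placeholder connection string:

```python
import psycopg2  # assumes the instance has been restored into PostgreSQL

EXPECTED_CLEAN = {
    "animal", "animal_owner", "animal_weight", "appointment", "appointment_service",
    "appointment_slot", "doctor", "doctor_historization", "microchip",
    "microchip_code", "owner", "service", "slot",
}
EXPECTED_POLLUTED = {
    "animal", "appointment", "appointment_slot", "doctor", "microchip",
    "microchip_code", "owner", "service", "slot",
}

def list_relations(dsn: str) -> set:
    """Return the names of the base tables in the public schema."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = 'public' AND table_type = 'BASE TABLE'"
        )
        return {row[0] for row in cur.fetchall()}

# Placeholder DSN; adjust to wherever the instance was loaded.
relations = list_relations("dbname=perfect_pet_polluted user=postgres")
print("Missing relations:", EXPECTED_POLLUTED - relations)
print("Unexpected relations:", relations - EXPECTED_POLLUTED)
```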
Our current research project focuses on cleaning, for analytical use, relational databases implemented with surrogate keys and surrogate foreign keys but without natural keys. Despite their numerous benefits, surrogate keys can not only introduce data quality issues into a database but also act as a major obstacle to any regular data cleaning technique, due to the artificial unicity they carry and propagate. We developed RED2Hunt, a framework dedicated to cleaning such databases, described at https://arxiv.org/abs/2503.20593.
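To make the notion of artificial unicity concrete: a surrogate primary key makes every row technically unique, so the same real-world entity can be inserted several times under different key values without violating any constraint. A minimal, hypothetical illustration on an owner-like relation (the column names and values are illustrative, not the actual Perfect Pet schema):

```python
import sqlite3

# Hypothetical relation with a surrogate key and no natural key: the same
# person appears twice under different id values, and the PRIMARY KEY
# constraint cannot catch it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE owner (
        id INTEGER PRIMARY KEY,      -- surrogate key: artificial unicity
        first_name TEXT,
        last_name TEXT,
        phone TEXT
    );
    INSERT INTO owner VALUES (1, 'Lina', 'Haddad', '0791234567');
    INSERT INTO owner VALUES (2, 'Lina', 'Haddad', '0791234567');  -- duplicate entity
    INSERT INTO owner VALUES (3, 'Omar', 'Nasser', '0787654321');
""")

# Grouping on the descriptive attributes (ignoring the surrogate key) exposes
# the redundancy that a cleaning approach such as RED2Hunt has to resolve.
dups = conn.execute("""
    SELECT first_name, last_name, phone, COUNT(*) AS n
    FROM owner
    GROUP BY first_name, last_name, phone
    HAVING COUNT(*) > 1
""").fetchall()
print(dups)  # [('Lina', 'Haddad', '0791234567', 2)]
```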
Because of their private nature, none of the operational databases our team members worked on could be made available to the research/academic community. We therefore generated Perfect Pet, a synthetic relational database, 1) to facilitate the diffusion of our work on artificial unicity (the phenomenon, commonly found in operational databases, whose resolution motivated this research project), 2) to test and demonstrate RED2Hunt and enable the reproducibility of our experiments and results, and 3) to make it available to the community for their own use.
For these purposes, the database had to satisfy the following requirements:
Perfect Pet data should suffer from the following data quality issues:
The database was designed as a relational database supporting the appointment management application of the fictitious veterinary clinic Perfect Pet. It includes information describing the pets visiting the clinic, their owners (the clinic’s clients), the medical appointments, and the doctors working at the clinic.
Although team members had never worked on a data research project related to the animal health and welfare sectors, this topic was selected for the synthetic data for two reasons: 1) it guarantees the anonymity of the original operational databases by preventing any possible connection with the synthetic data, and 2) it simplifies the generation process by leveraging one team member's domain knowledge of, and access to, an operational database in the animal welfare sector.
We thank The Jordanian Society for Animal Protection (JSAP) for allowing us to use a part of their operational database as a starting point to generate our synthetic Perfect Pet database, although it does not suffer from any of the data quality issues mentioned above.
This dataset includes 1,000 meshes for each of 100 categories of fruits and vegetables. A sub-sample is currently available, and the full dataset will be released in the coming weeks.
Set of deteriorated versions of the publicly available non-commercial IMDB database, comprising different amounts of duplicates.
The datasets were extracted from PostgreSQL databases including the relations titles, name_basics, title_episode, title_ratings, and title_principals, built from the IMDB database available at https://datasets.imdbws.com/ (version downloaded on April 7th, 2024). These databases were deteriorated on purpose to experiment with the RED2Hunt method, which generates a redundancy-free database from any relational operational database comprising surrogate keys and duplicates.
A set of computer-generated cave and tunnel systems
This dataset was generated from 3 construction models transferred from Autodesk Revit to NVIDIA Isaac Sim. It contains 8,751 RGB images, each associated with a semantic segmentation mask and a label file, covering 17 classes (rectangular_sheath, circular_sheath, pipe, air_vent, fan_coil, stair, wall, floor, pipe_accessory, framework, radiant_panel, climate_engineering_equipment, ceiling, handrail, roof, cable_tray, pole).
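A minimal sketch of pairing one RGB image with its segmentation mask; the directory layout, file naming, and the mapping from mask pixel values to class indices are assumptions to be adapted to the released label files.

```python
import numpy as np
from PIL import Image

# Assumed class order; the released label files define the authoritative mapping.
CLASSES = [
    "rectangular_sheath", "circular_sheath", "pipe", "air_vent", "fan_coil",
    "stair", "wall", "floor", "pipe_accessory", "framework", "radiant_panel",
    "climate_engineering_equipment", "ceiling", "handrail", "roof",
    "cable_tray", "pole",
]

def load_sample(image_path: str, mask_path: str):
    """Load one RGB image and its per-pixel class-index mask (hypothetical layout)."""
    rgb = np.asarray(Image.open(image_path).convert("RGB"))
    mask = np.asarray(Image.open(mask_path))  # assumed to store class indices
    return rgb, mask

rgb, mask = load_sample("rgb/sample_0000.png", "masks/sample_0000.png")
present = [CLASSES[i] for i in np.unique(mask) if i < len(CLASSES)]
print("Classes present in this sample:", present)
```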
This dataset contains CSV and SQLite files with data about project backends, extracted from "Metadata about every file uploaded to PyPI":
Source code for the data extraction.
(1) Some projects (e.g. poetry) have several pyproject.toml files, often in test folders
(2) The test is quite basic, but only a few projects have several pyproject.toml files matching this test
After publishing the first charts, I wanted to complete the first statistics by finding out how many projects had no source package and how many had no pyproject.toml.
This dataset contains CSV and SQLite files extracted from the same source (parquet files from "Metadata about every file uploaded to PyPI"):
These files weigh 1.1 GB and 1.3 GB, respectively.
Source code for the second data extraction.
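As a rough sketch of how the extracted SQLite file could be used to answer such questions (e.g. counting projects without a pyproject.toml); the table and column names below are placeholders, since the actual schema is defined by the extraction source code linked above.

```python
import sqlite3

# Placeholder database name and schema: adapt the file, table, and column names
# to the actual SQLite file produced by the extraction.
conn = sqlite3.connect("pypi_extraction.sqlite")

total = conn.execute("SELECT COUNT(DISTINCT project) FROM files").fetchone()[0]
with_pyproject = conn.execute("""
    SELECT COUNT(DISTINCT project)
    FROM files
    WHERE filename LIKE '%pyproject.toml'
""").fetchone()[0]

print(f"{total - with_pyproject} of {total} projects have no pyproject.toml")
```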
This dataset holds