Perfect Pet: synthetic relational database polluted by artificial unicity - version1

Description

Our current research project focuses on cleaning, for analytical use, relational databases implemented with surrogate keys and surrogate foreign keys but without natural keys. Despite their numerous benefits, surrogate keys can not only introduce data quality issues into a database but also act as a major obstacle to regular data cleaning techniques, because of the artificial unicity they carry and propagate. We developed RED2Hunt, a framework dedicated to cleaning such databases, described in https://arxiv.org/abs/2503.20593.

Because of their private nature, none of the operational databases our team members worked on could be made available to the research and academic community. We therefore generated Perfect Pet, a synthetic relational database, 1) to facilitate the dissemination of our work on artificial unicity (a phenomenon commonly found in operational databases, whose resolution motivated this research project), 2) to test and demonstrate RED2Hunt and enable the reproducibility of our experiments and results, and 3) to make it available to the community for their own use.

For these purposes, the database had to satisfy the following requirements:

- Resemble real operational data: mimic data that would be collected in a real operational setting in its nature, structure, and distributions.
- Have a very simple, straightforward schema that any audience can easily understand.
- Include the following data quality issues, as observed in real operational databases:
  - High levels of artificial unicity (above 60% in each relation suffering from the phenomenon), due to:
    - exclusive reliance on surrogate keys, with no natural key encoded,
    - schema denormalization,
    - non-respect of the cardinalities observed in real life (separation between entities' descriptions and their historization).
  - Data inconsistencies within duplicate groups, due to:
    - denormalization,
    - data entry errors,
    - evolution of data collection practices.
  - Format inconsistencies, due to a modification of data types after the data collection system was deployed and data collection had started.
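
To make the notion of artificial unicity more concrete, here is a minimal Python sketch using an in-memory SQLite table. Everything in it is an assumption made for illustration: the table `owner`, its columns, the sample rows, and the helper `artificial_unicity_ratio` are hypothetical and do not reflect the actual Perfect Pet schema. The measure itself (the share of rows that stop being distinct once the surrogate key is ignored) is only one plausible way to quantify the phenomenon and is not necessarily the definition used by RED2Hunt.

```python
import sqlite3

# Hypothetical toy relation, NOT the actual Perfect Pet schema.
# owner_id is a surrogate key; no natural key is declared.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE owner (owner_id INTEGER PRIMARY KEY, full_name TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO owner VALUES (?, ?, ?)",
    [
        (1, "Ana Diaz", "555-0101"),
        (2, "Ana Diaz", "555-0101"),  # same owner, new surrogate key
        (3, "Ana Diaz", "555-0101"),
        (4, "Ben Roy", "555-0102"),
        (5, "Ben Roy", "555-0102"),
    ],
)


def artificial_unicity_ratio(conn, table, surrogate_key):
    """Assumed measure: share of rows that are unique only through the surrogate key."""
    # Collect every column except the surrogate key (row[1] is the column name).
    cols = [
        row[1]
        for row in conn.execute(f"PRAGMA table_info({table})")
        if row[1] != surrogate_key
    ]
    col_list = ", ".join(cols)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    # Count the tuples that remain distinct once the surrogate key is dropped.
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {col_list} FROM {table})"
    ).fetchone()[0]
    return (total - distinct) / total if total else 0.0


ratio = artificial_unicity_ratio(conn, "owner", "owner_id")
print(f"{ratio:.0%} of owner rows are redundant once owner_id is ignored")
```

In this toy table, five rows describe only two distinct owners, so three of the five rows are redundant. The sketch only detects exact duplicates; since the requirements above also call for inconsistencies within duplicate groups, exact matching of this kind would likely underestimate redundancy on the real data.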

Perfect Pet data should suffer from the following data quality issues:

- Redundancy,
- Inconsistencies,
- Inaccuracy,
- Incompleteness.

The database was designed as a relational database supporting the appointment management application of the fictitious veterinary clinic Perfect Pet. It includes information describing the pets visiting the clinic, their owners (the clinic’s clients), the medical appointments, and the doctors working at the clinic.

Although team members never worked on a data research project related to the animal health and welfare sectors, this topic was selected to generate the synthetic data for two reasons: 1) to guarantee the anonymity of the original operational databases by preventing any possible connection with the synthetic data, and 2) to simplify the generation process by leveraging one team member's domain knowledge of, and access to, an operational database in the animal welfare sector.

We thank The Jordanian Society for Animal Protection (JSAP) for allowing us to use a part of their operational database as a starting point to generate our synthetic Perfect Pet database. Note that the JSAP database itself does not suffer from any of the data quality issues mentioned above.

Download instructions

File perfect_pet_db_data_model.sql contains the data model, which creates the tables in a schema called “perfect_pet”.

File perfect_pet_db_data.sql contains the data to be inserted in the tables.

Download from

FTP, HTTPS

Licence

Creative Commons

Publication date

27/03/2025

Author(s)

Mathilde MARCY, Jean-Marc PETIT

Version