A dataset of 1,000 synthetic records representing 181 unique people, each with a varying number of duplicate entries. Duplicates have been corrupted with typographical errors, missing values, and other realistic data-quality issues. This is the primary demo dataset from the Python splink library.
Format
A tibble with 1,000 rows and 7 columns:
- unique_id
Integer. Row identifier (0-indexed).
- first_name
Character. Given name, sometimes corrupted or missing.
- surname
Character. Family name, sometimes corrupted or missing.
- dob
Character. Date of birth in YYYY-MM-DD format.
- city
Character. City of residence, sometimes missing.
- email
Character. Email address, sometimes corrupted or missing.
- cluster
Integer. Ground-truth entity label (0-indexed).
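As a rough illustration of working with these column types, the sketch below builds a small stand-in data frame with the same seven columns, converts dob from character to Date, and counts missing values per column. The rows are made up for the example and are not taken from the dataset, and a base data.frame is used in place of a tibble so that the snippet needs no extra packages.

```r
# Toy rows mimicking the documented columns; all values are invented.
toy <- data.frame(
  unique_id  = 0:3,
  first_name = c("Julia", "Julia", NA, "Noah"),
  surname    = c("Taylor", "Tayler", "Taylor", "Watson"),
  dob        = c("2015-10-29", "2015-10-29", "2015-10-29", "2008-03-23"),
  city       = c("London", NA, "London", "Bolton"),
  email      = c("jt@example.com", "jt@example.com", NA, "nw@example.com"),
  cluster    = c(0L, 0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Count missing values in each column.
missing_counts <- colSums(is.na(toy))

# dob is stored as character; convert it to Date for date arithmetic.
toy$dob_date <- as.Date(toy$dob, format = "%Y-%m-%d")
```

The same two steps should carry over to the real data, assuming its dob values all follow the documented YYYY-MM-DD pattern.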
Source
From the splink datasets repository maintained by the UK Ministry of Justice Analytical Services: https://github.com/moj-analytical-services/splink_datasets. Original data generated by the splink team (Linacre et al.) under the MIT license.
Details
The cluster column provides ground-truth entity labels: records sharing
the same cluster value refer to the same person.
The unique_id column provides a unique identifier for each row,
starting at 0 (matching splink convention).
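The relationship between cluster and pairwise ground truth can be sketched in base R as follows. The toy data frame and the column names id_l, id_r, and is_match are assumptions made for this illustration only; they are not part of the dataset or of any splink convention documented here.

```r
# Toy stand-in: five records, three entities (values are invented).
toy <- data.frame(unique_id = 0:4, cluster = c(0L, 0L, 0L, 1L, 2L))

# Number of records per entity.
cluster_sizes <- table(toy$cluster)

# Enumerate every unordered pair of row identifiers.
pairs <- as.data.frame(t(combn(toy$unique_id, 2)))
names(pairs) <- c("id_l", "id_r")

# Look up each record's cluster by unique_id and flag pairs that share one.
cl <- setNames(toy$cluster, toy$unique_id)
pairs$is_match <- unname(
  cl[as.character(pairs$id_l)] == cl[as.character(pairs$id_r)]
)

# Pairs that refer to the same person.
true_pairs <- pairs[pairs$is_match, ]

# Cross-check: a cluster of size n contributes n * (n - 1) / 2 true pairs.
expected_true_pairs <- sum(choose(as.integer(cluster_sizes), 2))
```

Enumerating all pairs grows quadratically with the number of rows (1,000 rows give 499,500 pairs), which is still manageable at this size but would need a different approach for much larger data.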
See also
fake_1000_labels for pairwise clerical labels.
