A dataset of 1,000 synthetic records representing 181 unique people, each with a varying number of duplicate records. The duplicates have been corrupted with typographical errors, missing values, and other realistic data-quality issues. This is the primary demo dataset from the Python splink library.

Usage

fake_1000

Format

A tibble with 1,000 rows and 7 columns:

unique_id

Integer. Row identifier (0-indexed).

first_name

Character. Given name, sometimes corrupted or missing.

surname

Character. Family name, sometimes corrupted or missing.

dob

Character. Date of birth in YYYY-MM-DD format.

city

Character. City of residence, sometimes missing.

email

Character. Email address, sometimes corrupted or missing.

cluster

Integer. Ground-truth entity label (0-indexed).
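
Because dob is stored as character and several columns can contain missing values, a first inspection step might parse the dates and count the gaps per column. The sketch below is illustrative only: it uses a small invented data frame (toy) with the same seven columns rather than fake_1000 itself, and it assumes missing values are encoded as NA, which this page does not state explicitly.

```r
# Toy stand-in with the same columns as fake_1000 (all values are invented).
toy <- data.frame(
  unique_id  = 0:3,
  first_name = c("Julia", NA, "Julia", "Noah"),
  surname    = c("Taylor", "Taylor", "Tayler", "Watson"),
  dob        = c("2015-10-29", "2015-10-29", "2015-10-29", "2008-03-23"),
  city       = c("London", NA, "London", "Bolton"),
  email      = c("jt@example.com", NA, "jt@example.com", "nw@example.com"),
  cluster    = c(0L, 0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Parse the character dob column into Date values.
# Strings that do not match YYYY-MM-DD become NA instead of raising an error.
toy$dob_date <- as.Date(toy$dob, format = "%Y-%m-%d")

# Count missing values in each column (assumes missing entries are NA).
missing_per_column <- colSums(is.na(toy))
```

With the package attached, the same two steps should apply to fake_1000 directly by replacing toy with fake_1000.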

Source

From the splink datasets repository maintained by the UK Ministry of Justice Analytical Services: https://github.com/moj-analytical-services/splink_datasets. Original data generated by the splink team (Linacre et al.) under the MIT license.

Details

The cluster column provides ground-truth entity labels: records sharing the same cluster value refer to the same person. The unique_id column provides a unique identifier for each row, starting at 0 (matching splink convention).
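
As a rough illustration of how the cluster labels can be used, the sketch below counts the distinct entities and derives the set of true matching record pairs. It runs on a small invented data frame (toy) with only the unique_id and cluster columns; the names toy, true_pairs, and the _l/_r suffixes are choices made for this example and are not part of the dataset.

```r
# Toy stand-in: six records belonging to three entities (values are invented).
toy <- data.frame(
  unique_id = 0:5,
  cluster   = c(0L, 0L, 0L, 1L, 1L, 2L)
)

# Number of records per entity, and number of distinct entities.
cluster_sizes <- as.integer(table(toy$cluster))
n_entities    <- length(cluster_sizes)

# A cluster of size n contributes n * (n - 1) / 2 true matching pairs.
n_true_pairs <- sum(choose(cluster_sizes, 2))

# Enumerate the pairs: self-join on cluster, then keep each unordered
# pair once by requiring the left id to be smaller than the right id.
true_pairs <- merge(toy, toy, by = "cluster", suffixes = c("_l", "_r"))
true_pairs <- true_pairs[true_pairs$unique_id_l < true_pairs$unique_id_r, ]
```

A pair list of this shape is a convenient reference when checking the predictions of a linkage model, since any predicted pair absent from it is a false positive.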

See also

fake_1000_labels for pairwise clerical labels.