Splink Fake 1000: Deduplication Benchmark

A dataset of 1,000 synthetic records representing 181 unique people, each appearing a varying number of times. The duplicate records have been corrupted with typographical errors, missing values, and other realistic data-quality issues. This is the primary demo dataset from the Python splink library.

Usage

fake_1000

Format

A tibble with 1,000 rows and 7 columns:

unique_id: Integer. Row identifier (0-indexed).
first_name: Character. Given name, sometimes corrupted or missing.
surname: Character. Family name, sometimes corrupted or missing.
dob: Character. Date of birth in YYYY-MM-DD format.
city: Character. City of residence, sometimes missing.
email: Character. Email address, sometimes corrupted or missing.
cluster: Integer. Ground-truth entity label (0-indexed).

Source

From the splink datasets repository maintained by the UK Ministry of Justice Analytical Services: https://github.com/moj-analytical-services/splink_datasets. Original data generated by the splink team (Linacre et al.) under the MIT license.

Details

The cluster column provides ground-truth entity labels: records sharing the same cluster value refer to the same person. The unique_id column provides a unique identifier for each row, starting at 0 (matching splink convention).
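
As a minimal sketch of how the ground-truth labels can be used, the base R code below counts the distinct entities and derives the set of true-match record pairs from the cluster column. It runs on a small made-up data frame (toy, a hypothetical stand-in that only mimics the unique_id and cluster columns) so that it is self-contained; the values are illustrative and not taken from fake_1000.

```r
# Toy stand-in with the same unique_id / cluster layout as fake_1000.
# The values are invented for illustration only.
toy <- data.frame(
  unique_id = 0:5,
  cluster   = c(0L, 0L, 0L, 1L, 1L, 2L)
)

# Number of distinct entities, and number of records per entity.
n_entities   <- length(unique(toy$cluster))
cluster_size <- as.vector(table(toy$cluster))

# Every pair of rows inside a cluster is a true match,
# so a cluster of size n contributes n * (n - 1) / 2 pairs.
n_true_pairs <- sum(choose(cluster_size, 2))

# Enumerate the true-match pairs with a self-join on cluster,
# keeping each unordered pair exactly once.
pairs      <- merge(toy, toy, by = "cluster", suffixes = c("_l", "_r"))
true_pairs <- pairs[pairs$unique_id_l < pairs$unique_id_r, ]
```

Assuming the dataset has been loaded, substituting fake_1000 for toy should give the same quantities for the real data. The resulting pair table is the usual reference set for scoring a deduplication model's predicted matches.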
