A small, hand-crafted dataset of 20 records representing 5 unique people. Each person has four records with varying levels of corruption: exact matches, minor typos, and slightly shifted dates of birth. Designed for quick examples and unit tests.
Format
A tibble with 20 rows and 6 columns:
- first_name
Character. Given name, sometimes corrupted.
- surname
Character. Family name, sometimes corrupted.
- dob
Character. Date of birth in YYYY-MM-DD format, sometimes shifted by one day.
- city
Character. City of residence.
- email
Character. Email address, sometimes corrupted.
- cluster
Integer. Ground-truth entity label (1 to 5).
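Because cluster is the ground-truth label, it fixes how many record pairs are true matches, which is a convenient constant to assert in unit tests. The sketch below is illustrative only: it uses a stand-in cluster vector (5 entities, 4 records each, as described above) rather than the packaged data, and the row order of the real dataset is an assumption that does not affect the counts.

```r
# Stand-in for the cluster column: 5 entities with 4 records each.
# The real dataset's row order may differ; the pair counts do not depend on it.
cluster <- rep(1:5, each = 4)

# Enumerate every unordered pair of record indices (one pair per column).
pairs <- combn(length(cluster), 2)

# A pair is a true match when both records carry the same cluster label.
is_match <- cluster[pairs[1, ]] == cluster[pairs[2, ]]

n_pairs   <- ncol(pairs)    # total candidate pairs: choose(20, 2)
n_matches <- sum(is_match)  # true matching pairs: 5 * choose(4, 2)
```

With 20 records there are 190 candidate pairs, of which 30 are true matches, so a linkage method evaluated on this dataset has perfect recall only if it recovers all 30.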
See also
fake_1000 for a larger benchmark dataset.
