The FEBRL (Freely Extensible Biomedical Record Linkage) dataset 4a
contains 5,000 original records.
It is designed to be linked against febrl4b, which contains one
duplicate record per original.
Ground truth is encoded in rec_id: records sharing the same base ID
(e.g., rec-1070-org and rec-1070-dup-0) refer to the same entity.
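The base ID can be recovered by stripping the origin suffix from rec_id. The following is a minimal base R sketch, not part of the package: it assumes suffixes take only the forms -org and -dup-<n> seen in the example above, and the rec-561 identifiers are made up for illustration.

```r
# Illustrative identifiers; rec-561-* are hypothetical, not taken from the data.
rec_ids <- c("rec-1070-org", "rec-1070-dup-0", "rec-561-org", "rec-561-dup-0")

# Strip the origin suffix ("-org" or "-dup-<n>") to obtain the base entity ID.
# Assumption: no other suffix forms occur.
entity_id <- sub("-(org|dup-[0-9]+)$", "", rec_ids)

# Two records are a true match when their base IDs agree.
same_entity <- entity_id[1] == entity_id[2]      # rec-1070-org vs rec-1070-dup-0
different_entity <- entity_id[1] != entity_id[3] # rec-1070-org vs rec-561-org
```

Applied to the rec_id columns of both tables, the same expression yields a key that can be compared against predicted links when evaluating a linkage model.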
Format
A tibble with 5,000 rows and 11 columns:
- rec_id
Character. Record identifier encoding entity and origin (-org suffix).
- given_name
Character. Given name, sometimes missing.
- surname
Character. Family name, sometimes missing.
- street_number
Integer. Street number, sometimes missing.
- address_1
Character. Primary address line, sometimes missing.
- address_2
Character. Secondary address line, often missing.
- suburb
Character. Suburb or neighborhood.
- postcode
Integer. Postal code.
- state
Character. Australian state abbreviation.
- date_of_birth
Integer. Date of birth as YYYYMMDD integer, sometimes missing.
- soc_sec_id
Integer. Social security identifier.
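Because date_of_birth is stored as a YYYYMMDD integer, it usually needs converting before any date arithmetic or date-based comparison. The following is a minimal base R sketch under that assumption; the values shown are made up and are not rows from the dataset.

```r
# Hypothetical date_of_birth values in YYYYMMDD integer form, one missing.
dob <- c(19150711L, NA, 19840229L)

# Convert via character so the digits are parsed positionally as year, month, day.
# Missing values stay NA.
dob_date <- as.Date(as.character(dob), format = "%Y%m%d")

# ISO-formatted strings, handy for a quick visual check.
dob_iso <- format(dob_date)
```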
Source
Distributed via the splink datasets repository (https://github.com/moj-analytical-services/splink_datasets) under the MIT license. The FEBRL datasets originate from Christen and Churches (2004) and are widely used as record-linkage benchmarks.
References
Christen, P. and Churches, T. (2004). Febrl – Freely Extensible Biomedical Record Linkage. Australian National University.
See also
febrl4b for the corresponding duplicate records.
