What is record linkage?
Record linkage, also called entity resolution or deduplication, identifies records in one or more datasets that refer to the same real-world entity. When datasets do not share a unique identifier, you must rely on imperfect fields such as names, dates of birth, and addresses. Probabilistic record linkage estimates the chance that two records are a match based on how similar they are across several fields.
irelink implements the Fellegi-Sunter model of
probabilistic record linkage. It estimates parameters with unsupervised
expectation maximization, so you can get started without labeled
training data.
A typical workflow
Every linkage task follows the same general pattern:
- Define a specification. Choose which columns to compare and how.
- Build a model. Load data into a SQL backend and attach the specification.
- Train parameters. Estimate u-probabilities, then run EM to learn m-probabilities.
- Predict. Score candidate pairs and keep the likely matches.
- Cluster. Resolve pairwise links into groups that represent the same entity.
The example below walks through each step using a small built-in dataset.
Step 1: Define a specification
A specification defines the comparisons and blocking rules that drive
the model. Comparisons tell irelink how to score similarity
on each field, and blocking rules limit which record pairs are compared
so linkage stays tractable on large datasets.
library(irelink)
#>
#> Attaching package: 'irelink'
#> The following object is masked from 'package:base':
#>
#> months
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)
spec
#> Linkage Specification
#> Comparisons (3):
#> first_name : jaro_winkler
#> surname : jaro_winkler
#> dob : exact
#> Blocking rules (2, OR-ed):
#> 1. surname
#> 2. first_name
Each call to il_compare() adds one comparison dimension.
Here, cl_jaro_winkler(0.9, 0.7) creates three levels:
similarity of at least 0.9 is level 2, similarity of at least 0.7 but
below 0.9 is level 1, and anything lower is level 0. cl_exact() is a
simple binary match.
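The thresholds can be read as cut points on a similarity score. A minimal sketch in base R, assuming the similarity scores have already been computed; the scores are made up for illustration, and similarity_to_level() is a hypothetical helper, not part of irelink:

```r
# Hypothetical helper: map similarity scores to comparison levels.
# Thresholds are given highest first, as in cl_jaro_winkler(0.9, 0.7).
similarity_to_level <- function(similarity, thresholds = c(0.9, 0.7)) {
  # findInterval() needs ascending cut points. Each interval is closed on the
  # left, so a score equal to a threshold lands in the higher level.
  findInterval(similarity, sort(thresholds))
}

scores <- c(0.95, 0.90, 0.82, 0.70, 0.40)  # made-up similarity scores
gamma <- similarity_to_level(scores)
gamma
#> [1] 2 2 1 1 0
```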
Blocking rules defined with il_block_on() restrict
candidate pairs to records that share the same value in the blocking
column. Multiple blocking rules use OR logic, so a pair is compared if
it satisfies any one of them.
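To illustrate the OR logic, the sketch below generates candidate pairs in base R by self-joining on each blocking column and taking the union. It describes the idea only and makes no claim about how irelink builds pairs in SQL; the people data frame and the block_pairs() helper are hypothetical.

```r
# Hypothetical toy records.
people <- data.frame(
  unique_id  = 1:5,
  first_name = c("ann", "ann", "bob", "rob", "ann"),
  surname    = c("lee", "li", "lee", "ray", "lee"),
  stringsAsFactors = FALSE
)

# Pairs of distinct records that share the same value in one column.
block_pairs <- function(df, column) {
  joined <- merge(df, df, by = column, suffixes = c("_l", "_r"))
  # Keep each unordered pair once and drop self-pairs.
  joined <- joined[joined$unique_id_l < joined$unique_id_r, ]
  joined[, c("unique_id_l", "unique_id_r")]
}

# OR logic: a pair is a candidate if any rule produces it.
# unique() removes pairs produced by more than one rule, such as (1, 5).
candidates <- unique(rbind(
  block_pairs(people, "surname"),
  block_pairs(people, "first_name")
))

nrow(candidates)          # 5 candidate pairs
choose(nrow(people), 2)   # 10 pairs without blocking
```

Records 1 and 5 share both a surname and a first name, so both rules produce that pair, but it is scored only once.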
Step 2: Build a model
il_model() uploads the data to a SQL backend and
attaches the specification. Any DBI-compatible connection works. Here we
use an in-memory DuckDB database:
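A minimal sketch of opening such a connection, assuming the DBI and duckdb packages are installed. It covers only the connection setup; the arguments that il_model() takes are not reproduced in this text, so that call is left out. The people table is hypothetical toy data, not the built-in dataset.

```r
library(DBI)

# With no database path, duckdb::duckdb() creates an in-memory database.
con <- dbConnect(duckdb::duckdb())

# Hypothetical toy records, uploaded to confirm the connection works.
people <- data.frame(
  unique_id  = 1:3,
  first_name = c("ann", "anne", "bob"),
  surname    = c("lee", "lee", "ray"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "people", people)

record_count <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM people")$n
record_count
```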
Step 3: Train parameters
Training has two main steps. First, estimate u-probabilities: for each comparison, the probability that a pair of non-matching records falls into each comparison level.
model <- il_estimate_u(model)
Next, run expectation maximization to learn m-probabilities: the probability that a pair of truly matching records falls into each level. You provide a blocking rule to generate the training pairs:
model <- il_estimate_em(model, block_on(surname))
#> EM trained: first_name and dob | skipped (blocked on):
#> surname
Every training pair already agrees on the blocking column, so EM cannot learn parameters for surname in this pass; the output lists it as skipped. You can inspect the learned parameters at any time:
il_weights(model)
#> # A tibble: 8 × 5
#> comparison gamma_level m_prob u_prob weight
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 first_name 0 0.0114 0.832 -6.18
#> 2 first_name 1 0.196 0.0632 1.63
#> 3 first_name 2 0.792 0.105 2.91
#> 4 surname 0 0.05 0.821 -4.04
#> 5 surname 1 0.05 0.0368 0.441
#> 6 surname 2 0.9 0.142 2.66
#> 7 dob 0 0.280 0.921 -1.72
#> 8 dob 1 0.720 0.0789 3.19
Step 4: Predict
predict() scores candidate pairs and returns those above
a match-probability threshold:
pairs <- predict(model, threshold = 0.5)
head(pairs)
#> # A tibble: 6 × 8
#> unique_id_l unique_id_r gamma_first_name gamma_surname gamma_dob match_weight
#> <int> <int> <int> <int> <int> <dbl>
#> 1 8 17 1 2 1 7.49
#> 2 10 20 2 2 0 3.86
#> 3 1 2 2 2 1 8.76
#> 4 4 13 2 2 1 8.76
#> 5 10 19 1 2 1 7.49
#> 6 5 6 2 2 1 8.76
#> # ℹ 2 more variables: total_match_weight <dbl>, match_probability <dbl>
Each row is a candidate pair. The output includes the left and right
record identifiers, the per-comparison gamma values, the evidence-only
match_weight, the prior-inclusive total_match_weight, and the posterior
match_probability.
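The numbers shown in the il_weights() and predict() output are consistent with each level weight being log2(m / u) and with match_weight being the sum of the level weights a pair receives. The base R sketch below checks that arithmetic using the printed, rounded values. The pair_weight() and weight_to_probability() helpers are hypothetical, and the conversion to a probability assumes that total_match_weight is a base-2 log of posterior odds, which this text does not confirm.

```r
# m- and u-probabilities copied from the il_weights() output above.
params <- data.frame(
  comparison  = rep(c("first_name", "surname", "dob"), times = c(3, 3, 2)),
  gamma_level = c(0, 1, 2, 0, 1, 2, 0, 1),
  m_prob = c(0.0114, 0.196, 0.792, 0.05, 0.05, 0.9, 0.280, 0.720),
  u_prob = c(0.832, 0.0632, 0.105, 0.821, 0.0368, 0.142, 0.921, 0.0789),
  stringsAsFactors = FALSE
)

# Level weight: base-2 log of the ratio between m and u.
params$weight <- log2(params$m_prob / params$u_prob)

# Hypothetical helper: sum the level weights observed for one pair.
pair_weight <- function(gamma_first_name, gamma_surname, gamma_dob) {
  lookup <- function(field, level) {
    params$weight[params$comparison == field & params$gamma_level == level]
  }
  lookup("first_name", gamma_first_name) +
    lookup("surname", gamma_surname) +
    lookup("dob", gamma_dob)
}

# Row 1 of the predict() output has gamma values 1, 2, 1.
pair_weight(1, 2, 1)  # roughly 7.49, up to rounding of the printed inputs

# Assumption: a total weight in base-2 log odds maps to a probability as below.
weight_to_probability <- function(total_weight) {
  2^total_weight / (1 + 2^total_weight)
}
weight_to_probability(0)  # 0.5: even odds
```

A level with m below u, such as disagreement on dob, contributes a negative weight, which is why the pair with gamma_dob of 0 scores much lower than the others.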
Step 5: Cluster
il_cluster() resolves pairwise predictions into entity
clusters with connected-components analysis:
clusters <- il_cluster(pairs)
head(clusters)
#> # A tibble: 6 × 2
#> unique_id cluster_id
#> <chr> <chr>
#> 1 20 cluster_10
#> 2 7 cluster_17
#> 3 13 cluster_13
#> 4 8 cluster_17
#> 5 10 cluster_10
#> 6 9 cluster_10
Each record is assigned a cluster_id. Records in the
same cluster are treated as the same entity.
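Connected components can be sketched with a small union-find routine in base R. This illustrates the technique and is not irelink's implementation. The links are the six pairs printed by head(pairs), so the result covers only those records, and labeling each cluster by its smallest member is a choice made for this sketch.

```r
# Pairwise links copied from the first six predicted pairs.
links <- data.frame(
  left  = c(8, 10, 1, 4, 10, 5),
  right = c(17, 20, 2, 13, 19, 6)
)

connected_components <- function(links) {
  ids <- sort(unique(c(links$left, links$right)))
  parent <- setNames(ids, ids)  # every record starts as its own root

  # Follow parent pointers until reaching a record that is its own parent.
  find_root <- function(id) {
    while (parent[[as.character(id)]] != id) {
      id <- parent[[as.character(id)]]
    }
    id
  }

  # Merge the two components joined by each link, keeping the smaller root.
  for (i in seq_len(nrow(links))) {
    a <- find_root(links$left[i])
    b <- find_root(links$right[i])
    if (a != b) {
      parent[[as.character(max(a, b))]] <- min(a, b)
    }
  }

  data.frame(
    unique_id  = ids,
    cluster_id = vapply(ids, find_root, numeric(1))
  )
}

components <- connected_components(links)
components
```

Records 10, 19, and 20 end up in one cluster even though 19 and 20 were never compared directly: the link is transitive through record 10.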
Comparison levels
irelink includes a large set of comparison levels for
common field types:
| Level | Use case |
|---|---|
| cl_exact() | Binary exact match |
| cl_jaro_winkler() | Names, short strings |
| cl_levenshtein() | General fuzzy strings |
| cl_damerau_levenshtein() | Strings with transpositions |
| cl_jaro() | Lightweight string similarity |
| cl_jaccard() | Token-set overlap |
| cl_cosine() | Embedding similarity |
| cl_numeric_diff() | Numeric fields (e.g., age) |
| cl_pct_diff() | Percentage difference |
| cl_date_diff() | Date fields |
| cl_time_diff() | Time fields |
| cl_geo_distance() | Geographic coordinates |
| cl_array_intersect() | Array or set overlap |
For common field types, domain-specific helpers combine multiple levels into a single call:
| Helper | Fields |
|---|---|
| cl_name() | Generic name field |
| cl_first_last_name() | First name and last name as separate fields |
| cl_forename_surname() | Forename and surname with transposition |
| cl_dob() | Date of birth |
| cl_email() | Email addresses |
| cl_postcode() | UK postal codes |
| cl_zip_code() | US ZIP codes |
Evaluation
If you have labeled data, meaning pairs that are known matches or
non-matches, irelink provides tools to assess model
quality:
- il_accuracy(): overall accuracy at a threshold
- il_precision_recall(): precision and recall across thresholds
- il_roc(): ROC curve data
- il_errors(): inspect false positives and false negatives
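To make the metrics concrete, the sketch below computes precision and recall at a single threshold in base R. The labeled pairs are made up, and precision_recall() is a hypothetical helper that does not reproduce the signature or output of il_precision_recall().

```r
# Made-up labeled pairs: a predicted probability and the true match status.
labeled <- data.frame(
  match_probability = c(0.98, 0.91, 0.75, 0.62, 0.40, 0.15),
  is_match          = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Hypothetical helper: precision and recall at one threshold.
precision_recall <- function(scores, truth, threshold) {
  predicted <- scores >= threshold
  tp <- sum(predicted & truth)   # predicted matches that are true matches
  fp <- sum(predicted & !truth)  # predicted matches that are not
  fn <- sum(!predicted & truth)  # true matches that were missed
  # Precision is NaN when no pair reaches the threshold.
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

precision_recall(labeled$match_probability, labeled$is_match, threshold = 0.5)
precision_recall(labeled$match_probability, labeled$is_match, threshold = 0.8)
```

Raising the threshold from 0.5 to 0.8 removes the one false positive but also drops a true match, so precision rises from 0.75 to 1 while recall falls from 1 to 2/3.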
Cleaning up
When you are done, release the database resources owned by the model.
If an interactive session has left behind models that you no longer
reference, call il_cleanup_all(con) before disconnecting to drop every
irelink table on the connection.
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)