What is record linkage?

Record linkage, also called entity resolution or deduplication, identifies records in one or more datasets that refer to the same real-world entity. When datasets do not share a unique identifier, you must rely on imperfect fields such as names, dates of birth, and addresses. Probabilistic record linkage estimates the chance that two records are a match based on how similar they are across several fields.

irelink implements the Fellegi-Sunter model of probabilistic record linkage. It estimates parameters with unsupervised expectation maximization, so you can get started without labeled training data.

A typical workflow

Every linkage task follows the same general pattern:

  1. Define a specification. Choose which columns to compare and how.
  2. Build a model. Load data into a SQL backend and attach the specification.
  3. Train parameters. Estimate u-probabilities, then run EM to learn m-probabilities.
  4. Predict. Score candidate pairs and keep the likely matches.
  5. Cluster. Resolve pairwise links into groups that represent the same entity.

The example below walks through each step using a small built-in dataset.

Step 1: Define a specification

A specification defines the comparisons and blocking rules that drive the model. Comparisons tell irelink how to score similarity on each field, and blocking rules limit which record pairs are compared so linkage stays tractable on large datasets.

library(irelink)
#> 
#> Attaching package: 'irelink'
#> The following object is masked from 'package:base':
#> 
#>     months

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)

spec
#> Linkage Specification
#>   Comparisons (3):
#>     first_name : jaro_winkler
#>     surname : jaro_winkler
#>     dob : exact
#>   Blocking rules (2, OR-ed):
#>     1. surname
#>     2. first_name

Each call to il_compare() adds one comparison dimension. Here, cl_jaro_winkler(0.9, 0.7) creates three levels: a similarity of at least 0.9 is level 2, a similarity of at least 0.7 but below 0.9 is level 1, and anything lower is level 0. cl_exact() is a binary comparison: level 1 for an exact match, level 0 otherwise.
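
The threshold-to-level mapping described above can be sketched in base R. This is only an illustration of the rule, not irelink's implementation; similarity_to_level() is a hypothetical helper, and the similarity scores are made-up numbers rather than real Jaro-Winkler output.

```r
# Hypothetical helper: map a similarity score to a comparison level by
# counting how many thresholds the score meets or exceeds.
similarity_to_level <- function(similarity, thresholds = c(0.9, 0.7)) {
  vapply(similarity, function(s) sum(s >= thresholds), integer(1))
}

# Made-up similarity scores, including values exactly on each threshold.
levels_out <- similarity_to_level(c(0.95, 0.90, 0.82, 0.70, 0.41))
levels_out
```

Scores that sit exactly on a threshold fall into the higher level, which matches the "at least" wording in the text.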

Blocking rules defined with il_block_on() restrict candidate pairs to records that share the same value in the blocking column. Multiple blocking rules use OR logic, so a pair is compared if it satisfies any one of them.
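
To make the OR logic concrete, here is a minimal base R sketch on a made-up four-row table. It enumerates every pair and then filters, which is fine for a toy example; a SQL backend would presumably generate the pairs with joins instead. The records data frame and its values are invented for illustration.

```r
# Toy records, invented for this example.
records <- data.frame(
  unique_id  = 1:4,
  first_name = c("ann", "anne", "ann", "bob"),
  surname    = c("lee", "lee", "li", "ray"),
  stringsAsFactors = FALSE
)

# Every unordered pair of rows (left index < right index).
idx <- t(combn(nrow(records), 2))
l <- records[idx[, 1], ]
r <- records[idx[, 2], ]

# A pair becomes a candidate if it satisfies ANY blocking rule.
same_surname    <- l$surname == r$surname
same_first_name <- l$first_name == r$first_name
all_pairs  <- data.frame(unique_id_l = l$unique_id, unique_id_r = r$unique_id)
candidates <- all_pairs[same_surname | same_first_name, ]
candidates
```

Of the six possible pairs, only the two that share a surname or a first name survive; the pair "anne lee" and "ann li" shares neither and is never scored.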

Step 2: Build a model

il_model() uploads the data to a SQL backend and attaches the specification. Any DBI-compatible connection works. Here we use an in-memory DuckDB database:

df <- fake_20
con <- DBI::dbConnect(duckdb::duckdb())

model <- il_model(df, spec = spec, con = con)
model
#> irelink Model
#>   Status: Untrained
#>   Link type: dedupe
#>   Records: 20
#>   Comparisons: 3
#>   Blocking rules: 2

Step 3: Train parameters

Training has two main steps. First, estimate u-probabilities, which are the chances that two random non-matching records agree at each comparison level:

model <- il_estimate_u(model)
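
One common way to approximate u-probabilities is to compare random pairs of records and measure how often they agree, on the assumption that true matches are rare among random pairs. Whether il_estimate_u() works exactly this way is not stated here, so treat the base R sketch below as an illustration of the idea only; the dob values are invented.

```r
# Invented dates of birth for five records.
dob <- c("1990-01-01", "1990-01-01", "1985-06-15", "1972-11-30", "1985-06-15")

# Compare every unordered pair and record exact agreement.
pair_idx <- combn(length(dob), 2)
agree <- dob[pair_idx[1, ]] == dob[pair_idx[2, ]]

# Treating all pairs as non-matches, the agreement rate approximates
# u for the "exact match" level; its complement is u for level 0.
u_level_1 <- mean(agree)
u_level_0 <- 1 - u_level_1
c(level_0 = u_level_0, level_1 = u_level_1)
```

On a dataset this small the approximation is crude, because the agreeing pairs may well be true duplicates. It becomes more reasonable as the number of records grows and matches make up a vanishing share of all pairs.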

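Next, run expectation maximization to learn m-probabilities, which are the chances that true matches agree at each level. You provide a blocking rule to generate the training pairs. Every training pair already agrees on the blocked column, so that column's comparison cannot be trained in the same pass; here the surname comparison is skipped: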

model <- il_estimate_em(model, block_on(surname))
#> EM trained: first_name and dob | skipped (blocked on):
#> surname
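
For readers who want to see what an EM pass does, the following base R sketch fits a two-field Fellegi-Sunter model with binary agreement indicators, holding u fixed and updating m and the match rate. It is a simplified illustration under stated assumptions, not irelink's code: the agreement patterns, the fixed u values, the starting values, and the names pattern_likelihood() and em_fit() are all made up for this example.

```r
# Each row is one candidate pair; each column says whether a field agrees.
# The counts of each pattern are invented.
gamma <- rbind(
  matrix(c(1, 1), nrow = 30, ncol = 2, byrow = TRUE),
  matrix(c(1, 0), nrow = 10, ncol = 2, byrow = TRUE),
  matrix(c(0, 1), nrow = 10, ncol = 2, byrow = TRUE),
  matrix(c(0, 0), nrow = 50, ncol = 2, byrow = TRUE)
)

# Probability of each row's pattern given per-field agreement probabilities,
# assuming fields are conditionally independent.
pattern_likelihood <- function(gamma, prob) {
  apply(gamma, 1, function(g) prod(prob^g * (1 - prob)^(1 - g)))
}

em_fit <- function(gamma, u, m = rep(0.8, ncol(gamma)), lambda = 0.2,
                   iterations = 50) {
  for (i in seq_len(iterations)) {
    # E-step: posterior probability that each pair is a match.
    match_part    <- lambda * pattern_likelihood(gamma, m)
    nonmatch_part <- (1 - lambda) * pattern_likelihood(gamma, u)
    p_match <- match_part / (match_part + nonmatch_part)

    # M-step: re-estimate the match rate and m from the weighted pairs.
    lambda <- mean(p_match)
    m <- colSums(gamma * p_match) / sum(p_match)
  }
  list(lambda = lambda, m = m)
}

fit <- em_fit(gamma, u = c(0.1, 0.1))
fit
```

No labels are involved at any point: the algorithm alternates between guessing which pairs are matches and re-estimating the parameters from those guesses.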

You can inspect the learned parameters at any time:

il_weights(model)
#> # A tibble: 8 × 5
#>   comparison gamma_level m_prob u_prob weight
#>   <chr>            <int>  <dbl>  <dbl>  <dbl>
#> 1 first_name           0 0.0114 0.832  -6.18 
#> 2 first_name           1 0.196  0.0632  1.63 
#> 3 first_name           2 0.792  0.105   2.91 
#> 4 surname              0 0.05   0.821  -4.04 
#> 5 surname              1 0.05   0.0368  0.441
#> 6 surname              2 0.9    0.142   2.66 
#> 7 dob                  0 0.280  0.921  -1.72 
#> 8 dob                  1 0.720  0.0789  3.19
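
The weight column is consistent with the usual Fellegi-Sunter match weight, the base-2 logarithm of m_prob / u_prob. The base R check below re-enters the printed values by hand and recomputes the weights; it is a sanity check on the displayed table, not irelink code, and the small differences arise because the probabilities are printed to three significant digits.

```r
# Values copied by hand from the il_weights() output above.
weights_tbl <- data.frame(
  comparison     = c(rep("first_name", 3), rep("surname", 3), rep("dob", 2)),
  gamma_level    = c(0:2, 0:2, 0:1),
  m_prob         = c(0.0114, 0.196, 0.792, 0.05, 0.05, 0.9, 0.280, 0.720),
  u_prob         = c(0.832, 0.0632, 0.105, 0.821, 0.0368, 0.142, 0.921, 0.0789),
  printed_weight = c(-6.18, 1.63, 2.91, -4.04, 0.441, 2.66, -1.72, 3.19)
)

# Recompute each weight as log2 of the m/u ratio.
weights_tbl$recomputed <- log2(weights_tbl$m_prob / weights_tbl$u_prob)

# Largest gap between the printed and recomputed weights.
max_gap <- max(abs(weights_tbl$recomputed - weights_tbl$printed_weight))
max_gap
```

A positive weight means the level is more likely among matches than among non-matches; a negative weight means the opposite.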

Step 4: Predict

predict() scores candidate pairs and returns those above a match-probability threshold:

pairs <- predict(model, threshold = 0.5)
head(pairs)
#> # A tibble: 6 × 8
#>   unique_id_l unique_id_r gamma_first_name gamma_surname gamma_dob match_weight
#>         <int>       <int>            <int>         <int>     <int>        <dbl>
#> 1           8          17                1             2         1         7.49
#> 2          10          20                2             2         0         3.86
#> 3           1           2                2             2         1         8.76
#> 4           4          13                2             2         1         8.76
#> 5          10          19                1             2         1         7.49
#> 6           5           6                2             2         1         8.76
#> # ℹ 2 more variables: total_match_weight <dbl>, match_probability <dbl>

Each row is a candidate pair. The output includes the left and right record identifiers, the per-comparison gamma values, the evidence-only match_weight, the prior-inclusive total_match_weight, and the posterior match_probability.
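
In the output above, match_weight is consistent with summing the per-comparison weights selected by each gamma value. The base R sketch below reproduces the first three rows from the printed weights table. The conversion to a probability assumes that total_match_weight is a base-2 log of the posterior odds, which is a common convention but is not confirmed here; prior_match_rate is a hypothetical number, and pair_match_weight() and weight_to_probability() are illustrative helpers, not irelink functions.

```r
# Per-level weights copied from the il_weights() output.
level_weight <- list(
  first_name = c(`0` = -6.18, `1` = 1.63,  `2` = 2.91),
  surname    = c(`0` = -4.04, `1` = 0.441, `2` = 2.66),
  dob        = c(`0` = -1.72, `1` = 3.19)
)

# Illustrative helper: add up the weights picked out by a pair's gamma values.
pair_match_weight <- function(gamma_first_name, gamma_surname, gamma_dob) {
  unname(
    level_weight$first_name[as.character(gamma_first_name)] +
      level_weight$surname[as.character(gamma_surname)] +
      level_weight$dob[as.character(gamma_dob)]
  )
}

# Gamma values from the first three rows of head(pairs).
recomputed <- pair_match_weight(c(1, 2, 2), c(2, 2, 2), c(1, 0, 1))
recomputed

# Assumed convention: a log2-odds weight w maps to probability 2^w / (1 + 2^w).
weight_to_probability <- function(w) 2^w / (1 + 2^w)

# Hypothetical prior: 1 in 10 candidate pairs is a match.
prior_match_rate <- 0.1
prior_weight <- log2(prior_match_rate / (1 - prior_match_rate))
posterior <- weight_to_probability(recomputed + prior_weight)
posterior
```

The recomputed sums differ from the printed match_weight values only in the last digit, which is what you would expect when adding weights that were themselves rounded for display.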

Step 5: Cluster

il_cluster() resolves pairwise predictions into entity clusters with connected-components analysis:

clusters <- il_cluster(pairs)
head(clusters)
#> # A tibble: 6 × 2
#>   unique_id cluster_id
#>   <chr>     <chr>     
#> 1 20        cluster_10
#> 2 7         cluster_17
#> 3 13        cluster_13
#> 4 8         cluster_17
#> 5 10        cluster_10
#> 6 9         cluster_10

Each record is assigned a cluster_id. Records in the same cluster are treated as the same entity.
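
Connected-components clustering can be sketched in a few lines of base R. The version below repeatedly pushes the smallest label across each link until nothing changes. It uses only the six pairs shown by head(pairs) above, and it names each cluster after its smallest member, which is a choice made for this example and not irelink's labeling scheme; connected_components() is a hypothetical helper.

```r
# Hypothetical helper: connected components by label propagation.
connected_components <- function(left, right) {
  ids <- sort(unique(c(left, right)))
  label <- setNames(ids, ids)  # every record starts in its own cluster
  repeat {
    changed <- FALSE
    for (k in seq_along(left)) {
      a <- as.character(left[k])
      b <- as.character(right[k])
      low <- min(label[[a]], label[[b]])
      if (label[[a]] != low || label[[b]] != low) {
        label[[a]] <- low
        label[[b]] <- low
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  data.frame(unique_id = ids, cluster_id = unname(label))
}

# The six links printed by head(pairs).
left  <- c(8, 10, 1, 4, 10, 5)
right <- c(17, 20, 2, 13, 19, 6)
clusters_sketch <- connected_components(left, right)
clusters_sketch
```

Records 19 and 20 never appear in a pair together, but both link to record 10, so all three end up in one cluster. This transitivity is what turns pairwise predictions into entities.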

Comparison levels

irelink includes comparison levels for the following kinds of fields:

Level                     Use case
cl_exact()                Binary exact match
cl_jaro_winkler()         Names, short strings
cl_levenshtein()          General fuzzy strings
cl_damerau_levenshtein()  Strings with transpositions
cl_jaro()                 Lightweight string similarity
cl_jaccard()              Token-set overlap
cl_cosine()               Embedding similarity
cl_numeric_diff()         Numeric fields (e.g., age)
cl_pct_diff()             Percentage difference
cl_date_diff()            Date fields
cl_time_diff()            Time fields
cl_geo_distance()         Geographic coordinates
cl_array_intersect()      Array or set overlap

For common field types, domain-specific helpers combine multiple levels into a single call:

Helper                  Fields
cl_name()               Generic name field
cl_first_last_name()    First name and last name as separate fields
cl_forename_surname()   Forename and surname with transposition
cl_dob()                Date of birth
cl_email()              Email addresses
cl_postcode()           UK postal codes
cl_zip_code()           US ZIP codes

Evaluation

If you have labeled data, meaning pairs that are known matches or non-matches, irelink provides tools to assess model quality.
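
As a rough idea of what such an assessment involves, the base R sketch below computes precision, recall, and F1 by hand from a small set of labeled pairs. It does not use irelink's evaluation functions, whose names are not shown here; the labels and probabilities are invented, and the 0.5 threshold simply mirrors the one used in Step 4.

```r
# Invented labeled pairs with model scores.
labeled <- data.frame(
  is_match          = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  match_probability = c(0.97, 0.81, 0.35, 0.62, 0.10, 0.02)
)

# Classify each pair at the chosen threshold.
predicted <- labeled$match_probability >= 0.5

true_pos  <- sum(predicted & labeled$is_match)
false_pos <- sum(predicted & !labeled$is_match)
false_neg <- sum(!predicted & labeled$is_match)

precision <- true_pos / (true_pos + false_pos)  # share of predicted links that are right
recall    <- true_pos / (true_pos + false_neg)  # share of true matches that were found
f1        <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, f1 = f1)
```

Raising the threshold generally trades recall for precision, so it is worth recomputing these figures at several thresholds before settling on one.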

Cleaning up

When you are done, release the database resources owned by the model. If an interactive session has left behind models you no longer need, call il_cleanup_all(con) before disconnecting to drop every irelink table on the connection.

il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)