Getting Started • irelink

What is record linkage?

Record linkage, also called entity resolution or deduplication, identifies records in one or more datasets that refer to the same real-world entity. When datasets do not share a unique identifier, you must rely on imperfect fields such as names, dates of birth, and addresses. Probabilistic record linkage estimates the probability that two records match from how closely they agree across several fields.

irelink implements the Fellegi-Sunter model of probabilistic record linkage. It estimates parameters with unsupervised expectation maximization, so you can get started without labeled training data.

A typical workflow

Every linkage task follows the same general pattern:

1. Define a specification. Choose which columns to compare and how.
2. Build a model. Load data into a SQL backend and attach the specification.
3. Train parameters. Estimate u-probabilities, then run EM to learn m-probabilities.
4. Predict. Score candidate pairs and keep the likely matches.
5. Cluster. Resolve pairwise links into groups that represent the same entity.

The example below walks through each step using a small built-in dataset.

Step 1: Define a specification

A specification defines the comparisons and blocking rules that drive the model. Comparisons tell irelink how to score similarity on each field, and blocking rules limit which record pairs are compared so linkage stays tractable on large datasets.

library(irelink)
#> 
#> Attaching package: 'irelink'
#> The following object is masked from 'package:base':
#> 
#>     months

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)

spec
#> Linkage Specification
#>   Comparisons (3):
#>     first_name : jaro_winkler
#>     surname : jaro_winkler
#>     dob : exact
#>   Blocking rules (2, OR-ed):
#>     1. surname
#>     2. first_name

Each call to il_compare() adds one comparison dimension. Here, cl_jaro_winkler(0.9, 0.7) creates three levels: similarity of at least 0.9 is level 2, similarity of at least 0.7 but below 0.9 is level 1, and anything below 0.7 is level 0. cl_exact() creates two levels: level 1 for an exact match and level 0 otherwise.
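
As a rough illustration of how thresholds turn a similarity score into a level, the base R sketch below maps made-up similarity scores onto levels with findInterval(). The scores and variable names are assumptions for illustration only, and this is not irelink's implementation.

```r
# Hypothetical Jaro-Winkler similarity scores for four candidate pairs.
similarity <- c(0.97, 0.90, 0.74, 0.41)

# Thresholds in ascending order. findInterval() counts how many thresholds
# are less than or equal to each score, which is exactly the level number.
thresholds <- c(0.7, 0.9)
gamma_level <- findInterval(similarity, thresholds)

data.frame(similarity, gamma_level)
```

The four scores land in levels 2, 2, 1, and 0. A score exactly on a threshold (0.90 here) counts as meeting it.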

Blocking rules defined with il_block_on() restrict candidate pairs to records that share the same value in the blocking column. Multiple blocking rules use OR logic, so a pair is compared if it satisfies any one of them.
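
To make the OR logic concrete, here is a minimal base R sketch that builds candidate pairs for two blocking columns and takes their union. The people data frame and the block_pairs() helper are hypothetical and exist only for this illustration; they are not part of irelink.

```r
# Toy records (hypothetical data).
people <- data.frame(
  id         = 1:5,
  first_name = c("ann", "anne", "bob", "ann", "rob"),
  surname    = c("lee", "lee", "ray", "kim", "ray"),
  stringsAsFactors = FALSE
)

# Self-join on one blocking column and keep each unordered pair once.
block_pairs <- function(df, col) {
  joined <- merge(df, df, by = col, suffixes = c("_l", "_r"))
  joined[joined$id_l < joined$id_r, c("id_l", "id_r")]
}

# OR logic: a pair is a candidate if either rule produces it.
candidates <- unique(rbind(
  block_pairs(people, "surname"),
  block_pairs(people, "first_name")
))
candidates <- candidates[order(candidates$id_l, candidates$id_r), ]
rownames(candidates) <- NULL
candidates
```

Only three of the ten possible pairs survive: (1, 2) and (3, 5) share a surname, and (1, 4) share a first name.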

Step 2: Build a model

il_model() uploads the data to a SQL backend and attaches the specification. Any DBI-compatible connection works. Here we use an in-memory DuckDB database:

df <- fake_20
con <- DBI::dbConnect(duckdb::duckdb())

model <- il_model(df, spec = spec, con = con)
model
#> irelink Model
#>   Status: Untrained
#>   Link type: dedupe
#>   Records: 20
#>   Comparisons: 3
#>   Blocking rules: 2

Step 3: Train parameters

Training has two main steps. First, estimate the u-probabilities, which give the probability that a pair of non-matching records lands in each comparison level:

model <- il_estimate_u(model)
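
One common way to approximate u-probabilities is to compare records at random: in most datasets nearly all random pairs are non-matches, so the share of random pairs at each level approximates u. The base R sketch below does this for an exact comparison on a made-up dob vector. It illustrates the idea only; how il_estimate_u() selects pairs is not described here, and the variable names are assumptions.

```r
# Hypothetical dates of birth for six records.
dob <- c("1990-01-01", "1990-01-01", "1985-05-05",
         "1972-03-09", "1985-05-05", "2001-12-31")

# Enumerate every unordered pair of records (15 pairs for 6 records).
idx <- combn(length(dob), 2)
agree <- dob[idx[1, ]] == dob[idx[2, ]]

# Treat all pairs as non-matches (assumption) and take level frequencies.
u_hat <- c(level_0 = mean(!agree), level_1 = mean(agree))
u_hat
```

Two of the 15 pairs agree, so the estimated u-probability for level 1 is 2/15.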

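Next, run expectation maximization to learn the m-probabilities, which give the probability that a pair of true matches lands in each comparison level. You provide a blocking rule to generate the training pairs. Every training pair agrees on the blocking column, so EM cannot learn parameters for that column and skips it: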

model <- il_estimate_em(model, block_on(surname))
#> EM trained: first_name and dob | skipped (blocked on):
#> surname
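
For readers who want to see what an EM pass does, the following is a minimal base R sketch under simplifying assumptions: two binary comparisons, u-probabilities held fixed, and made-up agreement patterns and starting values. It is a toy version of the technique, not irelink's implementation.

```r
# Toy agreement patterns for ten candidate pairs: 1 = agree, 0 = disagree.
gamma <- cbind(
  first_name = c(1, 1, 1, 0, 0, 0, 1, 0, 0, 0),
  dob        = c(1, 1, 0, 0, 0, 1, 1, 0, 0, 0)
)
u <- c(first_name = 0.10, dob = 0.08)  # P(agree | non-match), held fixed
m <- c(first_name = 0.80, dob = 0.80)  # P(agree | match), starting guess
lambda <- 0.3                          # P(match), starting guess

# Likelihood of each row's pattern, assuming fields are independent.
pattern_lik <- function(gamma, p) {
  apply(gamma, 1, function(g) prod(ifelse(g == 1, p, 1 - p)))
}

for (iter in 1:50) {
  # E-step: posterior probability that each pair is a match.
  lik_m <- lambda * pattern_lik(gamma, m)
  lik_u <- (1 - lambda) * pattern_lik(gamma, u)
  w <- lik_m / (lik_m + lik_u)
  # M-step: re-estimate m and lambda from the weighted pairs.
  m <- colSums(gamma * w) / sum(w)
  lambda <- mean(w)
}

round(c(m, lambda = lambda), 3)
```

If a column agreed on every training pair, as a blocking column does, its pattern would carry no information to separate matches from non-matches, which is why EM leaves that comparison out.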

You can inspect the parameters at any time. The surname rows below were not updated by the EM run above, which skipped that comparison:

il_weights(model)
#> # A tibble: 8 × 5
#>   comparison gamma_level m_prob u_prob weight
#>   <chr>            <int>  <dbl>  <dbl>  <dbl>
#> 1 first_name           0 0.0114 0.832  -6.18 
#> 2 first_name           1 0.196  0.0632  1.63 
#> 3 first_name           2 0.792  0.105   2.91 
#> 4 surname              0 0.05   0.821  -4.04 
#> 5 surname              1 0.05   0.0368  0.441
#> 6 surname              2 0.9    0.142   2.66 
#> 7 dob                  0 0.280  0.921  -1.72 
#> 8 dob                  1 0.720  0.0789  3.19
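
The weight column above is consistent with the usual Fellegi-Sunter match weight, log2(m_prob / u_prob). The short check below recomputes it from the printed values. This is an inference from the numbers rather than a documented formula, and small differences arise because the printed probabilities are rounded.

```r
# m_prob, u_prob, and weight copied from the il_weights() output above
# (first_name levels 0 to 2, then dob levels 0 to 1).
m_prob  <- c(0.0114, 0.196, 0.792, 0.280, 0.720)
u_prob  <- c(0.832, 0.0632, 0.105, 0.921, 0.0789)
printed <- c(-6.18, 1.63, 2.91, -1.72, 3.19)

# Base-2 log of the likelihood ratio for each level.
recomputed <- log2(m_prob / u_prob)

data.frame(printed, recomputed = round(recomputed, 2))
```

A positive weight means the level is more common among matches than non-matches, and a negative weight means the opposite.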

Step 4: Predict

predict() scores candidate pairs and returns those above a match-probability threshold:

pairs <- predict(model, threshold = 0.5)
head(pairs)
#> # A tibble: 6 × 8
#>   unique_id_l unique_id_r gamma_first_name gamma_surname gamma_dob match_weight
#>         <int>       <int>            <int>         <int>     <int>        <dbl>
#> 1           8          17                1             2         1         7.49
#> 2          10          20                2             2         0         3.86
#> 3           1           2                2             2         1         8.76
#> 4           4          13                2             2         1         8.76
#> 5          10          19                1             2         1         7.49
#> 6           5           6                2             2         1         8.76
#> # ℹ 2 more variables: total_match_weight <dbl>, match_probability <dbl>

Each row is a candidate pair. The output includes the left and right record identifiers, the per-comparison gamma values (the level each comparison reached), the evidence-only match_weight, the prior-inclusive total_match_weight, and the posterior match_probability.
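
The sketch below shows how these three quantities typically relate in a Fellegi-Sunter model: the match weight is the sum of the per-comparison weights, the prior adds a constant on the log2-odds scale, and the probability follows from the total. The prior value is a hypothetical assumption, and the formulas are the standard ones rather than something confirmed for irelink.

```r
# Weights for a pair at levels first_name = 1, surname = 2, dob = 1,
# copied from the il_weights() output.
weights <- c(first_name = 1.63, surname = 2.66, dob = 3.19)
match_weight <- sum(weights)  # evidence only

# Hypothetical prior: 1 in 100 candidate pairs is a true match (assumption).
prior_match_prob <- 0.01
prior_weight <- log2(prior_match_prob / (1 - prior_match_prob))

# Add the prior on the log2-odds scale, then convert odds to a probability.
total_match_weight <- match_weight + prior_weight
match_probability <- 2^total_match_weight / (1 + 2^total_match_weight)

round(c(match_weight = match_weight,
        total_match_weight = total_match_weight,
        match_probability = match_probability), 3)
```

The summed weight of 7.48 differs from the 7.49 printed for the first pair only because the weights used here are rounded to two decimals.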

Step 5: Cluster

il_cluster() resolves pairwise predictions into entity clusters with connected-components analysis:

clusters <- il_cluster(pairs)
head(clusters)
#> # A tibble: 6 × 2
#>   unique_id cluster_id
#>   <chr>     <chr>     
#> 1 20        cluster_10
#> 2 7         cluster_17
#> 3 13        cluster_13
#> 4 8         cluster_17
#> 5 10        cluster_10
#> 6 9         cluster_10

Each record is assigned a cluster_id. Records in the same cluster are treated as the same entity.
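
Connected components can be sketched in base R with simple label propagation: every record starts in its own cluster, and each link pulls both endpoints to the smaller label until nothing changes. The connected_components() helper is hypothetical, and the input uses only the six pairs printed by head(pairs), so the result will not match the full il_cluster() output.

```r
# Assign each id the smallest id reachable through the links.
connected_components <- function(left, right) {
  ids <- sort(unique(c(left, right)))
  label <- ids
  repeat {
    changed <- FALSE
    for (i in seq_along(left)) {
      a <- match(left[i], ids)
      b <- match(right[i], ids)
      low <- min(label[a], label[b])
      if (label[a] != low || label[b] != low) {
        label[a] <- low
        label[b] <- low
        changed <- TRUE
      }
    }
    if (!changed) break  # labels are stable, so components are final
  }
  data.frame(unique_id = ids, cluster_id = label)
}

# The six pairs shown in the predict() output above.
toy_clusters <- connected_components(
  left  = c(8, 10, 1, 4, 10, 5),
  right = c(17, 20, 2, 13, 19, 6)
)
toy_clusters
```

Records 10, 19, and 20 end up together even though 19 and 20 were never compared directly, because both link to record 10.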

Comparison levels

irelink includes comparison-level functions for common field types:

Level	Use case
`cl_exact()`	Binary exact match
`cl_jaro_winkler()`	Names, short strings
`cl_levenshtein()`	General fuzzy strings
`cl_damerau_levenshtein()`	Strings with transpositions
`cl_jaro()`	Lightweight string similarity
`cl_jaccard()`	Token-set overlap
`cl_cosine()`	Embedding similarity
`cl_numeric_diff()`	Numeric fields (e.g., age)
`cl_pct_diff()`	Percentage difference
`cl_date_diff()`	Date fields
`cl_time_diff()`	Time fields
`cl_geo_distance()`	Geographic coordinates
`cl_array_intersect()`	Array or set overlap

For common field types, domain-specific helpers combine multiple levels into a single call:

Helper	Fields
`cl_name()`	Generic name field
`cl_first_last_name()`	First name and last name as separate fields
`cl_forename_surname()`	Forename and surname with transposition
`cl_dob()`	Date of birth
`cl_email()`	Email addresses
`cl_postcode()`	UK postal codes
`cl_zip_code()`	US ZIP codes

Evaluation

If you have labeled data, meaning pairs that are known matches or non-matches, irelink provides tools to assess model quality:

il_accuracy(): overall accuracy at a threshold
il_precision_recall(): precision and recall across thresholds
il_roc(): ROC curve data
il_errors(): inspect false positives and false negatives
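
As a reminder of what precision and recall mean at a given threshold, here is a small base R sketch on made-up labels and scores. The data and the pr_at() helper are hypothetical; the sketch does not reproduce the output format of il_precision_recall().

```r
# Hypothetical labeled pairs: truth is the known status, prob the model score.
truth <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
prob  <- c(0.95, 0.80, 0.40, 0.60, 0.10, 0.05, 0.70, 0.30)

# Precision and recall when pairs at or above the threshold are predicted matches.
pr_at <- function(threshold) {
  pred <- prob >= threshold
  tp <- sum(pred & truth)   # predicted match, truly a match
  fp <- sum(pred & !truth)  # predicted match, truly a non-match
  fn <- sum(!pred & truth)  # missed match
  c(threshold = threshold,
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

pr_table <- t(sapply(c(0.25, 0.5, 0.75), pr_at))
pr_table
```

Raising the threshold trades recall for precision: at 0.25 recall is 1 and precision is 2/3, while at 0.75 precision is 1 and recall is 0.5.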

Cleaning up

When you are done, release the database resources owned by the model with il_cleanup(). If an interactive session has left abandoned models behind, call il_cleanup_all(con) before disconnecting to drop every irelink table on the connection.

il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)