irelink implements the same Fellegi-Sunter probabilistic
record linkage framework as fastLink, but it uses
a different API and a SQL backend. This vignette maps common fastLink
patterns to irelink so you can get started quickly.
Design differences
fastLink bundles data preparation, EM estimation, and matching into a
single fastLink() call. irelink breaks that
work into a pipeline of composable functions where you define a spec,
build a model, estimate parameters, and then predict.
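fastLink’s Jaro-Winkler comparisons produce up to three agreement levels:
cut.a is the lower bound for full agreement and cut.p is the lower bound
for partial agreement. irelink expresses the same cutoffs as
cl_jaro_winkler(high, low), where high corresponds to cut.a and low to cut.p.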
fastLink’s getMatches() assigns a
dedupe.ids column to flag duplicates. irelink
uses il_cluster() instead, which assigns a
cluster_id to each record.
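
To make the idea of a cluster id concrete, here is a minimal base R sketch that groups records by connected components: any two records joined by a chain of accepted pairs receive the same id. The helper cluster_pairs() and the toy ids are illustrative assumptions. This vignette does not say how il_cluster() works internally, so treat this as one plausible reading, not the package's algorithm.

```r
# Toy union-find: one cluster id per connected component of matched pairs.
# cluster_pairs() is a hypothetical helper, not part of irelink or fastLink.
cluster_pairs <- function(ids, left, right) {
  parent <- setNames(ids, ids)            # every record starts as its own root

  find_root <- function(x) {
    while (parent[[x]] != x) x <- parent[[x]]
    x
  }

  for (i in seq_along(left)) {
    a <- find_root(left[i])
    b <- find_root(right[i])
    if (a != b) parent[[b]] <- a          # merge the two components
  }

  roots <- vapply(ids, find_root, character(1), USE.NAMES = FALSE)
  data.frame(id = ids, cluster_id = match(roots, unique(roots)))
}

ids   <- c("r1", "r2", "r3", "r4", "r5")
left  <- c("r1", "r2", "r4")              # accepted pairs: r1-r2, r2-r3, r4-r5
right <- c("r2", "r3", "r5")

clusters <- cluster_pairs(ids, left, right)
length(unique(clusters$cluster_id))
```

Here r1 and r3 share a cluster even though they were never compared directly, which is the practical difference between a per-pair match flag and a per-record cluster id.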
Core workflow
| Step | fastLink | irelink |
|---|---|---|
| Define comparisons | varnames, stringdist.match, partial.match in fastLink() | il_spec() \|> il_compare(...) \|> il_block_on(...) |
| Estimate and match | fastLink(dfA, dfB, ...) | il_model(df, spec, con) \|> il_estimate_u() \|> il_estimate_em(block_on(...)) |
| Set match threshold | threshold.match in fastLink() | threshold in predict() |
| Get matched records | getMatches(dfA, dfB, fl.out) | il_cluster(pairs) |
| Evaluate | manual confusion table | il_cluster_confusion_matrix(model, labels_col, threshold) |
| Block by variable | blockData(dfA, dfB, varnames) + loop | il_block_on(var) in spec |
| Numeric comparison | numeric.match, cut.a.num | cl_numeric_diff(threshold) |
| Inspect parameters | out$EM$patterns.w | autoplot(model, type = 'parameters') |
| Set prevalence prior | Not built in | il_prior_prevalence(model, probability) |
| Save / load model | Not built in | il_save(model, path) / il_load(path) |
Comparison functions
fastLink compares string fields with Jaro-Winkler by default, or with
Levenshtein or Jaro. These produce up to three agreement levels. Each
one maps to a cl_*() function in irelink.
| fastLink | irelink |
|---|---|
| JW (default), cut.a, cut.p | cl_jaro_winkler(high, low) |
| stringdist.method = "jaro", cut.a, cut.p | cl_jaro(high, low) |
| stringdist.method = "lv", cut.a, cut.p | cl_levenshtein(low, high) |
| numeric.match, cut.a.num | cl_numeric_diff(threshold) |
| exact agreement on non-string fields | cl_exact() |
Levenshtein thresholds in irelink are raw edit
distances. They are not renormalized similarity scores as in fastLink.
cl_levenshtein(1, 2) means “distance <= 1 is full
agreement, and distance <= 2 is partial agreement.”
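
As a rough illustration of raw-distance thresholds, the sketch below computes edit distances with base R's utils::adist() and maps them to agreement levels. The helper lev_level() and the level coding (2 = full, 1 = partial, 0 = disagree) are assumptions made for this example; the defaults mirror the cl_levenshtein(1, 2) reading above.

```r
# Raw Levenshtein distances mapped to agreement levels.
# lev_level() is a hypothetical helper, not an irelink function.
lev_level <- function(a, b, full = 1, partial = 2) {
  # adist() returns a distance matrix, so compare element by element.
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  ifelse(d <= full, 2L, ifelse(d <= partial, 1L, 0L))
}

left  <- c("smith", "smith", "smith",  "smith")
right <- c("smith", "smyth", "smythe", "jones")

levels_out <- lev_level(left, right)
levels_out
```

Because the thresholds are raw counts, one edit in a three-letter name is treated the same as one edit in a twelve-letter name. Keep that in mind when you translate cutoffs that were tuned on a normalized scale.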
Key parameters
| fastLink parameter | irelink equivalent |
|---|---|
| cut.a | first argument to cl_jaro_winkler() |
| cut.p | second argument to cl_jaro_winkler() |
| cut.a.num | argument to cl_numeric_diff() |
| threshold.match | threshold in predict() |
| dedupe.matches = FALSE | default, irelink never enforces 1-to-1 matching |
| n.cores | irelink uses DuckDB parallelism automatically |
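
The two cutoffs in the table above split a similarity score into three agreement levels. A minimal base R sketch of that mapping follows. The helper agreement_level() is hypothetical, and treating a score exactly at a cutoff as meeting it is an assumption; neither package's boundary handling is stated in this vignette.

```r
# Map similarity scores in [0, 1] to agreement levels using two cutoffs.
# agreement_level() is a hypothetical helper; 0.94 and 0.84 are the values
# used in this vignette's examples.
agreement_level <- function(sim, high = 0.94, low = 0.84) {
  stopifnot(high >= low)
  # findInterval() counts how many cutoffs each score meets or exceeds:
  # 0 = disagree, 1 = partial agreement, 2 = full agreement.
  findInterval(sim, c(low, high))
}

agreement_level(c(0.99, 0.90, 0.50))
```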
Example: side-by-side deduplication
fastLink:
library(fastLink)

# Passing the same data frame as dfA and dfB runs a deduplication.
out <- fastLink(
  dfA = records,
  dfB = records,
  varnames = c('first_name', 'surname', 'dob'),
  stringdist.match = c('first_name', 'surname'),
  partial.match = c('first_name', 'surname'),
  cut.a = 0.94,
  cut.p = 0.84,
  threshold.match = 0.90,
  dedupe.matches = FALSE
)

# getMatches() adds a dedupe.ids column that flags duplicates.
recordsfL <- getMatches(dfA = records, dfB = records, fl.out = out)
length(unique(recordsfL$dedupe.ids))

irelink:
library(irelink)

con <- DBI::dbConnect(duckdb::duckdb())

# Define the comparisons and blocking rules.
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(surname, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)

# Build the model and estimate its parameters.
model <- il_model(records, spec = spec, con = con) |>
  il_estimate_u() |>
  il_estimate_em(block_on(surname)) |>
  il_estimate_em(block_on(first_name)) |>
  il_prior_prevalence(1e-3)

# Score pairs, then assign a cluster_id to each record.
pairs <- predict(model, threshold = 0.90)
clusters <- il_cluster(pairs)
length(unique(clusters$cluster_id))

il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)

il_prior_prevalence() replaces the training-driven prior
with a population-level baseline. This is similar to resetting the prior
after EM in fastLink workflows that use a heavily blocked training
sample, and you can skip it if your training data is large and
representative.
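
To see why the prior matters, the base R sketch below applies Bayes' rule to a single pair under two different priors. The Bayes factor of 1000 and the prior of 0.2 are made-up numbers, and match_probability() is a hypothetical helper, not an irelink function.

```r
# Posterior match probability from a prior and a combined Bayes factor.
# match_probability() is a hypothetical helper for illustration only.
match_probability <- function(prior, bayes_factor) {
  posterior_odds <- prior / (1 - prior) * bayes_factor
  posterior_odds / (1 + posterior_odds)
}

bayes_factor <- 1000  # illustrative evidence from the field comparisons

p_blocked    <- match_probability(prior = 0.2,  bayes_factor)  # prior learned inside tight blocks
p_population <- match_probability(prior = 1e-3, bayes_factor)  # population-level baseline

c(blocked = p_blocked, population = p_population)
```

With identical evidence, the pair clears a 0.90 threshold under the inflated prior but not under the population-level one, which is the kind of overconfidence that resetting the prior is meant to avoid.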
Blocking
fastLink’s blockData() partitions records into groups
and requires running fastLink() separately within each
block before combining the results. In irelink, you declare
blocking in the spec with il_block_on(), and the package
applies those rules automatically so you do not need a manual loop.
fastLink:
# Partition records by surname, then link within each block.
blocks <- blockData(records, records, varnames = 'surname')
results <- list()
for (j in seq_along(blocks)) {
  sub <- records[blocks[[j]]$dfA.inds, ]
  out_b <- fastLink(dfA = sub, dfB = sub, ...)
  sub <- getMatches(dfA = sub, dfB = sub, fl.out = out_b)
  # Prefix ids with the block number so they stay unique across blocks.
  sub$dedupe.ids <- paste0('B', j, '_', sub$dedupe.ids)
  results[[j]] <- sub
}
combined <- do.call('rbind', results)

irelink:
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(surname, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname)

fastLink also offers k-means blocking through
blockData(..., kmeans.block = ..., nclusters = ...).
irelink does not include a built-in k-means blocking step
because the data lives in a SQL backend. For numeric fields, the closest
equivalent is il_block_on() with pre-bucketed values.
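
One way to pre-bucket a numeric field is to round it down to a fixed width before loading the data, as in the base R sketch below. The data frame, the age column, and the 5-year width are illustrative assumptions, not part of either package.

```r
# Pre-bucket a numeric field so it can serve as an exact-match blocking key.
# The data, column names, and bucket width are made up for illustration.
people <- data.frame(
  id  = 1:6,
  age = c(21, 23, 24, 25, 38, 41)
)
width <- 5
people$age_bucket <- floor(people$age / width) * width

# Candidate pairs within buckets versus all possible pairs.
pairs_all     <- choose(nrow(people), 2)
pairs_blocked <- sum(choose(as.vector(table(people$age_bucket)), 2))

c(all = pairs_all, blocked = pairs_blocked)
```

The bucketed column could then be named in il_block_on(). Note that ages 24 and 25 fall into different buckets, so fixed-width buckets can separate near-identical values at the edges; a second blocking rule on another field is one way to recover those pairs.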
Model inspection
fastLink exposes learned parameters through
out$EM$patterns.w, a table of agreement patterns and
Fellegi-Sunter weights. irelink provides the same
information visually through autoplot(model, type = 'parameters').
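
As a reminder of what these parameters represent, the sketch below computes a Fellegi-Sunter weight for each agreement level of one field as the log ratio of its m and u probabilities. The probabilities are made-up numbers, and the base-2 logarithm is an assumption for this example; tools differ in the log base they report.

```r
# Fellegi-Sunter weight per agreement level: log2(m / u).
# The m and u values below are invented for illustration only.
params <- data.frame(
  field = "surname",
  level = c("full", "partial", "disagree"),
  m     = c(0.90, 0.08, 0.02),  # P(level | records match)
  u     = c(0.01, 0.04, 0.95)   # P(level | records do not match)
)
params$weight <- log2(params$m / params$u)
params
```

Positive weights are evidence for a match and negative weights are evidence against one. Summing the weights across fields gives the overall evidence for a pair under the usual conditional-independence assumption.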
Evaluation
fastLink requires you to build a confusion table by hand from
dedupe.ids and a ground-truth column. irelink
provides il_cluster_confusion_matrix(), which does this
directly from the model.
fastLink:
recordsfL$dupTrue <- ifelse(duplicated(recordsfL$cluster), 'Duplicated', 'Not duplicated')
recordsfL$dupfL <- ifelse(duplicated(recordsfL$dedupe.ids), 'Duplicated', 'Not duplicated')
confusion <- table('fastLink' = recordsfL$dupfL, 'True' = recordsfL$dupTrue)

irelink:
acc <- il_cluster_confusion_matrix(model, labels_col = 'cluster', threshold = 0.90)

For a full accuracy or precision-recall curve across all thresholds,
use il_accuracy() and
il_precision_recall().
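
For intuition about what a precision-recall curve reports, the base R sketch below computes precision and recall at two thresholds from toy scores and ground-truth labels. The scores, labels, and the helper pr_at() are illustrative assumptions and say nothing about the output format of the irelink functions.

```r
# Precision and recall at a given threshold, using toy pair scores.
# pr_at() is a hypothetical helper; the data are invented for illustration.
scores <- c(0.99, 0.95, 0.80, 0.60, 0.40, 0.10)
truth  <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

pr_at <- function(threshold, scores, truth) {
  predicted <- scores >= threshold
  tp <- sum(predicted & truth)
  c(threshold = threshold,
    precision = tp / sum(predicted),  # NaN if nothing is predicted as a match
    recall    = tp / sum(truth))
}

pr_curve <- t(sapply(c(0.5, 0.9), pr_at, scores = scores, truth = truth))
pr_curve
```

Raising the threshold from 0.5 to 0.9 trades recall for precision in this toy data, which is the trade-off a full curve lets you inspect before fixing a threshold.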
