irelink implements the same Fellegi-Sunter probabilistic record linkage framework as fastLink, but it uses a different API and a SQL backend. This vignette maps common fastLink patterns to irelink so you can get started quickly.

Design differences

fastLink bundles data preparation, EM estimation, and matching into a single fastLink() call. irelink breaks that work into a pipeline of composable functions where you define a spec, build a model, estimate parameters, and then predict.

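fastLink’s Jaro-Winkler comparisons produce up to three agreement levels (full, partial, and none). cut.a is the similarity threshold for full agreement and cut.p is the threshold for partial agreement. irelink expresses the same thresholds as cl_jaro_winkler(high, low), where high corresponds to cut.a and low to cut.p.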
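
As a rough illustration of the three-level scheme, the sketch below classifies precomputed similarity scores against a high and a low threshold. agreement_level() is a hypothetical helper, not part of fastLink or irelink, and treating both thresholds as inclusive lower bounds is an assumption.

```r
# Hypothetical helper (not part of fastLink or irelink): map similarity
# scores in [0, 1] to three agreement levels.
# Assumption: both thresholds are inclusive lower bounds.
agreement_level <- function(sim, high = 0.94, low = 0.84) {
  stopifnot(high >= low)
  ifelse(sim >= high, "full", ifelse(sim >= low, "partial", "none"))
}

# Made-up similarity scores for three candidate pairs.
levels_out <- agreement_level(c(0.97, 0.90, 0.50))
print(levels_out)
```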

fastLink’s getMatches() assigns a dedupe.ids column to flag duplicates. irelink uses il_cluster() instead, which assigns a cluster_id to each record.
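
To make the idea of a per-record cluster_id concrete, here is a minimal base R sketch that turns pairwise matches into cluster labels by treating matched pairs as edges and labelling connected components. This is an assumption about the general technique, not a description of how il_cluster() is implemented; the pairs data frame and its id_l and id_r columns are hypothetical.

```r
# Hypothetical pairwise matches between record ids (not irelink output).
pairs <- data.frame(id_l = c(1, 2, 5), id_r = c(2, 3, 6))
ids <- 1:7

# Label propagation: every record starts in its own cluster, then each
# matched pair repeatedly adopts the smaller of its two labels until
# nothing changes.
cluster_id <- setNames(ids, ids)
repeat {
  before <- cluster_id
  for (k in seq_len(nrow(pairs))) {
    l <- as.character(pairs$id_l[k])
    r <- as.character(pairs$id_r[k])
    lowest <- min(cluster_id[l], cluster_id[r])
    cluster_id[l] <- lowest
    cluster_id[r] <- lowest
  }
  if (identical(before, cluster_id)) break
}

# Number of distinct entities implied by the matches.
n_clusters <- length(unique(cluster_id))
```

Here records 1, 2, and 3 end up in one cluster even though 1 and 3 were never directly paired.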

Core workflow

Step                  fastLink                                                 irelink
Define comparisons    varnames, stringdist.match, partial.match in fastLink()  il_spec() |>
                                                                               il_compare(...) |>
                                                                               il_block_on(...)
Estimate and match    fastLink(dfA, dfB, ...)                                  il_model(df, spec, con) |>
                                                                               il_estimate_u() |>
                                                                               il_estimate_em(block_on(...))
Set match threshold   threshold.match in fastLink()                            threshold in predict()
Get matched records   getMatches(dfA, dfB, fl.out)                             il_cluster(pairs)
Evaluate              manual confusion table                                   il_cluster_confusion_matrix(model, labels_col, threshold)
Block by variable     blockData(dfA, dfB, varnames) + loop                     il_block_on(var) in spec
Numeric comparison    numeric.match, cut.a.num                                 cl_numeric_diff(threshold)
Inspect parameters    out$EM$patterns.w                                        autoplot(model, type = 'parameters')
Set prevalence prior  Not built in                                             il_prior_prevalence(model, probability)
Save / load model     Not built in                                             il_save(model, path) / il_load(path)

Comparison functions

fastLink compares string fields with Jaro-Winkler by default, or with Levenshtein or Jaro. These produce up to three agreement levels. Each one maps to a cl_*() function in irelink.

fastLink                                  irelink
JW (default), cut.a, cut.p                cl_jaro_winkler(high, low)
stringdist.method = "jaro", cut.a, cut.p  cl_jaro(high, low)
stringdist.method = "lv", cut.a, cut.p    cl_levenshtein(low, high)
numeric.match, cut.a.num                  cl_numeric_diff(threshold)
exact agreement on non-string fields      cl_exact()

Levenshtein thresholds in irelink are raw edit distances, not the normalized similarity scores that fastLink uses. Because smaller distances mean closer matches, the full-agreement cutoff comes first: cl_levenshtein(1, 2) means “distance <= 1 is full agreement, and distance <= 2 is partial agreement.”
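
The difference between raw distances and normalized scores can be sketched with base R's adist(), which computes a generalized Levenshtein distance. The name pairs are made up, and the length normalization shown for contrast is one common convention, assumed here for illustration rather than taken from fastLink's source.

```r
# Hypothetical name pairs.
a <- c("smith", "smith", "smith")
b <- c("smyth", "smythe", "jones")

# Raw edit distance for each pair (element-wise, not the full cross matrix).
d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)

# Raw-distance levels in the style of cl_levenshtein(1, 2).
raw_level <- ifelse(d <= 1, "full", ifelse(d <= 2, "partial", "none"))

# One common length normalization, shown only for contrast (an assumption,
# not necessarily fastLink's exact formula).
sim <- 1 - d / pmax(nchar(a), nchar(b))

data.frame(a, b, distance = d, raw_level, similarity = round(sim, 3))
```

A raw cutoff treats a one-character edit the same way in a short name and a long one, whereas a normalized score penalizes it more in the short name, so thresholds do not translate one-to-one between the two scales.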

Key parameters

fastLink parameter  irelink equivalent
cut.a               first argument to cl_jaro_winkler()
cut.p               second argument to cl_jaro_winkler()
cut.a.num           argument to cl_numeric_diff()
threshold.match     threshold in predict()
dedupe = FALSE      default, irelink never enforces 1-to-1 matching
n.cores             irelink uses DuckDB parallelism automatically

Example: side-by-side deduplication

fastLink:

library(fastLink)

out <- fastLink(
  dfA = records,
  dfB = records,
  varnames = c('first_name', 'surname', 'dob'),
  stringdist.match = c('first_name', 'surname'),
  partial.match = c('first_name', 'surname'),
  cut.a = 0.94,
  cut.p = 0.84,
  threshold.match = 0.90,
  dedupe = FALSE
)

recordsfL <- getMatches(dfA = records, dfB = records, fl.out = out)
length(unique(recordsfL$dedupe.ids))

irelink:

library(irelink)

con <- DBI::dbConnect(duckdb::duckdb())

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(surname, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)

model <- il_model(fake_1000, spec = spec, con = con) |>
  il_estimate_u() |>
  il_estimate_em(block_on(surname)) |>
  il_estimate_em(block_on(first_name)) |>
  il_prior_prevalence(1e-3)

pairs <- predict(model, threshold = 0.90)
clusters <- il_cluster(pairs)
length(unique(clusters$cluster_id))

il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)

il_prior_prevalence() replaces the training-driven prior with a population-level baseline. This is similar to resetting the prior after EM in fastLink workflows that train on a heavily blocked sample. You can skip it if your training data is large and representative.
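
To see why the prior matters, the sketch below applies the standard Fellegi-Sunter posterior formula to one agreement pattern under two different priors. The m and u values are invented for illustration, conditional independence across fields is assumed, and posterior() is a hypothetical helper rather than an irelink function.

```r
# Hypothetical m and u probabilities for a pair that agrees on two fields
# (illustrative numbers, not estimates from any dataset).
m <- c(first_name = 0.90, surname = 0.90)  # P(agree | match)
u <- c(first_name = 0.01, surname = 0.05)  # P(agree | non-match)

# Posterior match probability, assuming fields are conditionally independent.
posterior <- function(prior, m, u) {
  prior * prod(m) / (prior * prod(m) + (1 - prior) * prod(u))
}

p_population <- posterior(1e-3, m, u)  # population-level prior
p_blocked <- posterior(0.2, m, u)      # inflated prior, as a blocked sample might give
c(population = p_population, blocked = p_blocked)
```

With these numbers the same evidence falls below a 0.90 threshold under the population-level prior and clears it under the inflated one, which is the kind of shift a prevalence reset is meant to correct.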

Blocking

fastLink’s blockData() partitions records into groups and requires running fastLink() separately within each block before combining the results. In irelink, you declare blocking in the spec with il_block_on(), and the package applies those rules automatically so you do not need a manual loop.

fastLink:

blocks <- blockData(records, records, varnames = 'surname')

results <- list()
for (j in seq_along(blocks)) {
  sub <- records[blocks[[j]]$dfA.inds, ]
  out_b <- fastLink(dfA = sub, dfB = sub, ...)
  sub <- getMatches(dfA = sub, dfB = sub, fl.out = out_b)
  sub$dedupe.ids <- paste0('B', j, '_', sub$dedupe.ids)
  results[[j]] <- sub
}
combined <- do.call('rbind', results)

irelink:

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(surname, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname)
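
The effect of a blocking rule on the number of candidate pairs can be counted directly in base R. This sketch uses a hypothetical six-row data frame and only counts pairs; it says nothing about how irelink generates them in the SQL backend.

```r
# Hypothetical toy data; the column name mirrors the spec above.
records <- data.frame(
  id = 1:6,
  surname = c("smith", "smith", "smith", "jones", "jones", "lee")
)

# Without blocking, every unordered pair of records is a candidate.
all_pairs <- choose(nrow(records), 2)

# Blocking on surname keeps only pairs that share a surname.
block_sizes <- as.integer(table(records$surname))
blocked_pairs <- sum(choose(block_sizes, 2))

c(all_pairs = all_pairs, blocked_pairs = blocked_pairs)
```

The trade-off is that a true match whose surname is misspelled never becomes a candidate under this rule, which is one reason the deduplication example declares a second rule on first_name.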

fastLink also offers k-means blocking through blockData(..., kmeans.block = ..., nclusters = ...). irelink does not include a built-in k-means blocking step because the data lives in a SQL backend. For numeric fields, the closest equivalent is il_block_on() with pre-bucketed values.
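
A minimal sketch of pre-bucketing, assuming a numeric age column: the data frame and the age and age_bucket names are hypothetical, and the 10-year band width is an arbitrary choice.

```r
# Hypothetical numeric column; `age` and `age_bucket` are illustrative names.
records <- data.frame(id = 1:5, age = c(23, 27, 31, 38, 64))

# Bucket ages into 10-year bands so that exact equality on the bucket
# can serve as a blocking key.
records$age_bucket <- floor(records$age / 10) * 10
records
```

The bucketed column could then be named in il_block_on() like any other field. Records on either side of a bucket boundary (29 and 30, for example) land in different blocks, so a second rule on another field is a reasonable safeguard.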

Model inspection

fastLink exposes learned parameters through out$EM$patterns.w, a table of agreement patterns and Fellegi-Sunter weights. irelink provides the same information visually.

autoplot(model)
autoplot(model, type = 'parameters')
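
For readers who want to relate such output to numbers, the sketch below computes Fellegi-Sunter match weights from m and u probabilities. The values are hypothetical, and the log base is a convention; base 2 is an assumption here and may differ from what either package reports.

```r
# Hypothetical m and u probabilities for the full-agreement level of
# each comparison (illustrative values only).
params <- data.frame(
  field = c("first_name", "surname", "dob"),
  m = c(0.90, 0.90, 0.95),
  u = c(0.01, 0.05, 0.001)
)

# Match weight on the log2 scale: positive values favor a match.
params$weight <- log2(params$m / params$u)

# Most informative comparison first.
params[order(-params$weight), ]
```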

Evaluation

fastLink requires you to build a confusion table by hand from dedupe.ids and a ground-truth column. irelink provides il_cluster_confusion_matrix(), which does this directly from the model.

fastLink:

recordsfL$dupTrue <- ifelse(duplicated(recordsfL$cluster), 'Duplicated', 'Not duplicated')
recordsfL$dupfL <- ifelse(duplicated(recordsfL$dedupe.ids), 'Duplicated', 'Not duplicated')
confusion <- table('fastLink' = recordsfL$dupfL, 'True' = recordsfL$dupTrue)

irelink:

acc <- il_cluster_confusion_matrix(model, labels_col = 'cluster', threshold = 0.90)

For a full accuracy curve or precision-recall curve across all thresholds, use il_accuracy() or il_precision_recall(), respectively.
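
When working from a hand-built confusion table like the fastLink one above, precision and recall at a single threshold can be read off as follows. The predicted and true labels are made up for five records, and the explicit factor levels are there so the table keeps both categories even if one is absent.

```r
# Hypothetical labels for five records, using the same two categories as above.
lv <- c("Duplicated", "Not duplicated")
pred <- factor(c("Duplicated", "Duplicated", "Not duplicated",
                 "Not duplicated", "Duplicated"), levels = lv)
truth <- factor(c("Duplicated", "Not duplicated", "Not duplicated",
                  "Duplicated", "Duplicated"), levels = lv)

confusion <- table(Predicted = pred, True = truth)

tp <- confusion["Duplicated", "Duplicated"]      # flagged and truly duplicated
fp <- confusion["Duplicated", "Not duplicated"]  # flagged but not duplicated
fn <- confusion["Not duplicated", "Duplicated"]  # missed duplicates

precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
c(precision = precision, recall = recall)
```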