Translating from fastLink

irelink implements the same Fellegi-Sunter probabilistic record linkage framework as fastLink, but it uses a different API and a SQL backend. This vignette maps common fastLink patterns to irelink so you can get started quickly.

Design differences

fastLink bundles data preparation, EM estimation, and matching into a single fastLink() call. irelink splits that work into a pipeline of composable functions: you define a spec, build a model, estimate parameters, and then predict.

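fastLink’s Jaro-Winkler comparisons produce three agreement levels (full, partial, and disagree), with cut.a and cut.p setting the full and partial agreement thresholds. irelink expresses the same thresholds as cl_jaro_winkler(high, low), where high corresponds to cut.a and low to cut.p.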
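The thresholding described above can be sketched in base R. This is only an illustration: agreement_level() is a hypothetical helper that belongs to neither package, and treating the cut points as inclusive lower bounds is an assumption.

```r
# Map similarity scores in [0, 1] to three agreement levels.
# `agreement_level` is a hypothetical helper, not a fastLink or irelink function.
agreement_level <- function(sim, high = 0.94, low = 0.84) {
  stopifnot(high >= low)
  # findInterval() counts how many cut points are <= sim:
  # 0 = disagree, 1 = partial agreement, 2 = full agreement
  findInterval(sim, c(low, high))
}

agreement_level(c(0.99, 0.90, 0.50))
```

With these cut points, a score of 0.90 lands in the partial band, which is the case the second threshold exists to capture.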

fastLink’s getMatches() assigns a dedupe.ids column to flag duplicates. irelink uses il_cluster() instead, which assigns a cluster_id to each record.

Core workflow

Step	fastLink	irelink
Define comparisons	`varnames`, `stringdist.match`, `partial.match` in `fastLink()`	`il_spec() |> il_compare(...) |> il_block_on(...)`
Estimate and match	`fastLink(dfA, dfB, ...)`	`il_model(df, spec, con) |> il_estimate_u() |> il_estimate_em(block_on(...))`
Set match threshold	`threshold.match` in `fastLink()`	`threshold` in `predict()`
Get matched records	`getMatches(dfA, dfB, fl.out)`	`il_cluster(pairs)`
Evaluate	manual confusion table	`il_cluster_confusion_matrix(model, labels_col, threshold)`
Block by variable	`blockData(dfA, dfB, varnames)` + loop	`il_block_on(var)` in spec
Numeric comparison	`numeric.match`, `cut.a.num`	`cl_numeric_diff(threshold)`
Inspect parameters	`out$EM$patterns.w`	`autoplot(model, type = 'parameters')`
Set prevalence prior	Not built in	`il_prior_prevalence(model, probability)`
Save / load model	Not built in	`il_save(model, path)` / `il_load(path)`

Comparison functions

fastLink compares string fields with Jaro-Winkler by default, or with Levenshtein or Jaro. These produce up to three agreement levels. Each one maps to a cl_*() function in irelink.

fastLink	irelink
JW (default), `cut.a`, `cut.p`	`cl_jaro_winkler(high, low)`
`stringdist.method = "jaro"`, `cut.a`, `cut.p`	`cl_jaro(high, low)`
`stringdist.method = "lv"`, `cut.a`, `cut.p`	`cl_levenshtein(low, high)`
`numeric.match`, `cut.a.num`	`cl_numeric_diff(threshold)`
exact agreement on non-string fields	`cl_exact()`

Levenshtein thresholds in irelink are raw edit distances, not the renormalized similarity scores that fastLink uses, so cut.a and cut.p values cannot be copied over directly. For example, cl_levenshtein(1, 2) means “distance <= 1 is full agreement, and distance <= 2 is partial agreement.”
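To see the difference between the two scales, the sketch below computes both for one pair of strings using base R only. The normalization shown (one minus distance over the longer string length) is an assumption about a typical rescaling, not a statement of fastLink's exact formula.

```r
# Raw edit distance versus a length-normalized similarity, in base R.
a <- "smith"
b <- "smyth"

# utils::adist() returns the generalized Levenshtein distance as a matrix;
# drop() reduces the 1 x 1 result to a plain number.
d <- drop(utils::adist(a, b))

# Assumed normalization: 1 - distance / length of the longer string.
sim <- 1 - d / max(nchar(a), nchar(b))

d    # raw distance, the scale cl_levenshtein() thresholds are written in
sim  # normalized similarity, the scale cut.a and cut.p are written in
```

A fixed similarity cut-off tolerates more edits in long strings than in short ones, whereas a raw distance threshold allows the same number of edits regardless of length.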

Key parameters

fastLink parameter	irelink equivalent
`cut.a`	first argument to `cl_jaro_winkler()`
`cut.p`	second argument to `cl_jaro_winkler()`
`cut.a.num`	argument to `cl_numeric_diff()`
`threshold.match`	`threshold` in `predict()`
`dedupe.matches = FALSE`	default; irelink never enforces 1-to-1 matching
`n.cores`	irelink uses DuckDB parallelism automatically

Example: side-by-side deduplication

fastLink:

library(fastLink)

out <- fastLink(
  dfA = records,
  dfB = records,
  varnames = c('first_name', 'surname', 'dob'),
  stringdist.match = c('first_name', 'surname'),
  partial.match = c('first_name', 'surname'),
  cut.a = 0.94,
  cut.p = 0.84,
  threshold.match = 0.90,
  dedupe.matches = FALSE
)

recordsfL <- getMatches(dfA = records, dfB = records, fl.out = out)
length(unique(recordsfL$dedupe.ids))

irelink:

library(irelink)

con <- DBI::dbConnect(duckdb::duckdb())

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(surname, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)

model <- il_model(records, spec = spec, con = con) |>
  il_estimate_u() |>
  il_estimate_em(block_on(surname)) |>
  il_estimate_em(block_on(first_name)) |>
  il_prior_prevalence(1e-3)

pairs <- predict(model, threshold = 0.90)
clusters <- il_cluster(pairs)
length(unique(clusters$cluster_id))

il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)

il_prior_prevalence() replaces the training-driven prior with a population-level baseline. This is similar to resetting the prior after EM in fastLink workflows that train on a heavily blocked sample. You can skip it if your training data is large and representative.
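The effect of the prior can be illustrated with the standard Fellegi-Sunter relationship between prior odds, a Bayes factor, and the posterior match probability. This is a minimal sketch: match_probability() is a hypothetical helper, the numbers are made up, and how irelink applies the prior internally is not shown here.

```r
# Posterior match probability from a prior and a combined Bayes factor.
# `match_probability` is a hypothetical helper for illustration only.
match_probability <- function(prior, bayes_factor) {
  prior_odds <- prior / (1 - prior)
  posterior_odds <- prior_odds * bayes_factor
  posterior_odds / (1 + posterior_odds)
}

bf <- 5000  # made-up combined Bayes factor for a single record pair

p_blocked <- match_probability(prior = 0.05, bayes_factor = bf)  # prior learned inside blocks
p_population <- match_probability(prior = 1e-3, bayes_factor = bf)  # population-level prior

c(blocked = p_blocked, population = p_population)
```

The same evidence yields a lower match probability under the smaller prior, which is why a prior learned from a heavily blocked sample can overstate matches when applied to all pairs.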

Blocking

fastLink’s blockData() partitions records into groups and requires running fastLink() separately within each block before combining the results. In irelink, you declare blocking in the spec with il_block_on(), and the package applies those rules automatically so you do not need a manual loop.

fastLink:

blocks <- blockData(records, records, varnames = 'surname')

results <- list()
for (j in seq_along(blocks)) {
  sub <- records[blocks[[j]]$dfA.inds, ]
  out_b <- fastLink(dfA = sub, dfB = sub, ...)
  sub <- getMatches(dfA = sub, dfB = sub, fl.out = out_b)
  sub$dedupe.ids <- paste0('B', j, '_', sub$dedupe.ids)
  results[[j]] <- sub
}
combined <- do.call('rbind', results)

irelink:

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(surname, cl_jaro_winkler(0.94, 0.84)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname)

fastLink also offers k-means blocking through blockData(..., kmeans.block = ..., nclusters = ...). irelink does not include a built-in k-means blocking step because the data lives in a SQL backend. For numeric fields, the closest equivalent is il_block_on() with pre-bucketed values.
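Pre-bucketing a numeric field can be done in base R before the data is handed to the model. The sketch below assumes a hypothetical data frame records_num with an age column and uses 10-year bands; the column names and band width are illustrative choices.

```r
# Hypothetical data frame with a numeric field to block on.
records_num <- data.frame(id = 1:5, age = c(23, 27, 41, 44, 68))

# Round each age down to the start of its 10-year band so that
# records in the same band share an identical blocking key.
records_num$age_band <- floor(records_num$age / 10) * 10

records_num
```

The new age_band column could then be named in il_block_on() like any other blocking variable. Records on either side of a band boundary (for example 29 and 30) are never compared under this rule, so it is usually worth pairing it with a second blocking rule.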

Model inspection

fastLink exposes learned parameters through out$EM$patterns.w, a table of agreement patterns and Fellegi-Sunter weights. irelink presents the same information visually through autoplot().

autoplot(model)
autoplot(model, type = 'parameters')
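For readers who want to relate either output to the underlying quantities, a Fellegi-Sunter weight is the log of the ratio between the m and u probabilities for an agreement level. The sketch below uses made-up probabilities and the natural log; the log base is a convention and may differ from what either package reports.

```r
# Made-up m and u probabilities for one comparison with three levels.
params <- data.frame(
  level = c("full", "partial", "disagree"),
  m = c(0.85, 0.10, 0.05),  # P(level | pair is a match)
  u = c(0.01, 0.04, 0.95)   # P(level | pair is not a match)
)

# Fellegi-Sunter weight: log of m / u (natural log assumed here).
params$weight <- log(params$m / params$u)

params
```

Positive weights are evidence for a match and negative weights are evidence against one.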

Evaluation

fastLink requires you to build a confusion table by hand from dedupe.ids and a ground-truth column. irelink provides il_cluster_confusion_matrix(), which does this directly from the model.

fastLink:

recordsfL$dupTrue <- ifelse(duplicated(recordsfL$cluster), 'Duplicated', 'Not duplicated')
recordsfL$dupfL <- ifelse(duplicated(recordsfL$dedupe.ids), 'Duplicated', 'Not duplicated')
confusion <- table('fastLink' = recordsfL$dupfL, 'True' = recordsfL$dupTrue)

irelink:

acc <- il_cluster_confusion_matrix(model, labels_col = 'cluster', threshold = 0.90)

For a full accuracy or precision-recall curve across all thresholds, use il_accuracy() or il_precision_recall(), respectively.
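As a rough picture of what a precision-recall curve summarizes, the base R sketch below sweeps a few thresholds over made-up pair scores with known labels. The data, the pr_at() helper, and the chosen thresholds are all hypothetical, and this is not how the irelink functions are implemented.

```r
# Made-up match probabilities and ground-truth labels for six pairs.
scores <- c(0.99, 0.95, 0.80, 0.60, 0.40, 0.10)
truth <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Precision and recall at one threshold (hypothetical helper).
# Precision is NaN if no pair clears the threshold.
pr_at <- function(threshold) {
  predicted <- scores >= threshold
  tp <- sum(predicted & truth)
  c(threshold = threshold,
    precision = tp / sum(predicted),
    recall = tp / sum(truth))
}

# One row per threshold.
pr_curve <- as.data.frame(t(sapply(c(0.9, 0.7, 0.5), pr_at)))
pr_curve
```

Lowering the threshold admits more pairs, which raises recall here but does not raise precision monotonically.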