Deduplicating 50k Synthetic Records

This vignette reproduces the Splink “Deduplicate 50k synthetic” demo in irelink. The data is based on historical people scraped from Wikidata and contains duplicate records with realistic errors such as typos, missing values, and swapped fields. The cluster column holds the ground-truth entity label for each record and is used in the evaluation sections.

This vignette requires nanoparquet to read the remote Parquet file. It is only built when both nanoparquet and the data URL are available.

Load the data

library(irelink)
library(ggplot2)

df

Profile the data

Use completeness and value distributions to choose blocking rules and comparisons:

con <- DBI::dbConnect(duckdb::duckdb())

df |>
  il_completeness(con = con) |>
  autoplot()

il_profile(df, first_name, surname, dob, birth_place, con = con, top_n = 8)

Choose blocking rules

il_suggest_blocking(df, con = con)

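Count the candidate pairs that a set of blocking rules generates before committing to it. In the output, the cumulative_pairs column is a running total: the number of unique pairs produced by the rules up to and including that row, so a pair already found by an earlier rule is not counted again: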

il_count_pairs(
  df,
  block_on(surname, dob),
  block_on(first_name, dob),
  block_on(first_name, surname),
  block_on(dob, birth_place),
  con = con
)
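The idea behind a cumulative pair count can be sketched in base R on a toy table. This is only an illustration of the counting logic, not how irelink computes it; the toy data and the helper pairs_within() are invented for the example.

```r
# Toy records; column names mirror the vignette, values are invented.
toy <- data.frame(
  id         = 1:5,
  first_name = c("ann", "ann", "bob", "bob", "ann"),
  surname    = c("lee", "lee", "ray", "roy", "lee"),
  dob        = c("1900", "1900", "1850", "1850", "1901"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all unordered id pairs that share the same key.
pairs_within <- function(ids, key) {
  groups <- split(ids, key)
  groups <- groups[lengths(groups) > 1]
  as.character(unlist(lapply(groups, function(g) {
    apply(combn(sort(g), 2), 2, paste, collapse = "-")
  }), use.names = FALSE))
}

# Each rule blocks on exact agreement of the listed columns.
rules <- list(
  c("surname", "dob"),
  c("first_name", "dob"),
  c("first_name", "surname")
)

seen <- character(0)
pair_counts <- data.frame()
for (cols in rules) {
  key <- do.call(paste, c(unname(as.list(toy[cols])), sep = "|"))
  p <- pairs_within(toy$id, key)
  seen <- union(seen, p)  # pairs found by an earlier rule are not re-counted
  pair_counts <- rbind(pair_counts, data.frame(
    rule             = paste(cols, collapse = " + "),
    pairs            = length(p),
    cumulative_pairs = length(seen),
    stringsAsFactors = FALSE
  ))
}

pair_counts
```

In this toy table the second rule generates two pairs but adds only one new pair to the running total, because the pair of records 1 and 2 was already produced by the first rule.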

Define the specification

Apply term-frequency adjustment to birth_place and occupation so common values such as “London” receive less weight than rare ones:

spec <- il_spec() |>
  il_compare(first_name, cl_name()) |>
  il_compare(surname, cl_name()) |>
  il_compare(dob, cl_dob()) |>
  il_compare(postcode_fake, cl_postcode()) |>
  il_compare(birth_place, cl_exact(term_frequency = TRUE)) |>
  il_compare(occupation, cl_exact(term_frequency = TRUE)) |>
  il_block_on(first_name ~ il_substr(1, 3), surname ~ il_substr(1, 4)) |>
  il_block_on(surname, dob) |>
  il_block_on(first_name, dob) |>
  il_block_on(postcode_fake, first_name) |>
  il_block_on(postcode_fake, surname) |>
  il_block_on(dob, birth_place) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), dob) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), first_name) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), surname) |>
  il_block_on(
    first_name ~ il_substr(1, 2),
    surname ~ il_substr(1, 2),
    dob ~ il_substr(1, 4)
  )

spec
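To see why term-frequency adjustment matters, the following base R sketch computes an agreement weight per value on invented data. It assumes the approach Splink documents, where the u probability for an exact match on a given value is replaced by that value's relative frequency; whether irelink computes the adjustment in exactly this way is an assumption, and the m probability of 0.9 is made up for the example.

```r
# Invented birth places: "London" is common, "Truro" is rare.
birth_place <- c(rep("London", 6), rep("Paris", 3), "Truro")

# Relative frequency of each value (its term frequency).
counts <- table(birth_place)
tf <- setNames(as.numeric(counts) / sum(counts), names(counts))

# Assumed m probability: P(values agree | records are a true match).
m <- 0.9

# Fellegi-Sunter agreement weight log2(m / u), with u replaced by the
# value-specific term frequency (assumption, following Splink's description).
weights <- log2(m / tf)

round(weights, 2)
```

Agreement on the rare value yields a larger weight than agreement on the common one, which is the behaviour the adjustment is meant to produce.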

Train the model

model <- df |>
  il_model(spec = spec, con = con) |>
  il_estimate_prior(
    block_on(first_name, surname, dob),
    block_on(dob, postcode_fake),
    recall = 0.6
  ) |>
  il_estimate_u(max_pairs = 5e6) |>
  il_estimate_em(block_on(first_name, surname)) |>
  il_estimate_em(block_on(dob))

Inspect the trained model

summary(model)

autoplot(model)

autoplot(model, type = 'parameters')

autoplot(il_unlinkables(model))

Predict

predictions <- predict(model, threshold = 0.5)
predictions

autoplot(predictions)

autoplot(predictions, which = 1)

Cluster

clusters <- il_cluster(predictions, threshold = 0.95)
clusters

Evaluate against ground truth

acc <- il_accuracy(model, labels_col = 'cluster')
acc

When you use labels_col, the evaluation derives all true duplicate pairs from the ground-truth cluster column. Some true pairs may never be generated by the blocking rules, and those pairs count as false negatives at every threshold. As a result, recall in the accuracy, ROC, and precision-recall plots is capped at the blocking recall, which you can read off at the lowest threshold:

acc0 <- acc[acc$threshold == min(acc$threshold), ]
acc0$tp / (acc0$tp + acc0$fn)
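The recall ceiling can be illustrated on a toy table in base R. This sketch only mirrors the reasoning above and does not call irelink; the data and the helper pairs_within() are invented for the example.

```r
# Toy labelled records: cluster is the ground-truth entity label.
labelled <- data.frame(
  id      = 1:6,
  cluster = c("a", "a", "a", "b", "b", "c"),
  surname = c("lee", "lee", "leigh", "ray", "ray", "fox"),
  dob     = c("1900", "1900", "1900", "1850", "1850", "1700"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all unordered id pairs that share the same key.
pairs_within <- function(ids, key) {
  groups <- split(ids, key)
  groups <- groups[lengths(groups) > 1]
  as.character(unlist(lapply(groups, function(g) {
    apply(combn(sort(g), 2), 2, paste, collapse = "-")
  }), use.names = FALSE))
}

# True duplicate pairs come from the ground-truth labels alone.
true_pairs <- pairs_within(labelled$id, labelled$cluster)

# Candidate pairs come from a single blocking rule: surname + dob.
block_key <- paste(labelled$surname, labelled$dob, sep = "|")
candidate_pairs <- pairs_within(labelled$id, block_key)

# True pairs that blocking never generates can never be predicted,
# so this ratio is an upper bound on recall at any threshold.
blocking_recall <- length(intersect(true_pairs, candidate_pairs)) /
  length(true_pairs)

blocking_recall
```

Record 3 has a misspelled surname, so its two true pairs are never compared under this rule, and no threshold can recover them. Adding a second rule that does not depend on surname would raise the ceiling.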

autoplot(acc)

autoplot(il_roc(model, labels_col = 'cluster'))

autoplot(il_precision_recall(model, labels_col = 'cluster'))

Error inspection

errors <- il_errors(model, labels_col = 'cluster', threshold = 0.999)
errors[errors$error_type == 'false_positive', ]

Some false negatives occur because the true pair was never generated by any blocking rule:

errors <- il_errors(model, labels_col = 'cluster', threshold = 0.5)
errors[errors$error_type == 'false_negative', ]

Cleanup

il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)

il_cleanup(model) is model-scoped: it removes only the tables that belong to that model. If an interactive run failed before the model object was assigned, call il_cleanup_all(con) to remove all irelink tables from the connection before disconnecting.