This vignette reproduces the Splink
“Deduplicate 50k synthetic” demo in irelink. The data
is based on historical people scraped from Wikidata and includes
duplicate records with realistic errors such as typos, missing values,
and swapped fields. The cluster column provides the
ground-truth entity labels used in evaluation.
This vignette requires nanoparquet to read the remote Parquet file. It is built only when both the package and the data URL are available.
Profile the data
Use completeness and value distributions to choose blocking rules and comparisons:
df |>
  il_completeness(con = con) |>
  autoplot()
il_profile(df, first_name, surname, dob, birth_place, con = con, top_n = 8)
Choose blocking rules
il_suggest_blocking(df, con = con)
The cumulative_pairs column shows the total number of
unique pairs produced so far:
il_count_pairs(
  df,
  block_on(surname, dob),
  block_on(first_name, dob),
  block_on(first_name, surname),
  block_on(dob, birth_place),
  con = con
)
Define the specification
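The specification lists several blocking rules, so it helps to see what a cumulative count of unique pairs measures. The sketch below is a base-R illustration on a hypothetical five-row table, not the irelink implementation. It assumes that rules are combined as a union, so a pair found by an earlier rule is not counted again. The names `toy`, `pairs_for_rule`, and `rules` are invented for this example.

```r
# Hypothetical toy records; these are not the vignette's data.
toy <- data.frame(
  id         = 1:5,
  first_name = c("ann", "ann", "bob", "bob", "ann"),
  surname    = c("lee", "lee", "ray", "roy", "lea"),
  dob        = c("1900-01-01", "1900-01-01", "1850-05-05",
                 "1850-05-05", "1900-01-01"),
  stringsAsFactors = FALSE
)

# Return all record pairs (left id < right id) that agree exactly
# on every column named in `cols`.
pairs_for_rule <- function(data, cols) {
  key  <- do.call(paste, c(unname(as.list(data[cols])), sep = "\r"))
  idx  <- utils::combn(nrow(data), 2)
  same <- key[idx[1, ]] == key[idx[2, ]]
  paste(data$id[idx[1, same]], data$id[idx[2, same]], sep = "-")
}

# Assumed rule order; each rule is a set of columns that must match.
rules <- list(
  c("surname", "dob"),
  c("first_name", "dob"),
  c("first_name", "surname")
)

seen             <- character(0)
new_pairs        <- integer(length(rules))
cumulative_pairs <- integer(length(rules))
for (i in seq_along(rules)) {
  # Keep only pairs that no earlier rule has already produced.
  new  <- setdiff(pairs_for_rule(toy, rules[[i]]), seen)
  seen <- c(seen, new)
  new_pairs[i]        <- length(new)
  cumulative_pairs[i] <- length(seen)
}

counts <- data.frame(
  rule = vapply(rules, paste, character(1), collapse = " + "),
  new_pairs,
  cumulative_pairs,
  stringsAsFactors = FALSE
)
counts
```

In this toy table the third rule adds no new pairs, because the only pair it finds was already produced by the first rule. A rule whose cumulative count barely moves costs little extra work but also recovers few additional candidates.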
Apply term-frequency adjustment to birth_place and
occupation so common values such as “London” receive less
weight than rare ones:
spec <- il_spec() |>
  il_compare(first_name, cl_name()) |>
  il_compare(surname, cl_name()) |>
  il_compare(dob, cl_dob()) |>
  il_compare(postcode_fake, cl_postcode()) |>
  il_compare(birth_place, cl_exact(term_frequency = TRUE)) |>
  il_compare(occupation, cl_exact(term_frequency = TRUE)) |>
  il_block_on(first_name ~ il_substr(1, 3), surname ~ il_substr(1, 4)) |>
  il_block_on(surname, dob) |>
  il_block_on(first_name, dob) |>
  il_block_on(postcode_fake, first_name) |>
  il_block_on(postcode_fake, surname) |>
  il_block_on(dob, birth_place) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), dob) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), first_name) |>
  il_block_on(postcode_fake ~ il_substr(1, 3), surname) |>
  il_block_on(
    first_name ~ il_substr(1, 2),
    surname ~ il_substr(1, 2),
    dob ~ il_substr(1, 4)
  )
spec
Train the model
model <- df |>
  il_model(spec = spec, con = con) |>
  il_estimate_prior(
    block_on(first_name, surname, dob),
    block_on(dob, postcode_fake),
    recall = 0.6
  ) |>
  il_estimate_u(max_pairs = 5e6) |>
  il_estimate_em(block_on(first_name, surname)) |>
  il_estimate_em(block_on(dob))
Inspect the trained model
summary(model)
autoplot(model)
autoplot(model, type = 'parameters')
autoplot(il_unlinkables(model))
Predict
predictions <- predict(model, threshold = 0.5)
predictions
autoplot(predictions)
autoplot(predictions, which = 1)
Cluster
clusters <- il_cluster(predictions, threshold = 0.95)
clusters
Evaluate against ground truth
acc <- il_accuracy(model, labels_col = 'cluster')
acc
When you use labels_col, the evaluation derives all true
duplicate pairs from the ground-truth cluster column. Some true pairs
may never be generated by the blocking rules, and those pairs count as
false negatives at every threshold. As a result, the maximum recall in the
accuracy, ROC, and precision-recall plots is the blocking recall, that is,
the share of true pairs generated by at least one blocking rule:
acc0 <- acc[acc$threshold == min(acc$threshold), ]
acc0$tp / (acc0$tp + acc0$fn)
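To see why recall cannot exceed that ratio, consider a minimal base-R sketch on hypothetical data. It does not use the irelink API. The tables `truth` and `scored` and the helper `recall_at` are invented for this example, and the sketch assumes that a pair is predicted as a match when its probability is at or above the threshold.

```r
# Hypothetical ground truth: records 1-3 are one entity, 4-5 another.
truth <- data.frame(
  id      = 1:6,
  cluster = c("a", "a", "a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# Every within-cluster pair is a true duplicate pair.
idx        <- utils::combn(nrow(truth), 2)
is_true    <- truth$cluster[idx[1, ]] == truth$cluster[idx[2, ]]
true_pairs <- paste(truth$id[idx[1, is_true]],
                    truth$id[idx[2, is_true]], sep = "-")

# Hypothetical scored pairs: only these survived blocking.
# The true pair "1-3" was never generated, so it has no score.
scored <- data.frame(
  pair = c("1-2", "2-3", "4-5", "3-4"),
  prob = c(0.99, 0.70, 0.40, 0.20),
  stringsAsFactors = FALSE
)

# Recall at a threshold: true pairs predicted as matches / all true pairs.
recall_at <- function(threshold) {
  predicted <- scored$pair[scored$prob >= threshold]
  sum(true_pairs %in% predicted) / length(true_pairs)
}

thresholds      <- c(0, 0.5, 0.9)
recall          <- vapply(thresholds, recall_at, numeric(1))
blocking_recall <- mean(true_pairs %in% scored$pair)

recall           # 0.75 0.50 0.25
blocking_recall  # 0.75
```

Even at a threshold of 0, where every scored pair is accepted, recall stops at 0.75 because one of the four true pairs was never scored. Lowering the threshold cannot recover it; only a broader blocking rule can.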
autoplot(acc)
autoplot(il_precision_recall(model, labels_col = 'cluster'))
Error inspection
errors <- il_errors(model, labels_col = 'cluster', threshold = 0.999)
errors[errors$error_type == 'false_positive', ]
Some false negatives occur because the true pair was never generated by any blocking rule:
errors <- il_errors(model, labels_col = 'cluster', threshold = 0.5)
errors[errors$error_type == 'false_negative', ]
Cleanup
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
il_cleanup(model) is model-scoped. If an interactive run
failed before you kept the model object, call
il_cleanup_all(con) to remove all irelink
tables from the connection before disconnecting.
