Identifies pairs of records that share a cluster but were not scored during prediction (e.g. because they fell into different blocking groups), then scores them using the model. This can reveal low-confidence links that bridge otherwise separate sub-clusters.
Arguments
- model
A trained il_model object.
- pairs
An il_compared tibble from predict.il_model().
- clusters
A tibble from il_cluster() with columns unique_id and cluster_id.
- threshold
Numeric match-probability threshold for returned pairs. Defaults to 0.
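To make the idea of a "missing edge" concrete, the following is a minimal base-R sketch of how within-cluster pairs that were never scored could be enumerated. It is an illustration only, not the package's implementation: the helper names `within_cluster_pairs()` and `pair_key()` and the pair columns `unique_id_l` and `unique_id_r` are hypothetical, and the real il_compared tibble may use different column names.

```r
# Hypothetical inputs: cluster assignments and the pairs already scored.
# The pair column names (unique_id_l, unique_id_r) are assumptions.
clusters <- data.frame(
  unique_id  = c("1", "2", "3", "4"),
  cluster_id = c("a", "a", "a", "b"),
  stringsAsFactors = FALSE
)
scored <- data.frame(
  unique_id_l = c("1", "2"),
  unique_id_r = c("2", "3"),
  stringsAsFactors = FALSE
)

# Enumerate every unordered pair of records within each cluster.
within_cluster_pairs <- function(clusters) {
  by_cluster <- split(clusters$unique_id, clusters$cluster_id)
  pairs <- lapply(by_cluster, function(ids) {
    if (length(ids) < 2) return(NULL)  # singleton clusters have no pairs
    m <- utils::combn(sort(ids), 2)
    data.frame(
      unique_id_l = m[1, ],
      unique_id_r = m[2, ],
      stringsAsFactors = FALSE
    )
  })
  # Drop empty results and stack the per-cluster pairs.
  do.call(rbind, Filter(Negate(is.null), unname(pairs)))
}

# Order-independent key so that (1, 3) and (3, 1) count as the same pair.
pair_key <- function(l, r) paste(pmin(l, r), pmax(l, r), sep = "|")

all_pairs <- within_cluster_pairs(clusters)
already_scored <- pair_key(all_pairs$unique_id_l, all_pairs$unique_id_r) %in%
  pair_key(scored$unique_id_l, scored$unique_id_r)

# Pairs in the same cluster that prediction never scored.
missing_pairs <- all_pairs[!already_scored, , drop = FALSE]
```

In this toy input, records 1 and 3 sit in the same cluster only because each links to record 2, so the pair (1, 3) is the one that would be passed to the model for scoring. A low score on such a pair suggests the cluster is held together by a weak bridge.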
Examples
# Toy data: three similar-looking records
df <- data.frame(
  unique_id = c(1, 2, 3),
  first_name = c('John', 'John', 'Jon'),
  surname = c('Smith', 'Smyth', 'Smith')
)
con <- DBI::dbConnect(duckdb::duckdb())
# Compare on first_name and surname; block on surname
spec <- il_spec() |>
  il_compare(first_name, cl_exact()) |>
  il_compare(surname, cl_exact()) |>
  il_block_on(surname)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
#> EM trained: first_name | skipped (blocked on): surname
# Pairs scored during prediction
pairs <- predict(model, threshold = 0.01)
# Cluster assignments: all three records in one cluster
clusters <- tibble::tibble(
  unique_id = c('1', '2', '3'),
  cluster_id = 'cluster_1'
)
# Score within-cluster pairs that predict() did not score
missing <- il_score_missing_edges(model, pairs, clusters)
# Clean up and close the database connection
il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
