
Cluster-Level Confusion Matrix for Deduplication
Source: R/il_cluster_confusion_matrix.R
Computes a record-level confusion matrix after clustering predicted
matches into entities. A record is treated as "duplicated" if it is not
the first record in its predicted cluster. The same rule is applied to the
ground-truth clusters in labels_col.
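The "not the first record in its cluster" rule can be illustrated in base R. This is a minimal sketch, not the package's implementation: it assumes "first" means first in row order, and the `predicted` vector and the helper `is_duplicate()` are hypothetical names introduced only for illustration.

```r
# Flag every record that is not the first one seen in its cluster.
# duplicated() returns FALSE for the first occurrence of each value.
# Assumption: "first" is taken to mean first in row order.
is_duplicate <- function(cluster_id) duplicated(cluster_id)

truth     <- c(1, 1, 2, 3, 4)  # ground-truth clusters (the labels_col values)
predicted <- c(1, 1, 2, 3, 3)  # hypothetical predicted clusters

truth_dup <- is_duplicate(truth)
pred_dup  <- is_duplicate(predicted)

# Record-level confusion counts.
tp <- sum(pred_dup & truth_dup)
fp <- sum(pred_dup & !truth_dup)
fn <- sum(!pred_dup & truth_dup)
tn <- sum(!pred_dup & !truth_dup)

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
```

With these inputs, only record 2 is a true duplicate, while records 2 and 5 are predicted duplicates, so one prediction is correct and one is a false positive.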
Arguments
- model
A trained il_model object for a deduplication task.
- labels_col
String naming the ground-truth cluster/entity column in the model's source data.
- threshold
Match-probability threshold passed to predict(). Defaults to 0.85.
- method
Clustering method passed to il_cluster().
- ties_method
Tie handling for method = "best_link", passed to il_cluster().
- source_dataset
Optional source-dataset mapping passed to il_cluster(). If supplied, it must cover every unique_id in the predicted pairs, and duplicate unique_id mappings are not allowed.
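The two constraints on source_dataset (full coverage and no duplicate unique_id entries) can be checked up front. The sketch below assumes the mapping is a data frame with `unique_id` and `source_dataset` columns. That shape is an assumption made for illustration, as are the names `mapping` and `pair_ids`; the package may accept a different form.

```r
# Hypothetical mapping: one row per record, naming its source dataset.
mapping <- data.frame(
  unique_id      = 1:5,
  source_dataset = c("a", "a", "b", "b", "b")
)

# unique_id values that appear in the predicted pairs (illustrative values).
pair_ids <- c(1, 2, 4, 5)

# Coverage: every id in the pairs must have a mapping row.
missing_ids <- setdiff(pair_ids, mapping$unique_id)

# Uniqueness: no unique_id may be mapped more than once.
dup_ids <- mapping$unique_id[duplicated(mapping$unique_id)]

stopifnot(length(missing_ids) == 0, length(dup_ids) == 0)
```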
Details
For DuckDB and PostgreSQL backends, pair scoring and clustering are pushed into SQL where possible. The final summary still returns a one-row tibble in R.
Examples
df <- data.frame(
  unique_id = 1:5,
  first_name = c('John', 'John', 'Mary', 'Bob', 'Bob'),
  surname = c('Smith', 'Smith', 'Jones', 'Brown', 'Brown'),
  cluster = c(1, 1, 2, 3, 4)
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_exact()) |>
  il_compare(surname, cl_exact()) |>
  il_block_on(surname)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
#> EM trained: first_name | skipped (blocked on): surname
il_cluster_confusion_matrix(model, labels_col = 'cluster', threshold = 0.85)
#> # A tibble: 1 × 9
#>   threshold    tp    fp    fn    tn fn_blocking_miss precision recall    f1
#>       <dbl> <int> <int> <int> <int>            <int>     <dbl>  <dbl> <dbl>
#> 1      0.85     1     1     0     3               NA       0.5      1 0.667
DBI::dbDisconnect(con, shutdown = TRUE)