Cluster-Level Confusion Matrix for Deduplication — il_cluster_confusion

Computes a record-level confusion matrix after clustering predicted matches into entities. A record counts as "duplicated" if it is not the first record in its predicted cluster. The same rule is applied to the ground-truth clusters in labels_col to obtain the true labels.
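
The "not the first record in its cluster" rule can be illustrated in base R with duplicated(). This is only a sketch of the idea, not the package's implementation: it assumes "first" means first in row order, and the predicted_cluster vector is hypothetical (true_cluster mirrors the cluster column used in the Examples).

```r
# Hypothetical cluster assignments for five records (not package output).
predicted_cluster <- c(1, 1, 2, 3, 3)
true_cluster      <- c(1, 1, 2, 3, 4)

# duplicated() is FALSE for the first occurrence of each value and TRUE for
# every later one, so TRUE marks a "duplicated" record.
pred_dup <- duplicated(predicted_cluster)
true_dup <- duplicated(true_cluster)

# Record-level confusion counts.
counts <- c(
  tp = sum(pred_dup & true_dup),
  fp = sum(pred_dup & !true_dup),
  fn = sum(!pred_dup & true_dup),
  tn = sum(!pred_dup & !true_dup)
)
print(counts)
```

Because only non-first records are flagged, a cluster of k records contributes k - 1 "duplicated" records, whichever member happens to come first.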

Usage

il_cluster_confusion_matrix(
  model,
  labels_col,
  threshold = 0.85,
  method = c("connected", "best_link"),
  ties_method = c("lowest_id", "drop"),
  source_dataset = NULL
)

Arguments

model: A trained il_model object for a deduplication task.
labels_col: String naming the ground-truth cluster/entity column in the model's source data.
threshold: Match-probability threshold passed to predict(). Defaults to 0.85.
method: Clustering method passed to il_cluster().
ties_method: Tie handling for method = "best_link", passed to il_cluster().
source_dataset: Optional source-dataset mapping passed to il_cluster(). If supplied, it must cover every unique_id in the predicted pairs, and duplicate unique_id mappings are not allowed.
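
The two constraints on source_dataset (full coverage of the predicted pairs, no duplicate unique_id) can be checked before calling the function. The sketch below assumes the mapping is a data frame with columns unique_id and source_dataset; the accepted format is not documented here, and check_source_mapping() is a hypothetical helper, not part of the package.

```r
# Hypothetical helper (not part of the package): validates a mapping that is
# assumed to be a data frame with columns unique_id and source_dataset.
check_source_mapping <- function(mapping, pair_ids) {
  # Reject mappings that list the same unique_id more than once.
  dup_ids <- unique(mapping$unique_id[duplicated(mapping$unique_id)])
  if (length(dup_ids) > 0) {
    stop("duplicate unique_id in mapping: ", paste(dup_ids, collapse = ", "))
  }
  # Every id that appears in a predicted pair must be present in the mapping.
  missing_ids <- setdiff(unique(pair_ids), mapping$unique_id)
  if (length(missing_ids) > 0) {
    stop("unique_id missing from mapping: ", paste(missing_ids, collapse = ", "))
  }
  invisible(TRUE)
}

# Illustrative data: five records from two sources, and the ids that occur
# in the predicted pairs (1-2 and 4-5).
mapping  <- data.frame(
  unique_id      = 1:5,
  source_dataset = c("a", "a", "b", "b", "b")
)
pair_ids <- c(1, 2, 4, 5)

check_source_mapping(mapping, pair_ids)
```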

Value

A one-row tibble with columns threshold, tp, fp, fn, tn, fn_blocking_miss, precision, recall, and f1.
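
For reference, precision, recall, and f1 follow the standard definitions in terms of tp, fp, and fn. The sketch below recomputes them from the counts shown in the Examples. confusion_metrics() is a hypothetical helper, and returning NA for a zero denominator is an assumption, since the package's behaviour in that case is not documented here.

```r
# Hypothetical helper (not part of the package): standard metric definitions.
confusion_metrics <- function(tp, fp, fn) {
  # precision = tp / (tp + fp); recall = tp / (tp + fn).
  # Assumption: return NA when a denominator is zero.
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall    <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  # f1 is the harmonic mean of precision and recall.
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  c(precision = precision, recall = recall, f1 = f1)
}

# Counts taken from the Examples output: tp = 1, fp = 1, fn = 0.
m <- confusion_metrics(tp = 1, fp = 1, fn = 0)
print(round(m, 3))
#> precision    recall        f1
#>     0.500     1.000     0.667
```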

Details

For DuckDB and PostgreSQL backends, pair scoring and clustering are pushed into SQL where possible. The final summary is still collected into R and returned as a one-row tibble.

Examples

df <- data.frame(
  unique_id = 1:5,
  first_name = c('John', 'John', 'Mary', 'Bob', 'Bob'),
  surname = c('Smith', 'Smith', 'Jones', 'Brown', 'Brown'),
  cluster = c(1, 1, 2, 3, 4)
)
con <- DBI::dbConnect(duckdb::duckdb())
spec <- il_spec() |>
  il_compare(first_name, cl_exact()) |>
  il_compare(surname, cl_exact()) |>
  il_block_on(surname)
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
#> EM trained: first_name | skipped (blocked on): surname

il_cluster_confusion_matrix(model, labels_col = 'cluster', threshold = 0.85)
#> # A tibble: 1 × 9
#>   threshold    tp    fp    fn    tn fn_blocking_miss precision recall    f1
#>       <dbl> <int> <int> <int> <int>            <int>     <dbl>  <dbl> <dbl>
#> 1      0.85     1     1     0     3               NA       0.5      1 0.667
DBI::dbDisconnect(con, shutdown = TRUE)