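For each column, computes the fraction of labelled true-match pairs whose two records share the same value in that column (recall). Helps identify which columns make effective blocking keys: a column with high recall loses few true matches when used to block.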
Arguments
- .data
A data frame or character table name.
- labels
A data frame with unique_id_l, unique_id_r, and is_match.
- columns
Character vector of column names to evaluate. NULL for all non-ID columns.
- con
A DBI connection from DBI::dbConnect().
Value
A tibble::tibble() with columns column, recall (fraction of true
matches caught), and n_matches_caught.
Examples
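# Open an in-memory DuckDB connection
con <- DBI::dbConnect(duckdb::duckdb())
# Build the labels table: pairs with a clerical score of at least 0.5 count as matches
labels <- data.frame(
  unique_id_l = fake_1000_labels$unique_id_l,
  unique_id_r = fake_1000_labels$unique_id_r,
  is_match = as.integer(fake_1000_labels$clerical_match_score >= 0.5)
)
# Evaluate every non-ID column (columns = NULL)
block_from_labels(fake_1000, labels, con = con)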
#> # A tibble: 6 × 3
#>   column     recall n_matches_caught
#>   <chr>       <dbl>            <int>
#> 1 cluster     0.673             1367
#> 2 dob         0.403              819
#> 3 city        0.390              792
#> 4 email       0.359              730
#> 5 surname     0.258              525
#> 6 first_name  0.242              492
DBI::dbDisconnect(con, shutdown = TRUE)
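
To make the recall figure concrete, here is a minimal base R sketch of the same idea on a toy table. It is not the package's implementation: the names `records`, `toy_labels`, and `recall_by_column` are hypothetical, the ID column is assumed to be called `unique_id`, and treating missing values as non-matching is an assumption.

```r
# Toy records; `records`, `toy_labels`, and `recall_by_column` are
# hypothetical names used only for this illustration.
records <- data.frame(
  unique_id  = 1:4,
  first_name = c("ann", "anne", "bob", "bob"),
  city       = c("leeds", "leeds", "york", "york"),
  stringsAsFactors = FALSE
)

# Labelled pairs: (1, 2) and (3, 4) are true matches, (1, 3) is not.
toy_labels <- data.frame(
  unique_id_l = c(1, 3, 1),
  unique_id_r = c(2, 4, 3),
  is_match    = c(1L, 1L, 0L)
)

recall_by_column <- function(data, labels, columns) {
  # Keep only the labelled pairs that are true matches.
  true_pairs <- labels[labels$is_match == 1, , drop = FALSE]

  # Look up the record on each side of every true-match pair.
  left  <- data[match(true_pairs$unique_id_l, data$unique_id), , drop = FALSE]
  right <- data[match(true_pairs$unique_id_r, data$unique_id), , drop = FALSE]

  # A pair is "caught" by a column when both sides hold the same value.
  # Assumption: pairs with a missing value on either side are not caught.
  caught <- vapply(columns, function(col) {
    sum(left[[col]] == right[[col]], na.rm = TRUE)
  }, integer(1), USE.NAMES = FALSE)

  out <- data.frame(
    column = columns,
    recall = caught / nrow(true_pairs),
    n_matches_caught = caught,
    stringsAsFactors = FALSE
  )
  # Highest recall first.
  out[order(-out$recall), , drop = FALSE]
}

result <- recall_by_column(records, toy_labels, c("first_name", "city"))
print(result)
```

In this toy data, blocking on `city` keeps both true matches, while blocking on `first_name` loses the "ann" / "anne" pair, so `city` ranks first.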
