For a given blocking rule, returns the n blocking-key combinations that produce the most record pairs. This helps diagnose skew, where a single dominant key can create a quadratic explosion of pairs.
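To make the "quadratic explosion" concrete, here is a small arithmetic sketch. It assumes that under deduplication a block of k records is compared all-against-all, giving k(k - 1)/2 pairs; that assumption matches the example output on this page (4 records, 6 pairs), but the helper `pairs_in_block()` is hypothetical and not part of the package.

```r
# Hypothetical helper (not part of the package): pairs produced inside one
# block of k records, assuming every record is compared with every other once.
pairs_in_block <- function(k) k * (k - 1) / 2

block_sizes <- c(4, 100, 10000)
growth <- data.frame(
  n_records = block_sizes,
  n_pairs   = pairs_in_block(block_sizes)
)
print(growth)
# n_pairs is 6, 4950 and 49995000: a 2500-fold increase in records
# (4 to 10000) yields roughly an 8-million-fold increase in pairs.
```

A single oversized block can therefore dominate the total comparison count even when every other block is small.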

Usage

il_largest_blocks(
  .data,
  rule,
  n = 5L,
  con = NULL,
  link_type = c("dedupe", "link")
)

Arguments

.data

A data frame, dbplyr::tbl_lazy, or character table name (first or only dataset).

rule

A blocking rule created by block_on().

n

Integer. Number of largest blocks to return. Defaults to 5.

con

A DBI connection object from DBI::dbConnect(). Optional when .data is a dbplyr::tbl_lazy.

link_type

One of "dedupe" (default) or "link".

Value

A tibble::tibble() with one row per blocking-key combination, sorted by descending pair count. Columns are the blocking-key values plus n_records and n_pairs.
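As a rough cross-check of what these columns mean, the same shape of result can be approximated in base R for a single-column key. This is only a sketch under the assumption that dedupe pair counts are n_records * (n_records - 1) / 2; it does not reproduce how the function computes its result, and the `city` vector is illustrative data.

```r
# Illustrative data: five cities with four records each.
city <- rep(c("London", "Paris", "Berlin", "Rome", "Madrid"), each = 4)

# Count records per blocking-key value.
counts <- as.data.frame(table(city), stringsAsFactors = FALSE)
names(counts) <- c("city", "n_records")

# Assumed dedupe pair count: all unordered pairs within a block.
counts$n_pairs <- counts$n_records * (counts$n_records - 1) / 2

# Keep the three largest blocks, sorted by descending pair count.
top <- head(counts[order(-counts$n_pairs), ], 3)
print(top)
```

When several blocks tie on n_pairs, as they do here, which of them appear in the top n depends on the sort implementation, so the rows shown may differ between this sketch and the function's output.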

Examples

df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob',
    'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob',
    'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones',
    'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones',
    'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin',
    'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin',
    'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  )
)
con <- DBI::dbConnect(duckdb::duckdb())
il_largest_blocks(df, block_on(city), n = 3, con = con)
#> # A tibble: 3 × 3
#>   city   n_records n_pairs
#>   <chr>      <dbl>   <dbl>
#> 1 Paris          4       6
#> 2 Rome           4       6
#> 3 Berlin         4       6
DBI::dbDisconnect(con, shutdown = TRUE)