Generates and scores all candidate record pairs that pass the blocking rules, returning those at or above the match-probability threshold. This is an S3 method for stats::predict().

Usage

# S3 method for class 'il_model'
predict(
  object,
  threshold = 0.85,
  threshold_match_weight = NULL,
  type = c("pairs", "weights"),
  collect = TRUE,
  include_fields = FALSE,
  greedy = FALSE,
  profile_sql = FALSE,
  ...
)

Arguments

object

A trained il_model object.

threshold

A numeric value between 0 and 1. Only pairs with a match probability at or above this threshold are returned. Defaults to 0.85. Ignored when threshold_match_weight is set.

threshold_match_weight

Optional numeric value. When set, pairs are filtered on evidence-only match weight (log2 Bayes factor) instead of probability. Typical values range from about -5 to +30. Overrides threshold.

type

One of "pairs" (default) to return scored pairs, or "weights" to return match weights on a log-2 Bayes-factor scale.

collect

If TRUE (the default), scored pairs are collected into an in-memory tibble. If FALSE, scoring is performed entirely in-database and the result is a lightweight il_compared_lazy reference that il_cluster() can consume directly, avoiding the round-trip of collecting millions of rows into R and re-uploading them. Requires a DuckDB or PostgreSQL backend.

include_fields

If TRUE, the original column values from both records in each pair are included in the output (suffixed _l and _r). Defaults to FALSE for performance. When collect = FALSE, the join is performed in-database before the scored pairs table is created.

greedy

If TRUE, keep only a deterministic one-to-one greedy matching for link models. Pairs are sorted by descending posterior match probability, with ties broken by left and then right row order, so the same input always yields the same matching. Defaults to FALSE, which returns all above-threshold candidate pairs.

profile_sql

Logical. If TRUE, lightweight SQL timing metadata is attached to the result, whether the predictions are collected or lazy. Defaults to FALSE.

...

Additional arguments passed to the generic.
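The ordering described for greedy can be illustrated with a small base R sketch. This is not the package's implementation: the helper greedy_one_to_one(), the column names id_l, id_r and match_probability, and the use of IDs as a stand-in for row order are all assumptions made for illustration.

```r
# Hypothetical scored pairs; column names are illustrative only.
pairs <- data.frame(
  id_l = c(1, 1, 2, 2, 3),
  id_r = c(10, 11, 10, 11, 12),
  match_probability = c(0.95, 0.90, 0.97, 0.60, 0.88)
)

# Sketch of a deterministic one-to-one greedy matching:
# sort by descending probability, then left ID, then right ID,
# and keep a pair only if neither record has been used already.
greedy_one_to_one <- function(pairs) {
  ord <- order(-pairs$match_probability, pairs$id_l, pairs$id_r)
  sorted <- pairs[ord, , drop = FALSE]

  used_l <- numeric(0)
  used_r <- numeric(0)
  keep <- logical(nrow(sorted))

  for (i in seq_len(nrow(sorted))) {
    l <- sorted$id_l[i]
    r <- sorted$id_r[i]
    if (!(l %in% used_l) && !(r %in% used_r)) {
      keep[i] <- TRUE
      used_l <- c(used_l, l)
      used_r <- c(used_r, r)
    }
  }

  out <- sorted[keep, , drop = FALSE]
  rownames(out) <- NULL
  out
}

matched <- greedy_one_to_one(pairs)
```

In this toy input, every left and right record appears at most once in matched. A greedy pass is fast and reproducible, but it does not guarantee the assignment with the highest total probability.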

Value

When collect = TRUE: an il_compared tibble with one row per candidate pair, including columns for record IDs, match weight, total match weight, match probability, and per-comparison gamma values.

match_weight is the evidence-only log2 Bayes factor. The additive prior term is exposed separately through total_match_weight, whose value is match_weight + log2(prior / (1 - prior)).

When collect = FALSE: an il_compared_lazy object referencing the scored pairs table in the database.
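The arithmetic linking the weight columns to a probability can be sketched in base R. The total_match_weight formula is the one stated above; the step from total weight to probability assumes the usual posterior-odds relation (odds = 2^total_match_weight), which this page does not spell out. The helper names weight_to_probability() and probability_to_weight() and the prior of 0.01 are hypothetical.

```r
# Evidence-only weight plus the prior term, converted to a probability.
# Assumes posterior odds = 2^total_match_weight.
weight_to_probability <- function(match_weight, prior) {
  total_match_weight <- match_weight + log2(prior / (1 - prior))
  # plogis(x * log(2)) equals 2^x / (1 + 2^x) and stays stable for large x.
  stats::plogis(total_match_weight * log(2))
}

# Inverse: the evidence-only weight that corresponds to a probability
# threshold under a given prior.
probability_to_weight <- function(probability, prior) {
  log2(probability / (1 - probability)) - log2(prior / (1 - prior))
}

prior <- 0.01                              # illustrative prior, not a package default
w <- probability_to_weight(0.85, prior)    # weight equivalent of threshold = 0.85
p <- weight_to_probability(w, prior)       # round-trips back to 0.85
```

Under these assumptions, a value such as w could serve as threshold_match_weight when filtering on evidence alone is preferred to filtering on probability.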

Examples

df <- data.frame(
  unique_id = 1:20,
  first_name = c(
    'John', 'Jon', 'Jane', 'Jane', 'Bob',
    'Bobby', 'Alice', 'Alicia', 'Tom', 'Thomas',
    'John', 'Jon', 'Jane', 'Janet', 'Bob',
    'Robert', 'Alice', 'Alison', 'Tom', 'Tomas'
  ),
  surname = c(
    'Smith', 'Smith', 'Doe', 'Doe', 'Jones',
    'Jones', 'Brown', 'Brown', 'White', 'White',
    'Smith', 'Smyth', 'Doe', 'Doe', 'Jones',
    'Jones', 'Brown', 'Browne', 'White', 'White'
  ),
  dob = c(
    '1990-01-01', '1990-01-01', '1985-06-15', '1985-06-15',
    '2000-12-01', '2000-12-01', '1975-03-22', '1975-03-22',
    '1988-07-04', '1988-07-04', '1990-01-01', '1990-01-02',
    '1985-06-15', '1985-06-16', '2000-12-01', '2000-12-02',
    '1975-03-22', '1975-03-23', '1988-07-04', '1988-07-05'
  ),
  city = c(
    'London', 'London', 'Paris', 'Paris', 'Berlin',
    'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid',
    'London', 'London', 'Paris', 'Paris', 'Berlin',
    'Berlin', 'Rome', 'Rome', 'Madrid', 'Madrid'
  ),
  email = c(
    'john@example.com', 'jon@example.com', 'jane@example.com',
    'jane@example.com', 'bob@example.com', 'bobby@example.com',
    'alice@example.com', 'alicia@example.com', 'tom@example.com',
    'thomas@example.com', 'john@example.com', 'jon@example.com',
    'jane@example.com', 'janet@example.com', 'bob@example.com',
    'robert@example.com', 'alice@example.com', 'alison@example.com',
    'tom@example.com', 'tomas@example.com'
  )
)
# In-memory DuckDB connection used as the backend
con <- DBI::dbConnect(duckdb::duckdb())

# Comparison and blocking specification
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(surname) |>
  il_block_on(first_name)

# Build the model, then estimate its parameters
model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))
#> EM trained: first_name and dob | skipped (blocked on): surname

# Keep pairs with a match probability at or above 0.5
pairs <- predict(model, threshold = 0.5)

# Close the connection and shut down the DuckDB instance
DBI::dbDisconnect(con, shutdown = TRUE)