
Initial development release, translating Python’s splink probabilistic record linkage engine into idiomatic R.

Core pipeline

  • il_spec(), il_compare(), and il_block_on() define the linkage model declaratively: which fields to compare, how to compare them, and which blocking rules to apply.
  • il_model() binds a spec to one or two datasets and a DBI connection for dedupe, link, or link_and_dedupe, and accepts in-memory data frames, dbplyr::tbl_lazy references, or existing table-name strings.
  • predict() scores all candidate pairs above a match-probability threshold (or an evidence-only match-weight threshold via threshold_match_weight).
  • il_cluster() resolves scored pairs into entity clusters via connected components (using igraph) or single-best-link (with source_dataset for cross-source filtering).
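The connected-components step can be pictured with a small dependency-free sketch. This is not the package's implementation (which, per the note above, uses igraph or SQL); it is plain base R, and the column names, the 0.9 threshold, and the helper `connected_components()` are all hypothetical, chosen for illustration only.

```r
# Hypothetical scored pairs; the column names are illustrative only.
pairs <- data.frame(
  id_l = c("a", "b", "d", "e"),
  id_r = c("b", "c", "e", "f"),
  match_probability = c(0.97, 0.92, 0.40, 0.99),
  stringsAsFactors = FALSE
)
ids <- c("a", "b", "c", "d", "e", "f", "g")

# Hypothetical helper: label propagation until no label changes.
connected_components <- function(ids, edges_l, edges_r) {
  # Start with every record in its own cluster, labelled by position.
  label <- seq_along(ids)
  names(label) <- ids
  repeat {
    changed <- FALSE
    for (k in seq_along(edges_l)) {
      # Both ends of an edge take the smaller of their two labels.
      lo <- min(label[[edges_l[k]]], label[[edges_r[k]]])
      if (label[[edges_l[k]]] != lo || label[[edges_r[k]]] != lo) {
        label[[edges_l[k]]] <- lo
        label[[edges_r[k]]] <- lo
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  label
}

# Keep only edges at or above the (assumed) clustering threshold.
keep <- pairs$match_probability >= 0.9
cluster_id <- connected_components(ids, pairs$id_l[keep], pairs$id_r[keep])
```

Records that share a `cluster_id` are treated as one entity; the low-scoring d-e pair is dropped before clustering, so d stays a singleton.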

Comparison library

Column transforms

Blocking

  • il_block_on() and block_on() for equality-based and custom SQL blocking rules, with per-column transform support via formula syntax (col ~ transform, e.g. first_name ~ il_substr(1, 3)) or a named-list .transform for programmatic construction.
  • .explode parameter for array-valued blocking columns (generates UNNEST subqueries for DuckDB/PostgreSQL).
  • il_count_pairs() estimates candidate-pair counts, including cumulative totals and percent-of-cartesian summaries across rule combinations.
  • il_suggest_blocking() ranks candidate blocking rules by pair-reduction, coverage, and balanced score.
  • il_find_blocking_below() finds blocking rule combinations below a pair count ceiling.
  • block_from_labels() measures per-column recall from labeled pairs.
  • il_largest_blocks() identifies the blocking keys that generate the most records and pairs, respecting blocking transforms.
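As a rough illustration of what a candidate-pair count and a percent-of-cartesian figure mean for a single equality blocking rule in a dedupe setting, the arithmetic can be sketched in base R. The data and variable names are made up, and this does not reproduce the package's SQL or its output format.

```r
# Hypothetical records blocked on exact surname equality.
records <- data.frame(
  id = 1:6,
  surname = c("smith", "smith", "smith", "jones", "jones", "lee"),
  stringsAsFactors = FALSE
)

# Unblocked deduplication compares every unordered pair of records.
n <- nrow(records)
cartesian_pairs <- n * (n - 1) / 2

# Under blocking, only records sharing a key are compared, so each block
# of size k contributes choose(k, 2) candidate pairs.
block_sizes <- as.integer(table(records$surname))
blocked_pairs <- sum(choose(block_sizes, 2))

# Share of the full comparison space that survives blocking.
pct_of_cartesian <- 100 * blocked_pairs / cartesian_pairs
```

The same per-block sizes also show which keys dominate the workload: the largest block here ("smith") accounts for three of the four candidate pairs.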

Training

  • il_estimate_u() estimates non-match probabilities by sampling random pairs, with optional chunked estimation through chunk_size and early stopping through min_count_per_level.
  • il_estimate_em() runs the Fellegi-Sunter EM algorithm with configurable max_iterations, convergence, fix_u, fix_m, fix_prior, derive_prior, estimate_without_tf, and estimator_mode parameters.
  • estimator_mode = "dependency-aware" fits log-linear matched and unmatched comparison-pattern distributions over aggregated gamma counts, preserving missing comparison states as explicit pattern levels.
  • il_estimate_prior() sets the prior match probability from deterministic matching rules, counting unique blocked pairs across overlapping rules.
  • il_prior_prevalence() and il_prior_m() add regularizing custom priors for EM, il_constrain_m() adds explicit fixed matched-class constraints, and il_priors() / il_constraints() expose the stored metadata.
  • il_estimate_m_from_labels() and il_estimate_m_from_column() initialize parameters from ground-truth labels.
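For readers unfamiliar with Fellegi-Sunter EM, the following is a minimal base R sketch of the standard conditional-independence version, run on aggregated comparison-pattern counts with binary agree/disagree levels. The counts, starting values, iteration cap, and tolerance are invented for illustration, and nothing here should be read as the package's actual estimator or its defaults.

```r
# Hypothetical aggregated patterns: 1 = agree, 0 = disagree, n = pair count.
patterns <- expand.grid(gamma_name = c(1, 0),
                        gamma_dob  = c(1, 0),
                        gamma_city = c(1, 0))
patterns$n <- c(80, 8, 10, 30, 12, 25, 35, 800)
gam <- as.matrix(patterns[, c("gamma_name", "gamma_dob", "gamma_city")])

# Assumed starting values: prior match probability and per-field m / u.
lambda <- 0.1
m <- rep(0.9, ncol(gam))
u <- rep(0.1, ncol(gam))

for (iter in 1:200) {
  # E-step: posterior match probability of each pattern under independence.
  p_match <- lambda * apply(gam, 1, function(g) prod(m^g * (1 - m)^(1 - g)))
  p_non <- (1 - lambda) * apply(gam, 1, function(g) prod(u^g * (1 - u)^(1 - g)))
  w <- p_match / (p_match + p_non)

  # M-step: re-estimate parameters from pair counts weighted by w.
  new_lambda <- sum(w * patterns$n) / sum(patterns$n)
  new_m <- colSums(gam * w * patterns$n) / sum(w * patterns$n)
  new_u <- colSums(gam * (1 - w) * patterns$n) / sum((1 - w) * patterns$n)

  # Stop once m and u have stabilised (assumed tolerance).
  delta <- max(abs(c(new_m - m, new_u - u)))
  lambda <- new_lambda
  m <- new_m
  u <- new_u
  if (delta < 1e-6) break
}
```

Fixing a parameter, in the spirit of fix_u or fix_prior, would amount to skipping the corresponding update in the M-step of this sketch.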

Prediction

  • predict() supports both threshold (match probability) and threshold_match_weight (evidence-only log2 Bayes factor) filtering.
  • Prediction output includes evidence-only match_weight, prior-inclusive total_match_weight, and posterior match_probability.
  • predict(type = "weights") returns match weights on the log2 Bayes-factor scale, and greedy = TRUE adds deterministic one-to-one post-processing for link models.
  • include_fields = TRUE joins all source columns into the scored output.
  • collect = FALSE returns an il_compared_lazy object backed by a model-scoped in-database table.
  • il_score_missing_edges() enumerates and scores unscored within-cluster pairs.
  • il_score_patterns() scores compatible comparison-pattern tables, including dependency-aware pattern tables larger than the table used for fitting.
  • il_deterministic_link() performs single-table exact-match deduplication without training.
  • il_find_matches() scores a set of probe records against existing data.
  • profile_sql = TRUE on predict() attaches lightweight SQL timing metadata to collected predictions or lazy prediction objects.
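The three prediction quantities named above are related by standard Fellegi-Sunter arithmetic, sketched here in base R. The m, u, and prior values are hypothetical, and the sketch assumes the package follows the usual log2 Bayes-factor convention described in these notes; it does not call the package.

```r
# Hypothetical m and u probabilities for the observed level of each
# comparison, plus a hypothetical prior match probability.
m <- c(first_name = 0.90, surname = 0.95, dob = 0.85)
u <- c(first_name = 0.02, surname = 0.01, dob = 0.005)
prior <- 0.001

# Evidence-only weight: sum of per-comparison log2 Bayes factors.
match_weight <- sum(log2(m / u))

# Prior-inclusive weight: add the log2 prior odds.
prior_weight <- log2(prior / (1 - prior))
total_match_weight <- match_weight + prior_weight

# Posterior probability: convert log2 odds back to a probability.
match_probability <- 2^total_match_weight / (1 + 2^total_match_weight)
```

Because match_weight excludes the prior, a threshold on it (as with threshold_match_weight) keeps the same meaning when the prior is re-estimated, whereas a threshold on match_probability shifts with the prior.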

Diagnostics and evaluation

Data exploration

Visualization

  • autoplot() methods for il_model, il_compared, il_training_history, il_accuracy, il_roc, il_precision_recall, il_unlinkables, il_completeness, il_count_pairs, il_profile, il_string_similarity, il_comparator_score, and il_comparison_vectors.
  • All chart types are composable with standard ggplot2 layers.

Datasets

  • fake_1000: 1,000 records (250 entities) for deduplication.
  • fake_1000_labels: 3,176 pairwise labels for evaluation.
  • fake_20: minimal 20-record example.
  • febrl4a / febrl4b: 5,000-record cross-table linkage benchmark from FEBRL.

SQL backends and persistence

  • All computation runs inside a DBI-compatible database: DuckDB (recommended), SQLite, or PostgreSQL.
  • Database-backed workflows support zero-copy registration from dbplyr::tbl_lazy references and existing table names, in addition to in-memory data frames.
  • il_save() and il_load() support both RDS files and Splink settings JSON.
  • il_attach() reattaches a saved model to different data or connections.
  • il_cleanup() removes temporary tables owned by a single model, making it safe for shared DBI connections with multiple live models.
  • il_cleanup_all() removes all package-owned temporary tables from a connection for exploratory sessions and failed runs.

Performance

  • Gamma computation is pushed into DuckDB using native C++ string similarity functions.
  • SQLite is retained as a fallback with R-side gamma computation via stringdist.
  • DuckDB and PostgreSQL use SQL-native connected components, with an igraph fallback for SQLite.
  • Term-frequency, lazy prediction, and scratch tables use generated model-scoped names to avoid collisions on shared connections.
  • profile_sql = TRUE on il_estimate_u(), il_estimate_prior(), and predict() records lightweight SQL timing metadata for performance investigation.
  • End-to-end benchmarks against an R-side SQLite baseline: 1,000 records in 1.4 s (2.1× faster), 5,000 records in 19.5 s (1.6× faster), 10,000 records in 61.4 s (2.6× faster). The speedup is largest at 10,000 records, although it dips at 5,000 rather than growing steadily with dataset size.