irelink 0.0.1
Initial development release, translating Python’s splink probabilistic record linkage engine into idiomatic R.
Core pipeline
- `il_spec()`, `il_compare()`, and `il_block_on()` define the linkage model declaratively: which fields to compare, how to compare them, and which blocking rules to apply.
- `il_model()` binds a spec to one or two datasets and a DBI connection for `dedupe`, `link`, or `link_and_dedupe`, and accepts in-memory data frames, `dbplyr::tbl_lazy` references, or existing table-name strings.
- `predict()` scores all candidate pairs above a match-probability threshold (or an evidence-only match-weight threshold via `threshold_match_weight`).
- `il_cluster()` resolves scored pairs into entity clusters via connected components (using igraph) or single-best-link (with `source_dataset` for cross-source filtering).
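The connected-components step of clustering can be illustrated with a minimal, language-neutral sketch. It is written in Python purely for illustration and is not irelink's implementation or API: the function name `connected_components`, the `(id_l, id_r, probability)` tuple layout, and the default threshold are all assumptions made for this example.

```python
def connected_components(pairs, threshold=0.5):
    """Group record ids into clusters from scored pairs (id_l, id_r, probability).

    Hypothetical helper: pairs at or above the threshold become edges, and
    every id that appears in any pair ends up in exactly one cluster.
    """
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for left, right, prob in pairs:
        root_l, root_r = find(left), find(right)
        if prob >= threshold and root_l != root_r:
            # Merge the two components under the smaller root id.
            parent[max(root_l, root_r)] = min(root_l, root_r)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


scored = [("a", "b", 0.97), ("b", "c", 0.91), ("c", "d", 0.20), ("e", "f", 0.88)]
clusters = connected_components(scored, threshold=0.5)
```

Here `a`, `b`, and `c` end up in one cluster because the links chain together, while `d` stays a singleton because its only link falls below the threshold. This transitivity is what distinguishes connected components from a single-best-link strategy.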
Comparison library
- String similarity: `cl_exact()`, `cl_levenshtein()`, `cl_damerau_levenshtein()`, `cl_jaro()`, `cl_jaro_winkler()`, `cl_jaccard()`, `cl_cosine()`.
- Numeric and distance: `cl_numeric_diff()`, `cl_pct_diff()`, `cl_geo_distance()`.
- Temporal: `cl_date_diff()` for date proximity with `days()`, `months()`, and `years()` helpers, and `cl_time_diff()` for sub-day precision with `seconds()`, `minutes()`, and `hours()` helpers.
- Collections: `cl_array_intersect()`, `cl_array_subset()`, `cl_array_min_distance()`.
- Domain-specific: `cl_name()`, `cl_first_last_name()` / `cl_forename_surname()` (both accept a companion column for the surname and handle first/last swap detection), `cl_dob()`, `cl_email()`, `cl_domain()`, `cl_soundex()`, `cl_zip_code()` (exact, 5-digit ZIP+4, and 3-digit Sectional Center Facility prefix levels), `cl_postcode()`.
- Composition: `cl_levels()`, `cl_and()`, `cl_or()`, `cl_not()`, `cl_null()`, `cl_else()`, `cl_literal()`, `cl_custom()`, `cl_columns_reversed()`.
- All applicable comparators accept `term_frequency = TRUE` for Fellegi-Sunter term-frequency adjustments.
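To make the idea of comparison levels concrete, the following sketch maps a pair of strings to an ordinal level using Levenshtein distance. It is a conceptual illustration in Python, not the behaviour of `cl_levenshtein()`: the function names, the default thresholds `(1, 2)`, and the use of `-1` for a null comparison are assumptions chosen for this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def gamma_level(left, right, thresholds=(1, 2)):
    """Return an ordinal comparison level for one field (hypothetical encoding).

    Highest level = exact match, then one level per distance threshold,
    0 = everything else, -1 = at least one value is missing.
    """
    if left is None or right is None:
        return -1
    if left == right:
        return len(thresholds) + 1
    distance = levenshtein(left, right)
    for rank, limit in enumerate(thresholds):
        if distance <= limit:
            return len(thresholds) - rank
    return 0


levels = [
    gamma_level("smith", "smith"),   # exact match
    gamma_level("smith", "smyth"),   # one substitution
    gamma_level("jones", "johns"),   # two substitutions
    gamma_level("smith", "garcia"),  # unrelated
    gamma_level("smith", None),      # missing value
]
```

Each level later receives its own m and u probability during training, which is why the ordering of thresholds from strictest to loosest matters.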
Column transforms
- `il_transform()` composes multiple R functions into a chainable transform with SQL-side nesting (e.g. `TRIM(LOWER(col))`).
- Column transform factories for SQL-side column expressions: `il_substr()`, `il_regex_extract()`, `il_nullif()`, `il_cast_to_string()`, `il_try_parse_date()`, `il_array_element()`.
- Built-in transforms auto-translated to SQL: `tolower`, `toupper`, `trimws`.
- Phonetic transforms: `il_soundex()`, `il_metaphone()`, `il_dmetaphone()` (usable as R functions and SQL macros).
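The SQL-side nesting mentioned above, where a chain of transforms becomes an expression such as `TRIM(LOWER(col))`, can be sketched as follows. This is an illustrative Python sketch, not irelink code: `SQL_TEMPLATES` and `compose_sql` are hypothetical names, and the mapping of `tolower`, `toupper`, and `trimws` to `LOWER`, `UPPER`, and `TRIM` is an assumption about how such a translation might look.

```python
# Hypothetical mapping from R function names to SQL expression templates.
SQL_TEMPLATES = {
    "tolower": "LOWER({})",
    "toupper": "UPPER({})",
    "trimws": "TRIM({})",
}


def compose_sql(column, *transforms):
    """Apply transforms in order, wrapping the expression from the inside out."""
    expr = column
    for name in transforms:
        expr = SQL_TEMPLATES[name].format(expr)
    return expr


expr = compose_sql("first_name", "tolower", "trimws")
```

Because the first transform is applied first, it ends up innermost in the generated expression, so the call above yields `TRIM(LOWER(first_name))`.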
Blocking
- `il_block_on()` and `block_on()` for equality-based and custom SQL blocking rules, with per-column transform support via formula syntax (`col ~ transform`, e.g. `first_name ~ il_substr(1, 3)`) or a named-list `.transform` for programmatic construction.
- `.explode` parameter for array-valued blocking columns (generates `UNNEST` subqueries for DuckDB/PostgreSQL).
- `il_count_pairs()` estimates candidate-pair counts, including cumulative totals and percent-of-cartesian summaries across rule combinations.
- `il_suggest_blocking()` ranks candidate blocking rules by pair-reduction, coverage, and balanced score.
- `il_find_blocking_below()` finds blocking rule combinations below a pair count ceiling.
- `block_from_labels()` measures per-column recall from labeled pairs.
- `il_largest_blocks()` identifies the blocking keys that generate the most records and pairs, respecting blocking transforms.
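The arithmetic behind candidate-pair counts and percent-of-cartesian summaries can be shown with a small sketch for a single equality blocking rule in a deduplication setting. It is written in Python for illustration only; `count_pairs`, its return fields, and the toy records are assumptions and do not reflect the output format of `il_count_pairs()` or `il_largest_blocks()`.

```python
from collections import Counter


def count_pairs(records, key):
    """Count candidate pairs for one equality blocking rule (dedupe setting).

    A block of n records sharing a key yields n * (n - 1) / 2 pairs.
    Records with a missing key are assumed never to block together.
    """
    sizes = Counter(k for k in (key(r) for r in records) if k is not None)
    blocked = sum(n * (n - 1) // 2 for n in sizes.values())
    cartesian = len(records) * (len(records) - 1) // 2
    return {
        "pairs": blocked,
        "cartesian": cartesian,
        "pct_of_cartesian": 100 * blocked / cartesian,
        "largest_block": sizes.most_common(1)[0],
    }


records = [
    {"id": 1, "surname": "smith"},
    {"id": 2, "surname": "smith"},
    {"id": 3, "surname": "smith"},
    {"id": 4, "surname": "jones"},
    {"id": 5, "surname": "jones"},
    {"id": 6, "surname": None},
]
summary = count_pairs(records, key=lambda r: r["surname"])
```

With three `smith` records and two `jones` records, the rule generates 3 + 1 = 4 candidate pairs out of 15 possible, and the `smith` block is the largest. A transformed key, such as the first three characters of a name, would simply be a different `key` function.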
Training
- `il_estimate_u()` estimates non-match probabilities by sampling random pairs, with optional chunked estimation through `chunk_size` and early stopping through `min_count_per_level`.
- `il_estimate_em()` runs the Fellegi-Sunter EM algorithm with configurable `max_iterations`, `convergence`, `fix_u`, `fix_m`, `fix_prior`, `derive_prior`, `estimate_without_tf`, and `estimator_mode` parameters.
- `estimator_mode = "dependency-aware"` fits log-linear matched and unmatched comparison-pattern distributions over aggregated gamma counts, preserving missing comparison states as explicit pattern levels.
- `il_estimate_prior()` sets the prior match probability from deterministic matching rules, counting unique blocked pairs across overlapping rules.
- `il_prior_prevalence()` and `il_prior_m()` add regularizing custom priors for EM, `il_constrain_m()` adds explicit fixed matched-class constraints, and `il_priors()` / `il_constraints()` expose the stored metadata.
- `il_estimate_m_from_labels()` and `il_estimate_m_from_column()` initialize parameters from ground-truth labels.
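For readers unfamiliar with Fellegi-Sunter EM, the sketch below shows the textbook conditional-independence version on binary agree/disagree patterns with aggregated counts. It is a simplified Python illustration, not what `il_estimate_em()` does internally: it ignores multi-level comparisons, missing values, term frequencies, fixed parameters, and the dependency-aware mode. The function name and starting values are assumptions; `max_iterations` and `convergence` only borrow their names from the parameters listed above.

```python
def em_fellegi_sunter(patterns, n_fields, prior=0.1, m_init=0.9, u_init=0.1,
                      max_iterations=50, convergence=1e-6):
    """Estimate prior, m, and u from {gamma tuple of 0/1: pair count}."""
    m = [m_init] * n_fields
    u = [u_init] * n_fields
    total = sum(patterns.values())
    for _ in range(max_iterations):
        # E-step: posterior match probability of each comparison pattern.
        post = {}
        for gamma in patterns:
            pm, pu = prior, 1.0 - prior
            for k, g in enumerate(gamma):
                pm *= m[k] if g else 1.0 - m[k]
                pu *= u[k] if g else 1.0 - u[k]
            post[gamma] = pm / (pm + pu)
        # M-step: re-estimate parameters from posterior-weighted counts.
        match_mass = sum(post[g] * c for g, c in patterns.items())
        new_prior = match_mass / total
        new_m = [sum(post[g] * c * g[k] for g, c in patterns.items()) / match_mass
                 for k in range(n_fields)]
        new_u = [sum((1 - post[g]) * c * g[k] for g, c in patterns.items())
                 / (total - match_mass) for k in range(n_fields)]
        change = max(abs(a - b) for a, b in zip(new_m + new_u, m + u))
        prior, m, u = new_prior, new_m, new_u
        if change < convergence:
            break
    return prior, m, u


# Toy aggregated gamma counts for three fields (invented for illustration).
patterns = {
    (1, 1, 1): 90, (1, 1, 0): 8, (1, 0, 1): 6, (0, 1, 1): 5,
    (1, 0, 0): 60, (0, 1, 0): 55, (0, 0, 1): 50, (0, 0, 0): 726,
}
prior, m, u = em_fellegi_sunter(patterns, n_fields=3)
```

Working on aggregated pattern counts rather than individual pairs keeps each iteration cheap, since the number of distinct patterns is usually far smaller than the number of candidate pairs.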
Prediction
- `predict()` supports both `threshold` (match probability) and `threshold_match_weight` (evidence-only log2 Bayes factor) filtering.
- Prediction output includes evidence-only `match_weight`, prior-inclusive `total_match_weight`, and posterior `match_probability`.
- `predict(type = "weights")` returns match weights on the log2 Bayes-factor scale, and `greedy = TRUE` adds deterministic one-to-one post-processing for `link` models.
- `include_fields = TRUE` joins all source columns into the scored output.
- `collect = FALSE` returns an `il_compared_lazy` object backed by a model-scoped in-database table.
- `il_score_missing_edges()` enumerates and scores unscored within-cluster pairs.
- `il_score_patterns()` scores compatible comparison-pattern tables, including dependency-aware pattern tables larger than the table used for fitting.
- `il_deterministic_link()` performs single-table exact-match deduplication without training.
- `il_find_matches()` scores a set of probe records against existing data.
- `profile_sql = TRUE` on `predict()` attaches lightweight SQL timing metadata to collected predictions or lazy prediction objects.
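The relationship between the three output columns can be sketched using the standard Fellegi-Sunter formulas under conditional independence. This Python sketch is an illustration of those formulas, not a description of irelink's internals: `score_pair` is a hypothetical function, and the m, u, and prior values are invented so that the arithmetic is easy to follow.

```python
import math


def score_pair(levels, prior):
    """Score one pair from the (m, u) probabilities of its observed levels.

    match_weight       : evidence only, sum of log2(m / u) across comparisons
    total_match_weight : evidence plus the prior log2 odds
    match_probability  : posterior, obtained by converting total log2 odds
    """
    match_weight = sum(math.log2(m / u) for m, u in levels)
    prior_weight = math.log2(prior / (1.0 - prior))
    total_match_weight = match_weight + prior_weight
    match_probability = 1.0 / (1.0 + 2.0 ** (-total_match_weight))
    return match_weight, total_match_weight, match_probability


# Two comparisons with Bayes factors 8 and 2, and prior odds of 1 to 16.
observed = [(0.8, 0.1), (0.5, 0.25)]
match_weight, total_match_weight, match_probability = score_pair(observed, prior=1 / 17)
```

In this toy case the evidence contributes about 4 bits, the prior contributes about -4 bits, and the posterior probability lands close to 0.5. It also shows why thresholding on `match_weight` and thresholding on `match_probability` are not interchangeable: the former ignores the prior.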
Diagnostics and evaluation
- `il_parameters()` and `il_weights()` expose the learned m/u parameters.
- `il_waterfall()` decomposes a pair’s match weight into per-comparison contributions.
- `il_training_history()` tracks parameter convergence across EM iterations.
- `il_completeness()` and `il_profile()` summarize data quality, and `il_profile()` accepts raw SQL expressions as column definitions (e.g., `"city || left(first_name, 1)"`).
- `il_unlinkables()` identifies records that cannot be linked under any blocking rule.
- `il_accuracy()`, `il_precision_recall()`, and `il_roc()` evaluate performance against labeled data.
- `il_errors()` surfaces false positives and false negatives.
- `il_graph_metrics()` computes node degree, node centrality, cluster density, cluster centralization, and bridge detection.
- `il_comparison_vectors()` returns the gamma pattern distribution from a trained model.
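Evaluation against labeled pairs reduces to comparing the set of predicted links at a given threshold with the set of true links. The sketch below shows that comparison in Python; it is a conceptual illustration rather than the output of `il_accuracy()` or `il_errors()`, and the function name, input layout, and example scores are assumptions.

```python
def precision_recall(scored, true_pairs, threshold):
    """Compare predicted links at a threshold with labeled true links.

    scored     : {(id_l, id_r): match probability}
    true_pairs : iterable of (id_l, id_r) known matches
    Pair order is ignored by storing each pair as a frozenset.
    """
    predicted = {frozenset(p) for p, prob in scored.items() if prob >= threshold}
    truth = {frozenset(p) for p in true_pairs}
    tp = len(predicted & truth)   # predicted and truly a match
    fp = len(predicted - truth)   # predicted but not a match
    fn = len(truth - predicted)   # true match that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


scored = {(1, 2): 0.95, (2, 3): 0.80, (4, 5): 0.60, (6, 7): 0.30}
labels = [(1, 2), (3, 2), (6, 7)]
result = precision_recall(scored, labels, threshold=0.5)
```

At a threshold of 0.5, the pair `(4, 5)` is a false positive and `(6, 7)` is a false negative. Sweeping the threshold and recomputing these counts is what produces a precision-recall or ROC curve.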
Data exploration
- `il_compare_records()` scores one explicit record pair against a spec without fitting a full model, and `il_string_similarity()` computes five string similarity metrics for a single pair.
- `il_comparator_score()` computes batch string similarity across a data frame with SQL-side scoring on DuckDB/PostgreSQL.
- `il_comparator_threshold_chart()` visualizes match rates at multiple similarity thresholds.
- `il_phonetic_chart()` produces a Soundex agreement heatmap.
- `il_tf_chart()` visualizes model-specific term frequency distributions with labeled most/least common values.
- `il_register_tf()` registers pre-computed term frequency tables in the database and returns the updated model.
Visualization
- `autoplot()` methods for `il_model`, `il_compared`, `il_training_history`, `il_accuracy`, `il_roc`, `il_precision_recall`, `il_unlinkables`, `il_completeness`, `il_count_pairs`, `il_profile`, `il_string_similarity`, `il_comparator_score`, and `il_comparison_vectors`.
- All chart types are composable with standard ggplot2 layers.
Datasets
- `fake_1000`: 1,000 records (250 entities) for deduplication.
- `fake_1000_labels`: 3,176 pairwise labels for evaluation.
- `fake_20`: minimal 20-record example.
- `febrl4a` / `febrl4b`: 5,000-record cross-table linkage benchmark from FEBRL.
SQL backends and persistence
- All computation runs inside a DBI-compatible database: DuckDB (recommended), SQLite, or PostgreSQL.
- Database-backed workflows support zero-copy registration from `dbplyr::tbl_lazy` references and existing table names, in addition to in-memory data frames.
- `il_save()` and `il_load()` support both RDS files and Splink settings JSON.
- `il_attach()` reattaches a saved model to different data or connections.
- `il_cleanup()` removes temporary tables owned by a single model, making it safe for shared DBI connections with multiple live models.
- `il_cleanup_all()` removes all package-owned temporary tables from a connection for exploratory sessions and failed runs.
Performance
- Gamma computation is pushed into DuckDB using native C++ string similarity functions.
- SQLite is retained as a fallback with R-side gamma computation via `stringdist`.
- DuckDB and PostgreSQL use SQL-native connected components, with an igraph fallback for SQLite.
- Term-frequency, lazy prediction, and scratch tables use generated model-scoped names to avoid collisions on shared connections.
- `profile_sql = TRUE` on `il_estimate_u()`, `il_estimate_prior()`, and `predict()` records lightweight SQL timing metadata for performance investigation.
- End-to-end benchmarks against an R-side SQLite baseline: 1,000 records in 1.4 s (2.1× faster), 5,000 records in 19.5 s (1.6× faster), 10,000 records in 61.4 s (2.6× faster). The speedup is largest at 10,000 records, although it does not increase monotonically across these three sizes.
