Changelog • irelink

irelink 0.0.1

Initial development release, translating Python’s splink probabilistic record linkage engine into idiomatic R.

Core pipeline

il_spec(), il_compare(), and il_block_on() define the linkage model declaratively: which fields to compare, how to compare them, and which blocking rules to apply.
il_model() binds a spec to one or two datasets and a DBI connection for dedupe, link, or link_and_dedupe workflows. It accepts in-memory data frames, dbplyr::tbl_lazy references, or the names of existing tables.
predict() scores all candidate pairs above a match-probability threshold (or an evidence-only match-weight threshold via threshold_match_weight).
il_cluster() resolves scored pairs into entity clusters via connected components (SQL-native on DuckDB and PostgreSQL, igraph on SQLite) or single-best-link (with source_dataset for cross-source filtering).
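
As a rough illustration of the connected-components step described above, the sketch below groups record ids with a union-find structure. It is a conceptual Python example, not irelink's implementation; the function name, the tuple layout of the scored pairs, and the 0.5 threshold are all assumptions made for the example.

```python
from collections import defaultdict


def connected_components(edges, threshold=0.5):
    """Group record ids into clusters using union-find over scored pairs.

    `edges` is an iterable of (id_left, id_right, match_probability).
    Pairs scoring below `threshold` do not link their records.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, prob in edges:
        # Look up both ids first so unlinked records become singleton clusters.
        root_left, root_right = find(left), find(right)
        if prob >= threshold and root_left != root_right:
            parent[root_right] = root_left

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical scored pairs: (left id, right id, match probability).
pairs = [("a", "b", 0.97), ("b", "c", 0.91), ("c", "d", 0.12), ("e", "f", 0.88)]
clusters = connected_components(pairs, threshold=0.5)
print(clusters)
```

Records "a", "b", and "c" end up in one cluster because the links are transitive, while "d" stays on its own because its only pair falls below the threshold.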

Comparison library

String similarity: cl_exact(), cl_levenshtein(), cl_damerau_levenshtein(), cl_jaro(), cl_jaro_winkler(), cl_jaccard(), cl_cosine().
Numeric and distance: cl_numeric_diff(), cl_pct_diff(), cl_geo_distance().
Temporal: cl_date_diff() for date proximity with days(), months(), and years() helpers, and cl_time_diff() for sub-day precision with seconds(), minutes(), and hours() helpers.
Collections: cl_array_intersect(), cl_array_subset(), cl_array_min_distance().
Domain-specific: cl_name(), cl_first_last_name() / cl_forename_surname() (both accept a companion column for the surname and handle first/last swap detection), cl_dob(), cl_email(), cl_domain(), cl_soundex(), cl_zip_code() (exact, 5-digit ZIP+4, and 3-digit Sectional Center Facility prefix levels), cl_postcode().
Composition: cl_levels(), cl_and(), cl_or(), cl_not(), cl_null(), cl_else(), cl_literal(), cl_custom(), cl_columns_reversed().
All applicable comparators accept term_frequency = TRUE for Fellegi-Sunter term-frequency adjustments.
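
To make the idea of graded comparison levels concrete, here is a small Python sketch that maps a pair of values to an ordinal agreement level using Levenshtein distance. The thresholds, the level numbering, and the use of -1 for missing values are assumptions chosen for the example and are not taken from irelink.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def comparison_level(left, right, thresholds=(1, 2)):
    """Map a value pair to an ordinal level; higher means closer agreement.

    Returns -1 for missing values (an assumed convention), the top level for
    exact matches, one level per distance threshold, and 0 otherwise.
    """
    if left is None or right is None:
        return -1
    if left == right:
        return len(thresholds) + 1
    distance = levenshtein(left, right)
    for rank, limit in enumerate(thresholds):
        if distance <= limit:
            return len(thresholds) - rank
    return 0


for pair in [("smith", "smith"), ("smith", "smyth"), ("smith", "smythe"), (None, "smith")]:
    print(pair, comparison_level(*pair))
```

With the default thresholds, an exact match lands in level 3, a one-edit difference in level 2, a two-edit difference in level 1, and anything further apart in level 0.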

Column transforms

il_transform() composes multiple R functions into a chainable transform with SQL-side nesting (e.g. TRIM(LOWER(col))).
Column transform factories for SQL-side column expressions: il_substr(), il_regex_extract(), il_nullif(), il_cast_to_string(), il_try_parse_date(), il_array_element().
Built-in transforms auto-translated to SQL: tolower, toupper, trimws.
Phonetic transforms: il_soundex(), il_metaphone(), il_dmetaphone() (usable as R functions and SQL macros).
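
The SQL-side nesting mentioned above can be pictured as wrapping a column name in one function call per transform. The Python sketch below shows only that string-building idea; the mapping table and the function name are hypothetical and say nothing about how irelink translates R functions internally.

```python
from functools import reduce

# Hypothetical mapping from transform names to SQL function names.
SQL_FUNCTIONS = {"tolower": "LOWER", "toupper": "UPPER", "trimws": "TRIM"}


def nest_sql(column, transforms):
    """Wrap `column` in SQL calls, applying `transforms` left to right.

    The first transform ends up innermost, matching the order in which the
    functions would be applied to the value.
    """
    return reduce(
        lambda expr, name: f"{SQL_FUNCTIONS[name]}({expr})",
        transforms,
        column,
    )


expression = nest_sql("first_name", ["tolower", "trimws"])
print(expression)
```

Applying tolower first and trimws second yields TRIM(LOWER(first_name)), the same shape as the example given for il_transform().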

Blocking

il_block_on() and block_on() for equality-based and custom SQL blocking rules, with per-column transform support via formula syntax (col ~ transform, e.g. first_name ~ il_substr(1, 3)) or a named-list .transform for programmatic construction.
.explode parameter for array-valued blocking columns (generates UNNEST subqueries for DuckDB/PostgreSQL).
il_count_pairs() estimates candidate-pair counts, including cumulative totals and percent-of-cartesian summaries across rule combinations.
il_suggest_blocking() ranks candidate blocking rules by pair-reduction, coverage, and balanced score.
il_find_blocking_below() finds blocking rule combinations below a pair count ceiling.
block_from_labels() measures per-column recall from labeled pairs.
il_largest_blocks() identifies the blocking keys that generate the most records and pairs, respecting blocking transforms.
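
The arithmetic behind candidate-pair counts is simple enough to sketch: for deduplication, a block of n records sharing a key contributes n(n - 1)/2 pairs. The Python example below works through that on a toy table; the records, the helper name, and the treatment of missing keys are assumptions for the example, not a description of il_count_pairs().

```python
from collections import Counter


def count_blocked_pairs(records, key):
    """Count dedupe candidate pairs produced by an equality blocking rule.

    A block of n records sharing the same key yields n * (n - 1) / 2 pairs.
    Records whose key is missing are assumed never to block together.
    """
    keys = (key(record) for record in records)
    sizes = Counter(k for k in keys if k is not None)
    return sum(n * (n - 1) // 2 for n in sizes.values())


# Toy data, invented for the example.
records = [
    {"first_name": "Ann", "city": "Leeds"},
    {"first_name": "Anne", "city": "Leeds"},
    {"first_name": "Bob", "city": "Leeds"},
    {"first_name": "Bob", "city": "York"},
    {"first_name": "Cara", "city": None},
]

n = len(records)
cartesian = n * (n - 1) // 2                                   # all possible pairs
by_city = count_blocked_pairs(records, lambda r: r["city"])
by_prefix = count_blocked_pairs(records, lambda r: r["first_name"][:3])
pct_of_cartesian = 100 * by_city / cartesian

print(cartesian, by_city, by_prefix, pct_of_cartesian)
```

Blocking on city keeps 3 of the 10 possible pairs, and blocking on the first three letters of the first name keeps 2, which is the kind of reduction a percent-of-cartesian summary reports.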

Training

il_estimate_u() estimates non-match probabilities by sampling random pairs, with optional chunked estimation through chunk_size and early stopping through min_count_per_level.
il_estimate_em() runs the Fellegi-Sunter EM algorithm with configurable max_iterations, convergence, fix_u, fix_m, fix_prior, derive_prior, estimate_without_tf, and estimator_mode parameters.
estimator_mode = "dependency-aware" fits log-linear matched and unmatched comparison-pattern distributions over aggregated gamma counts, preserving missing comparison states as explicit pattern levels.
il_estimate_prior() sets the prior match probability from deterministic matching rules, counting unique blocked pairs across overlapping rules.
il_prior_prevalence() and il_prior_m() add regularizing custom priors for EM, il_constrain_m() adds explicit fixed matched-class constraints, and il_priors() / il_constraints() expose the stored metadata.
il_estimate_m_from_labels() and il_estimate_m_from_column() initialize parameters from ground-truth labels.
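
For readers unfamiliar with Fellegi-Sunter EM, the following Python sketch runs a deliberately simplified version: binary agree/disagree comparisons, conditional independence between fields, and fixed starting values. It is not irelink's estimator; the pattern counts are invented, and the function name and defaults are assumptions.

```python
def em_fellegi_sunter(patterns, prior=0.1, max_iterations=200, tolerance=1e-8):
    """Estimate m, u, and the match prior from aggregated comparison patterns.

    `patterns` maps a tuple of 0/1 agreement flags to the number of pairs
    showing that pattern. No guard is included for degenerate inputs.
    """
    n_fields = len(next(iter(patterns)))
    total = sum(patterns.values())
    m = [0.9] * n_fields  # P(agree | match), starting guess
    u = [0.1] * n_fields  # P(agree | non-match), starting guess

    for _ in range(max_iterations):
        # E-step: expected number of matches within each pattern.
        weights = {}
        for gamma, count in patterns.items():
            p_match, p_non = prior, 1.0 - prior
            for k, agree in enumerate(gamma):
                p_match *= m[k] if agree else 1.0 - m[k]
                p_non *= u[k] if agree else 1.0 - u[k]
            weights[gamma] = count * p_match / (p_match + p_non)

        # M-step: re-estimate the parameters from the expected counts.
        matches = sum(weights.values())
        non_matches = total - matches
        new_m = [sum(w for g, w in weights.items() if g[k]) / matches
                 for k in range(n_fields)]
        new_u = [sum(patterns[g] - w for g, w in weights.items() if g[k]) / non_matches
                 for k in range(n_fields)]

        change = max(abs(a - b) for a, b in zip(new_m + new_u, m + u))
        m, u, prior = new_m, new_u, matches / total
        if change < tolerance:
            break
    return m, u, prior


# Invented pattern counts for three fields (1 = agree, 0 = disagree).
patterns = {
    (1, 1, 1): 90, (1, 1, 0): 8, (1, 0, 1): 6, (0, 1, 1): 5,
    (1, 0, 0): 60, (0, 1, 0): 70, (0, 0, 1): 50, (0, 0, 0): 711,
}
m, u, prior = em_fellegi_sunter(patterns)
print([round(x, 3) for x in m], [round(x, 3) for x in u], round(prior, 3))
```

Working on aggregated pattern counts rather than individual pairs keeps each iteration cheap, since the cost depends on the number of distinct patterns and not on the number of candidate pairs.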

Prediction

predict() supports both threshold (match probability) and threshold_match_weight (evidence-only log2 Bayes factor) filtering.
Prediction output includes evidence-only match_weight, prior-inclusive total_match_weight, and posterior match_probability.
predict(type = "weights") returns match weights on the log2 Bayes-factor scale, and greedy = TRUE adds deterministic one-to-one post-processing for link models.
include_fields = TRUE joins all source columns into the scored output.
collect = FALSE returns an il_compared_lazy object backed by a model-scoped in-database table.
il_score_missing_edges() enumerates and scores unscored within-cluster pairs.
il_score_patterns() scores compatible comparison-pattern tables, including dependency-aware pattern tables larger than the table used for fitting.
il_deterministic_link() performs single-table exact-match deduplication without training.
il_find_matches() scores a set of probe records against existing data.
profile_sql = TRUE on predict() attaches lightweight SQL timing metadata to collected predictions or lazy prediction objects.
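
The three output quantities named above are related by standard Fellegi-Sunter arithmetic, sketched here in Python. The m and u values and the prior are invented, and the function names are illustrative; the sketch assumes the usual definitions (evidence as a sum of log2 Bayes factors, prior as log2 odds) and has not been checked against irelink's output.

```python
import math


def match_weight(levels):
    """Evidence-only weight: sum of log2(m / u) over the observed levels."""
    return sum(math.log2(m / u) for m, u in levels)


def prior_weight(prior):
    """Prior match probability expressed as log2 odds."""
    return math.log2(prior / (1.0 - prior))


def match_probability(total_weight):
    """Convert a prior-inclusive log2-odds weight into a posterior probability."""
    return 1.0 / (1.0 + 2.0 ** (-total_weight))


# Invented (m, u) values for the comparison levels observed on one pair.
observed_levels = [(0.8, 0.1), (0.5, 0.125), (0.2, 0.8)]
prior = 0.2  # assumed prior probability that a random candidate pair matches

evidence = match_weight(observed_levels)   # log2(8) + log2(4) + log2(0.25)
total = evidence + prior_weight(prior)     # add log2(0.25) for the prior
posterior = match_probability(total)

print(evidence, total, posterior)
```

Here the evidence sums to about 3 bits, the prior contributes -2 bits, and the resulting total of about 1 bit corresponds to odds of 2 to 1, or a probability near 0.667.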

Diagnostics and evaluation

il_parameters() and il_weights() expose the learned m/u parameters.
il_waterfall() decomposes a pair’s match weight into per-comparison contributions.
il_training_history() tracks parameter convergence across EM iterations.
il_completeness() and il_profile() summarize data quality, and il_profile() accepts raw SQL expressions as column definitions (e.g., "city || left(first_name, 1)").
il_unlinkables() identifies records that cannot be linked under any blocking rule.
il_accuracy(), il_precision_recall(), and il_roc() evaluate performance against labeled data.
il_errors() surfaces false positives and false negatives.
il_graph_metrics() computes node degree, node centrality, cluster density, cluster centralization, and bridge detection.
il_comparison_vectors() returns the gamma pattern distribution from a trained model.
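
As a reminder of what threshold-based evaluation against labels involves, the Python sketch below computes precision and recall for a set of scored pairs. The data, the function name, and the dictionary layout are assumptions for the example and do not mirror the return values of il_accuracy() or il_precision_recall().

```python
def precision_recall(scored, true_matches, threshold):
    """Compare predicted links at `threshold` with labeled true matches.

    `scored` maps an (id, id) pair to a match probability. Pairs are treated
    as unordered, so (3, 4) and (4, 3) refer to the same link.
    """
    predicted = {frozenset(pair) for pair, prob in scored.items() if prob >= threshold}
    truth = {frozenset(pair) for pair in true_matches}

    tp = len(predicted & truth)   # predicted and labeled as matches
    fp = len(predicted - truth)   # predicted but not labeled as matches
    fn = len(truth - predicted)   # labeled matches that were not predicted

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Invented scores and labels.
scored = {(1, 2): 0.95, (3, 4): 0.80, (5, 6): 0.60, (7, 8): 0.30}
true_matches = [(1, 2), (4, 3), (7, 8)]

loose = precision_recall(scored, true_matches, threshold=0.5)
strict = precision_recall(scored, true_matches, threshold=0.9)
print(loose)
print(strict)
```

Raising the threshold from 0.5 to 0.9 removes the false positive and lifts precision to 1.0, but recall drops from two thirds to one third, which is the trade-off a precision-recall or ROC curve traces out.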

Data exploration

il_compare_records() scores one explicit record pair against a spec without fitting a full model, and il_string_similarity() computes five string similarity metrics for a single pair.
il_comparator_score() computes batch string similarity across a data frame with SQL-side scoring on DuckDB/PostgreSQL.
il_comparator_threshold_chart() visualizes match rates at multiple similarity thresholds.
il_phonetic_chart() produces a Soundex agreement heatmap.
il_tf_chart() visualizes model-specific term frequency distributions with labeled most/least common values.
il_register_tf() registers pre-computed term frequency tables in the database and returns the updated model.
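
A term frequency table is just the relative frequency of each value in a column. The Python sketch below builds one and shows a common way such a table can adjust an exact-match weight, by using the value's frequency in place of u. That substitution is an assumption based on the usual Fellegi-Sunter term-frequency adjustment; the names, data, and m value are invented.

```python
import math
from collections import Counter


def term_frequencies(values):
    """Relative frequency of each non-missing value in a column."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def tf_adjusted_weight(m_exact, value, tf):
    """Exact-match weight with u replaced by the value's term frequency.

    Assumption: two random non-matching records agree on `value` with a
    probability roughly equal to how common that value is.
    """
    return math.log2(m_exact / tf[value])


# Invented surname column with one missing entry.
surnames = ["Smith"] * 4 + ["Jones"] * 3 + ["Zubiri"] + [None]
tf = term_frequencies(surnames)

common = tf_adjusted_weight(0.9, "Smith", tf)
rare = tf_adjusted_weight(0.9, "Zubiri", tf)
print(tf, common, rare)
```

Agreement on the rare surname earns about two more bits of evidence than agreement on the common one, because "Zubiri" is four times less frequent than "Smith" in this column.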

Visualization

autoplot() methods for il_model, il_compared, il_training_history, il_accuracy, il_roc, il_precision_recall, il_unlinkables, il_completeness, il_count_pairs, il_profile, il_string_similarity, il_comparator_score, and il_comparison_vectors.
All chart types are composable with standard ggplot2 layers.

Datasets

fake_1000: 1,000 records (250 entities) for deduplication.
fake_1000_labels: 3,176 pairwise labels for evaluation.
fake_20: minimal 20-record example.
febrl4a / febrl4b: 5,000-record cross-table linkage benchmark from FEBRL.

SQL backends and persistence

Computation runs inside a DBI-compatible database: DuckDB (recommended), SQLite, or PostgreSQL. On SQLite, gamma computation and connected components fall back to R.
Database-backed workflows support zero-copy registration from dbplyr::tbl_lazy references and existing table names, in addition to in-memory data frames.
il_save() and il_load() support both RDS files and Splink settings JSON.
il_attach() reattaches a saved model to different data or connections.
il_cleanup() removes temporary tables owned by a single model, making it safe for shared DBI connections with multiple live models.
il_cleanup_all() removes all package-owned temporary tables from a connection for exploratory sessions and failed runs.

Performance

Gamma computation is pushed into DuckDB using native C++ string similarity functions.
SQLite is retained as a fallback with R-side gamma computation via stringdist.
DuckDB and PostgreSQL use SQL-native connected components, with an igraph fallback for SQLite.
Term-frequency, lazy prediction, and scratch tables use generated model-scoped names to avoid collisions on shared connections.
profile_sql = TRUE on il_estimate_u(), il_estimate_prior(), and predict() records lightweight SQL timing metadata for performance investigation.
End-to-end benchmarks against an R-side SQLite baseline: 1,000 records in 1.4 s (2.1× faster), 5,000 records in 19.5 s (1.6× faster), 10,000 records in 61.4 s (2.6× faster). The largest speedup is at 10,000 records, although the gain does not increase monotonically with dataset size.