irelink translates the Python splink
library into idiomatic R. This vignette maps common Splink patterns to
irelink so you can get started quickly.
Design differences
Splink uses an object-oriented design centered on a
Linker class. irelink uses a functional
pipeline that fits naturally in R. The Linker object’s
namespaced methods such as linker.training.* and
linker.inference.* become standalone functions that accept
and return an il_model object.
Splink bundles comparison levels into high-level comparison classes
such as JaroWinklerAtThresholds. In irelink,
the cl_*() functions fill the same role and can be passed
directly to il_compare().
Core workflow
| Step | splink (Python) | irelink (R) |
|---|---|---|
| Load data | splink_datasets.fake_1000 | fake_1000 |
| Choose backend | DuckDBAPI() | DBI::dbConnect(duckdb::duckdb()) |
| Define settings | SettingsCreator(...) | il_spec() \|> il_compare(...) \|> il_block_on(...) |
| Create model | Linker(df, settings, db_api) | il_model(df, spec = spec, con = con) |
| Estimate prior | linker.training.estimate_probability_two_random_records_match(...) | il_estimate_prior(model, ...) |
| Estimate u | linker.training.estimate_u_using_random_sampling(...) | il_estimate_u(model) |
| Estimate m (EM) | linker.training.estimate_parameters_using_expectation_maximisation(...) | il_estimate_em(model, ...) |
| Estimate m (labels) | linker.training.estimate_m_from_pairwise_labels(...) | il_estimate_m_from_labels(model, ...) |
| Predict | linker.inference.predict(...) | predict(model, ...) |
| Cluster | linker.clustering.cluster_pairwise_predictions_at_threshold(...) | il_cluster(pairs) |
| Deterministic link | linker.inference.deterministic_link() | il_deterministic_link(df, ...) |
| Find matches | linker.inference.find_matches_to_new_records(...) | il_find_matches(model, new_records, ...) |
irelink also supports
link_type = "link_and_dedupe" for two-table jobs where
duplicates may exist within each input table and across the two
tables.
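For readers new to the Cluster step in the workflow table, the idea can be sketched in plain Python. Splink's cluster_pairwise_predictions_at_threshold groups records by connected components over the pairs that clear the threshold. The sketch assumes il_cluster() behaves the same way, which this vignette does not state. The function name cluster_pairs, the inclusive comparison, and the sample pairs are all hypothetical.

```python
def cluster_pairs(pairs, threshold):
    """Group record ids into connected components with union-find.

    `pairs` is an iterable of (left_id, right_id, match_probability).
    Returns a dict mapping each record id to its cluster id, where the
    cluster id is the smallest record id in the component.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for left, right, probability in pairs:
        # Looking up both ids also registers records that never get linked.
        root_left, root_right = find(left), find(right)
        # Assumption: the threshold is inclusive.
        if probability >= threshold and root_left != root_right:
            parent[max(root_left, root_right)] = min(root_left, root_right)

    return {record: find(record) for record in parent}


# Hypothetical pairwise predictions: (id_l, id_r, match_probability)
example_pairs = [(1, 2, 0.99), (2, 3, 0.97), (4, 5, 0.60)]
clusters = cluster_pairs(example_pairs, threshold=0.95)
print(clusters)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 5}
```

Records 1, 2, and 3 end up in one cluster even though 1 and 3 were never compared directly, while the 4-5 pair falls below the threshold and stays separate.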
Comparison levels
Comparison levels are the building blocks used to score how similar
two records are on a field. Each cl_*() function
corresponds to a Splink comparison level class.
| splink (Python) | irelink (R) |
|---|---|
| ExactMatchLevel | cl_exact() |
| LevenshteinLevel | cl_levenshtein() |
| DamerauLevenshteinLevel | cl_damerau_levenshtein() |
| JaroLevel | cl_jaro() |
| JaroWinklerLevel | cl_jaro_winkler() |
| JaccardLevel | cl_jaccard() |
| CosineSimilarityLevel | cl_cosine() |
| AbsoluteDifferenceLevel | cl_numeric_diff() |
| PercentageDifferenceLevel | cl_pct_diff() |
| AbsoluteTimeDifferenceAtThresholds | cl_date_diff() |
| DistanceInKMLevel | cl_geo_distance() |
| ArrayIntersectLevel | cl_array_intersect() |
| CustomLevel | cl_custom() |
| NullLevel | cl_null() |
| ElseLevel | cl_else() |
| And | cl_and() |
| Or | cl_or() |
| Not | cl_not() |
Domain-specific comparisons
Splink provides high-level comparison classes for common field types.
In irelink, these are helper functions that return
preconfigured sets of levels.
| splink (Python) | irelink (R) |
|---|---|
| NameComparison | cl_name() |
| ForenameSurnameComparison | cl_forename_surname() |
| DateOfBirthComparison | cl_dob() |
| EmailComparison | cl_email() |
| PostcodeComparison | cl_postcode() |
Model inspection
| splink (Python) | irelink (R) |
|---|---|
| linker.visualisations.match_weights_chart() | il_weights(model) |
| linker.visualisations.parameter_estimate_comparisons_chart() | il_parameters(model) |
| linker.visualisations.waterfall_chart(...) | il_waterfall(pairs, ...) |
| linker.misc.query_comparison_details(...) | il_compare_records(record_a, record_b, ...) |
| linker.evaluation.prediction_errors_from_labels_column(...) | il_errors(model, ...) |
| linker.evaluation.unlinkables_chart() | il_unlinkables(model) |
Evaluation
| splink (Python) | irelink (R) |
|---|---|
| linker.evaluation.accuracy_chart_from_labels_column(...) | il_accuracy(model, ...) |
| linker.evaluation.precision_recall_chart_from_labels_column(...) | il_precision_recall(model, ...) |
| linker.evaluation.roc_chart_from_labels_column(...) | il_roc(model, ...) |
Data profiling
| splink (Python) | irelink (R) |
|---|---|
| linker.profile_columns(...) | il_profile(df, ...) |
| linker.count_num_comparisons_from_blocking_rule(...) | il_count_pairs(df, ...) |
| completeness profiling | il_completeness(df, ...) |
Persistence
| splink (Python) | irelink (R) |
|---|---|
| linker.misc.save_model_to_json(...) | il_save(model, path) |
| load_model_from_json(...) | il_load(path) |
| delete_tables_created_by_splink_from_db(...) | il_cleanup_all(con) |
| model-scoped cleanup | il_cleanup(model) |
Blocking rules
In Splink, you create blocking rules with block_on(),
and irelink uses the same function name. The main
difference is where the rules are used: Splink passes them into
SettingsCreator, while irelink adds them to a
spec with il_block_on() or passes them directly to training
functions.
```r
# blocking in the spec
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(surname)

# blocking in EM training
model <- il_estimate_em(model, block_on(surname))
```
Example: side-by-side deduplication
Below is a minimal deduplication example in both Splink and
irelink.
splink (Python):
```python
from splink import Linker, SettingsCreator, DuckDBAPI, block_on, splink_datasets
import splink.comparison_library as cl

df = splink_datasets.fake_1000
db_api = DuckDBAPI()

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
        cl.JaroWinklerAtThresholds("surname", [0.9, 0.7]),
        cl.ExactMatch("dob"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api)
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("surname")
)

pairwise = linker.inference.predict(threshold_match_probability=0.5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise, 0.95
)
```
irelink (R):
```r
library(irelink)

df <- fake_1000
con <- DBI::dbConnect(duckdb::duckdb())

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(first_name) |>
  il_block_on(surname)

model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))

pairs <- predict(model, threshold = 0.5)
clusters <- il_cluster(pairs)

il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)
```
The examples above use probability thresholds because those transfer
cleanly between Splink and irelink. In Splink, prediction
match_weight includes the prior odds. In
irelink, match_weight is evidence only, and
total_match_weight is the prior-inclusive log2 odds. Keep
that difference in mind if you translate match-weight thresholds between
the two packages.
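The relationship between the two scales can be sketched in a few lines of Python. This is an illustration, not code from either package: the function names and numbers are hypothetical. It assumes the conventions described above, namely that weights are log2 odds, that irelink's total_match_weight is the evidence-only match_weight plus the log2 prior odds, and that Splink's match_weight corresponds to that total.

```python
import math


def prior_weight(prior_probability: float) -> float:
    """Log2 prior odds that two randomly chosen records match."""
    return math.log2(prior_probability / (1.0 - prior_probability))


def weight_to_probability(total_weight: float) -> float:
    """Convert prior-inclusive log2 odds to a match probability."""
    odds = 2.0 ** total_weight
    return odds / (1.0 + odds)


def probability_to_weight(probability: float) -> float:
    """Convert a match probability back to prior-inclusive log2 odds."""
    return math.log2(probability / (1.0 - probability))


# Hypothetical numbers: a prior of 1 in 1,025 (prior odds of 1:1024)
# and 12 bits of evidence from the comparisons.
prior = 1 / 1025
evidence_weight = 12.0                                  # evidence only
total_weight = evidence_weight + prior_weight(prior)    # prior-inclusive
probability = weight_to_probability(total_weight)

print(f"{total_weight:.2f} {probability:.3f}")  # 2.00 0.800
```

Under these assumptions, a pair with 12 bits of evidence clears a probability threshold of 0.5, but its prior-inclusive weight is only 2, so a threshold written against one scale cannot be reused on the other without adding or removing the prior term.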
Example: finding matches against new records
splink (Python):
```python
import pandas as pd

# Reuses the trained `linker` from the deduplication example
new_records = pd.DataFrame([{
    "first_name": "Jhon", "surname": "Smith", "dob": "1990-01-15"
}])
results = linker.inference.find_matches_to_new_records(
    new_records, blocking_rules=[], match_weight_threshold=-10
)
```
irelink (R):
```r
# Assumes `model` is a trained il_model whose connection is still open
new_df <- data.frame(
  first_name = "Jhon",
  surname = "Smith",
  dob = "1990-01-15"
)
results <- il_find_matches(model, new_df, threshold = 0.5)
```
Splink uses a match-weight threshold for this workflow.
il_find_matches() filters on posterior match probability.
Translate those thresholds with the same caution as in the prediction
example above.
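To see how far apart the two thresholds in this example are, the sketch below converts Splink's match_weight_threshold of -10 to a probability. It assumes the threshold applies to prior-inclusive log2 odds, as described for Splink's match_weight earlier. The helper name is hypothetical and not part of either package.

```python
def weight_threshold_to_probability(weight_threshold: float) -> float:
    """Convert a log2-odds threshold to the equivalent match probability."""
    odds = 2.0 ** weight_threshold
    return odds / (1.0 + odds)


splink_threshold = -10  # value used in the Splink call above
equivalent_probability = weight_threshold_to_probability(splink_threshold)

print(f"{equivalent_probability:.6f}")  # 0.000976
```

Under that assumption, the Splink call keeps nearly every candidate pair, while threshold = 0.5 in the irelink call keeps only pairs that are more likely than not to match. The two snippets therefore illustrate the same workflow but will not return the same rows.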
