irelink is an idiomatic R translation of the Python Splink library. This vignette maps common Splink patterns to their irelink equivalents so you can get started quickly.

Design differences

Splink uses an object-oriented design centered on a Linker class. irelink uses a functional pipeline that fits naturally in R. The Linker object’s namespaced methods such as linker.training.* and linker.inference.* become standalone functions that accept and return an il_model object.

Splink bundles comparison levels into high-level comparison classes such as JaroWinklerAtThresholds. In irelink, the cl_*() functions fill the same role and can be passed directly to il_compare().

Core workflow

Step                 splink (Python)                                                          irelink (R)
Load data            splink_datasets.fake_1000                                                fake_1000
Choose backend       DuckDBAPI()                                                              DBI::dbConnect(duckdb::duckdb())
Define settings      SettingsCreator(...)                                                     il_spec() |> il_compare(...) |> il_block_on(...)
Create model         Linker(df, settings, db_api)                                             il_model(df, spec = spec, con = con)
Estimate prior       linker.training.estimate_probability_two_random_records_match(...)       il_estimate_prior(model, ...)
Estimate u           linker.training.estimate_u_using_random_sampling(...)                    il_estimate_u(model)
Estimate m (EM)      linker.training.estimate_parameters_using_expectation_maximisation(...)  il_estimate_em(model, ...)
Estimate m (labels)  linker.training.estimate_m_from_pairwise_labels(...)                     il_estimate_m_from_labels(model, ...)
Predict              linker.inference.predict(...)                                            predict(model, ...)
Cluster              linker.clustering.cluster_pairwise_predictions_at_threshold(...)         il_cluster(pairs)
Deterministic link   linker.inference.deterministic_link()                                    il_deterministic_link(df, ...)
Find matches         linker.inference.find_matches_to_new_records(...)                        il_find_matches(model, new_records, ...)

irelink also supports link_type = "link_and_dedupe" for two-table jobs where duplicates may exist within each input table and across the two tables.

Comparison levels

Comparison levels are the building blocks used to score how similar two records are on a field. Each cl_*() function corresponds to a Splink comparison level class.

splink (Python)                     irelink (R)
ExactMatchLevel                     cl_exact()
LevenshteinLevel                    cl_levenshtein()
DamerauLevenshteinLevel             cl_damerau_levenshtein()
JaroLevel                           cl_jaro()
JaroWinklerLevel                    cl_jaro_winkler()
JaccardLevel                        cl_jaccard()
CosineSimilarityLevel               cl_cosine()
AbsoluteDifferenceLevel             cl_numeric_diff()
PercentageDifferenceLevel           cl_pct_diff()
AbsoluteTimeDifferenceAtThresholds  cl_date_diff()
DistanceInKMLevel                   cl_geo_distance()
ArrayIntersectLevel                 cl_array_intersect()
CustomLevel                         cl_custom()
NullLevel                           cl_null()
ElseLevel                           cl_else()
And                                 cl_and()
Or                                  cl_or()
Not                                 cl_not()
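
To make the idea of ordered comparison levels concrete, here is a minimal Python sketch of how a record pair might be assigned to the first level whose condition holds, in the order null, exact, similarity thresholds, else. It is an illustration only and not the implementation used by either package: the function name assign_level is hypothetical, and the similarity score is supplied by the caller rather than computed.

```python
from typing import Optional, Sequence


def assign_level(
    left: Optional[str],
    right: Optional[str],
    similarity: float,
    thresholds: Sequence[float] = (0.9, 0.7),
) -> str:
    """Return the label of the first comparison level that applies.

    `similarity` is a precomputed string-similarity score in [0, 1];
    this sketch does not compute Jaro-Winkler itself.
    """
    # Null level: either value is missing, so no evidence either way.
    if left is None or right is None:
        return "null"
    # Exact-match level.
    if left == right:
        return "exact"
    # Threshold levels, checked from strictest to loosest.
    for threshold in thresholds:
        if similarity >= threshold:
            return f"similarity >= {threshold}"
    # Else level: everything that did not qualify above.
    return "else"


# The scores below are made-up inputs, not real Jaro-Winkler values.
print(assign_level("john", "john", 1.0))
print(assign_level("jhon", "john", 0.93))
print(assign_level("john", None, 0.0))
```

Because levels are checked in order, each pair lands in exactly one level, which is what allows a separate m and u probability to be estimated per level.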

Domain-specific comparisons

Splink provides high-level comparison classes for common field types. In irelink, these are helper functions that return preconfigured sets of levels.

splink (Python)            irelink (R)
NameComparison             cl_name()
ForenameSurnameComparison  cl_forename_surname()
DateOfBirthComparison      cl_dob()
EmailComparison            cl_email()
PostcodeComparison         cl_postcode()

Model inspection

splink (Python)                                               irelink (R)
linker.visualisations.match_weights_chart()                   il_weights(model)
linker.visualisations.parameter_estimate_comparisons_chart()  il_parameters(model)
linker.visualisations.waterfall_chart(...)                    il_waterfall(pairs, ...)
linker.misc.query_comparison_details(...)                     il_compare_records(record_a, record_b, ...)
linker.evaluation.prediction_errors_from_labels_column(...)   il_errors(model, ...)
linker.evaluation.unlinkables_chart()                         il_unlinkables(model)

Evaluation

splink (Python)                                                   irelink (R)
linker.evaluation.accuracy_chart_from_labels_column(...)          il_accuracy(model, ...)
linker.evaluation.precision_recall_chart_from_labels_column(...)  il_precision_recall(model, ...)
linker.evaluation.roc_chart_from_labels_column(...)               il_roc(model, ...)

Data profiling

splink (Python)                                       irelink (R)
linker.profile_columns(...)                           il_profile(df, ...)
linker.count_num_comparisons_from_blocking_rule(...)  il_count_pairs(df, ...)
completeness profiling                                il_completeness(df, ...)

Persistence

splink (Python)                               irelink (R)
linker.misc.save_model_to_json(...)           il_save(model, path)
load_model_from_json(...)                     il_load(path)
delete_tables_created_by_splink_from_db(...)  il_cleanup_all(con)
model-scoped cleanup                          il_cleanup(model)

Blocking rules

In Splink, you create blocking rules with block_on(), and irelink uses the same function name. The main difference is where the rules are used: Splink passes them into SettingsCreator, while irelink adds them to a spec with il_block_on() or passes them directly to training functions.

# blocking in the spec
spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_block_on(surname)

# blocking in EM training
model <- il_estimate_em(model, block_on(surname))
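
As a rough illustration of why blocking rules matter, the following Python sketch counts the candidate pairs produced by blocking on a single column and compares that with the number of pairs when every record is compared with every other. The records and both helper functions are hypothetical; the sketch does not call Splink or irelink and only mirrors the kind of figure a pair-counting helper would report.

```python
from collections import Counter

# Hypothetical toy records; real inputs would come from a data frame.
records = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Smith"},
    {"id": 3, "surname": "Jones"},
    {"id": 4, "surname": "Smith"},
    {"id": 5, "surname": "Jones"},
]


def count_pairs_unblocked(rows):
    """Number of distinct pairs when every record is compared with every other."""
    n = len(rows)
    return n * (n - 1) // 2


def count_pairs_blocked(rows, key):
    """Number of distinct pairs when only records sharing `key` are compared."""
    # Records with a missing key value never fall into a block.
    block_sizes = Counter(row[key] for row in rows if row[key] is not None)
    return sum(size * (size - 1) // 2 for size in block_sizes.values())


print(count_pairs_unblocked(records))           # all pairs
print(count_pairs_blocked(records, "surname"))  # pairs within surname blocks
```

On five records the difference is small, but the unblocked count grows quadratically with the number of records, which is why both packages ask for blocking rules before prediction.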

Example: side-by-side deduplication

Below is a minimal deduplication example in both Splink and irelink.

splink (Python):

from splink import Linker, SettingsCreator, DuckDBAPI, block_on, splink_datasets
import splink.comparison_library as cl

df = splink_datasets.fake_1000
db_api = DuckDBAPI()

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name", [0.9, 0.7]),
        cl.JaroWinklerAtThresholds("surname", [0.9, 0.7]),
        cl.ExactMatch("dob"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api)
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("surname")
)

pairwise = linker.inference.predict(threshold_match_probability=0.5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise, 0.95
)

irelink (R):

library(irelink)

df <- fake_1000
con <- DBI::dbConnect(duckdb::duckdb())

spec <- il_spec() |>
  il_compare(first_name, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(surname, cl_jaro_winkler(0.9, 0.7)) |>
  il_compare(dob, cl_exact()) |>
  il_block_on(first_name) |>
  il_block_on(surname)

model <- il_model(df, spec = spec, con = con)
model <- il_estimate_u(model)
model <- il_estimate_em(model, block_on(surname))

pairs <- predict(model, threshold = 0.5)
clusters <- il_cluster(pairs)

il_cleanup(model)
DBI::dbDisconnect(con, shutdown = TRUE)

The examples above use probability thresholds because those transfer cleanly between Splink and irelink. Match weights do not: in Splink, the match_weight reported with predictions includes the prior odds, whereas in irelink match_weight reflects the evidence only and total_match_weight holds the prior-inclusive log2 odds. Keep that difference in mind if you translate match-weight thresholds between the two packages.
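
The relationship between the two kinds of weight can be sketched in a few lines of Python. This assumes, as the paragraph above describes, that the prior-inclusive weight is the evidence-only weight plus the log2 prior odds; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def prior_weight(prior_probability: float) -> float:
    """Log2 odds that two randomly chosen records match."""
    return math.log2(prior_probability / (1.0 - prior_probability))


def weight_to_probability(total_weight: float) -> float:
    """Convert prior-inclusive log2 odds into a match probability."""
    odds = 2.0 ** total_weight
    return odds / (1.0 + odds)


# Hypothetical values, not taken from any fitted model.
prior = 0.001            # probability that two random records match
evidence_weight = 12.0   # evidence-only weight (irelink's match_weight)

# Prior-inclusive weight (irelink's total_match_weight, Splink's match_weight).
total_weight = prior_weight(prior) + evidence_weight
probability = weight_to_probability(total_weight)

print(round(total_weight, 3), round(probability, 3))
```

A threshold expressed on the evidence-only scale therefore has to be shifted by the prior weight before it is comparable with a Splink match-weight threshold.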

Example: finding matches against new records

splink (Python):

import pandas as pd

new_records = pd.DataFrame([{
    "first_name": "Jhon", "surname": "Smith", "dob": "1990-01-15"
}])
results = linker.inference.find_matches_to_new_records(
    new_records, blocking_rules=[], match_weight_threshold=-10
)

irelink (R):

new_df <- data.frame(
  first_name = "Jhon",
  surname = "Smith",
  dob = "1990-01-15"
)
results <- il_find_matches(model, new_df, threshold = 0.5)

In this workflow Splink filters on a match-weight threshold, whereas il_find_matches() filters on posterior match probability, so the two calls above do not apply the same cutoff. Translate those thresholds with the same caution as in the prediction example above.
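
To compare the two cutoffs, a prior-inclusive match weight can be converted to a probability and back. The Python sketch below uses hypothetical helper names and relies only on the log2-odds definition of match weight; it does not call either package.

```python
import math


def match_weight_to_probability(weight: float) -> float:
    """Prior-inclusive log2 odds -> posterior match probability."""
    odds = 2.0 ** weight
    return odds / (1.0 + odds)


def probability_to_match_weight(probability: float) -> float:
    """Posterior match probability -> prior-inclusive log2 odds."""
    return math.log2(probability / (1.0 - probability))


# Threshold used in the Splink call above, expressed as a probability.
print(match_weight_to_probability(-10))

# Threshold used in the irelink call above, expressed as a match weight.
print(probability_to_match_weight(0.5))
```

A match weight of -10 corresponds to a probability of 1/1025, far more permissive than 0.5, so the Splink call as written returns many more candidate matches than the irelink call.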