An R package for blocking records for record linkage / data deduplication based on approximate nearest neighbours algorithms.

ncn-foreigners/blocking

Overview

Description

A small package used to block records for data deduplication and record linkage (entity resolution) based on approximate nearest neighbours algorithms (ANN) and graphs (via igraph).

Currently supports the following R packages that bind to specific ANN algorithms

The package also supports integration with the reclin2 package via the blocking::pair_ann function.
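
To make the idea behind the description concrete, here is a minimal base R sketch of blocking by nearest neighbours. It is not the package's implementation: it assumes character 2-gram shingles and cosine similarity, replaces the ANN search with an exact search, and replaces igraph with a simple label-propagation loop. The helper `shingles` is hypothetical and exists only for this illustration.

```r
# Illustrative only: exact nearest-neighbour search in base R, standing in
# for the ANN search and the igraph step mentioned in the description.
txt <- c("jankowalski", "kowalskijan", "kowalskimjan", "kowaljan",
         "montypython", "pythonmonty", "cyrkmontypython", "monty")

# Hypothetical helper: split a string into overlapping character n-grams.
shingles <- function(s, n = 2) {
  starts <- seq_len(nchar(s) - n + 1)
  substring(s, starts, starts + n - 1)
}
grams <- lapply(txt, shingles)
vocab <- sort(unique(unlist(grams)))

# Binary record-by-shingle matrix (one row per record).
m <- t(vapply(grams, function(g) as.numeric(vocab %in% g),
              numeric(length(vocab))))

# Cosine similarity between all records; exclude self-matches.
norms <- sqrt(rowSums(m^2))
sim <- (m %*% t(m)) / (norms %o% norms)
diag(sim) <- -Inf

# Exact nearest neighbour of each record (an ANN index would approximate this).
nn <- apply(sim, 1, which.max)

# Connected components of the nearest-neighbour graph: push the smallest
# label along every edge, in both directions, until nothing changes.
block <- seq_along(txt)
repeat {
  previous <- block
  for (i in seq_along(nn)) {
    lowest <- min(block[i], block[nn[i]])
    block[i] <- lowest
    block[nn[i]] <- lowest
  }
  if (all(block == previous)) break
}
block <- match(block, unique(block))  # relabel blocks as 1, 2, ...

data.frame(txt = txt, block = block)
```

On this toy data the two name groups share no 2-grams, so every record's nearest neighbour lies in its own group and the components split the records into two blocks. An exact search compares every pair of records, which is the cost that ANN algorithms are meant to avoid on larger data.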

Funding

Work on this package is supported by the National Science Centre, OPUS 22 grant no. 2020/39/B/HS4/00941.

Installation

You can install the development version of blocking from GitHub with:

# install.packages("remotes") # uncomment if needed
remotes::install_github("ncn-foreigners/blocking")

Basic usage

Load packages for the examples

library(blocking)
library(reclin2)
#> Loading required package: data.table
#> 
#> Attaching package: 'reclin2'
#> The following object is masked from 'package:base':
#> 
#>     identical

Generate simple data with two groups.

df_example <- data.frame(txt = c(
  "jankowalski",
  "kowalskijan",
  "kowalskimjan",
  "kowaljan",
  "montypython",
  "pythonmonty",
  "cyrkmontypython",
  "monty"
))
df_base <- data.frame(txt = c("montypython", "kowalskijan"))

df_example
#>               txt
#> 1     jankowalski
#> 2     kowalskijan
#> 3    kowalskimjan
#> 4        kowaljan
#> 5     montypython
#> 6     pythonmonty
#> 7 cyrkmontypython
#> 8           monty

df_base
#>           txt
#> 1 montypython
#> 2 kowalskijan

Deduplication using blocking

blocking_result <- blocking(x = df_example$txt)
#> 'as(<dgTMatrix>, "dgCMatrix")' is deprecated.
#> Use 'as(., "CsparseMatrix")' instead.
#> See help("Deprecated") and help("Matrix-deprecated").
## data frame with indices and block 
blocking_result
#> Blocking based on the hnsw method.
#> Number of blocks: 2.
#> Number of columns used for blocking: 28.
#> Distribution of the size of the blocks:
#> 1 
#> 2

Table with blocking

blocking_result$result
#>        x     y block
#>    <int> <int> <num>
#> 1:     1     2     1
#> 2:     2     1     1
#> 3:     2     3     1
#> 4:     2     4     1
#> 5:     5     6     2
#> 6:     5     7     2
#> 7:     5     8     2
#> 8:     6     5     2
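
The pair table can be turned into a per-record block assignment. The sketch below uses base R only and re-enters the printed values by hand, so it does not depend on the package; the names `pairs`, `membership` and `block_sizes` are made up for this example.

```r
# Pair table copied from the printed output above.
pairs <- data.frame(
  x     = c(1, 2, 2, 2, 5, 5, 5, 6),
  y     = c(2, 1, 3, 4, 6, 7, 8, 5),
  block = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Stack both ends of every pair, then keep one row per record.
membership <- unique(data.frame(
  record = c(pairs$x, pairs$y),
  block  = c(pairs$block, pairs$block)
))
membership <- membership[order(membership$record), ]
rownames(membership) <- NULL

# Number of records assigned to each block.
block_sizes <- table(membership$block)

membership
block_sizes
```

With a real result, `blocking_result$result` would take the place of the hand-typed `pairs` data frame.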

Deduplication followed by the reclin2 package

pair_ann(x = df_example, on = "txt") |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.55) |>
  link(selection = "threshold")
#>   Total number of pairs: 10 pairs
#> 
#> Key: <.y>
#>        .y    .x       txt.x           txt.y
#>     <int> <int>      <char>          <char>
#>  1:     2     1 jankowalski     kowalskijan
#>  2:     3     1 jankowalski    kowalskimjan
#>  3:     3     2 kowalskijan    kowalskimjan
#>  4:     4     1 jankowalski        kowaljan
#>  5:     4     2 kowalskijan        kowaljan
#>  6:     6     5 montypython     pythonmonty
#>  7:     7     5 montypython cyrkmontypython
#>  8:     7     6 pythonmonty cyrkmontypython
#>  9:     8     5 montypython           monty
#> 10:     8     6 pythonmonty           monty

Record linkage

pair_ann(x = df_base, y = df_example, on = "txt", deduplication = FALSE) |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.55) |>
  link(selection = "threshold")
#>   Total number of pairs: 8 pairs
#> 
#> Key: <.y>
#>       .y    .x       txt.x           txt.y
#>    <int> <int>      <char>          <char>
#> 1:     1     2 kowalskijan     jankowalski
#> 2:     2     2 kowalskijan     kowalskijan
#> 3:     3     2 kowalskijan    kowalskimjan
#> 4:     4     2 kowalskijan        kowaljan
#> 5:     5     1 montypython     montypython
#> 6:     6     1 montypython     pythonmonty
#> 7:     7     1 montypython cyrkmontypython
#> 8:     8     1 montypython           monty
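
The pair counts reported in the two examples show how much work blocking saves compared with comparing every possible pair. A rough way to quantify this is the reduction ratio, sketched below in base R; `reduction_ratio` is a hypothetical helper for this illustration and is not part of blocking or reclin2.

```r
# Hypothetical helper: share of candidate comparisons avoided by blocking.
reduction_ratio <- function(n_pairs_blocked, n_pairs_all) {
  1 - n_pairs_blocked / n_pairs_all
}

n_example <- 8  # rows in df_example
n_base    <- 2  # rows in df_base

# Deduplication: all unordered pairs within df_example versus the
# 10 pairs reported by pair_ann().
all_dedup <- choose(n_example, 2)
rr_dedup  <- reduction_ratio(10, all_dedup)

# Record linkage: all (df_base, df_example) combinations versus the
# 8 pairs reported by pair_ann().
all_link <- n_base * n_example
rr_link  <- reduction_ratio(8, all_link)

c(deduplication = rr_dedup, linkage = rr_link)
```

On data this small the saving is modest; the gap between all pairs and blocked pairs grows quickly with the number of records, because the number of all pairs grows quadratically.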

See also

See the section Data Integration (Statistical Matching and Record Linkage) in the Official Statistics Task View.

Packages that allow blocking:

  • klsh – k-means locality sensitive hashing,
  • reclin2 – pair_blocking, pair_minsim functions,
  • fastLink – blockData function.
