Load the required packages.
Load the RLdata500 data (taken from the RecordLinkage
package).
| fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id |
|---|---|---|---|---|---|---|---|---|
| CARSTEN | | MEIER | | 1949 | 7 | 22 | 1 | 34 |
| GERD | | BAUER | | 1968 | 7 | 27 | 2 | 51 |
| ROBERT | | HARTMANN | | 1930 | 4 | 30 | 3 | 115 |
| STEFAN | | WOLFF | | 1957 | 9 | 2 | 4 | 189 |
| RALF | | KRUEGER | | 1966 | 1 | 13 | 5 | 72 |
| JUERGEN | | FRANKE | | 1929 | 7 | 4 | 6 | 142 |
This dataset contains 500 rows with 450 entities.
Now we create a new column that concatenates the information in each row.
RLdata500[, id_count :=.N, ent_id] ## how many times given unit occurs
RLdata500[, bm:=sprintf("%02d", bm)] ## add leading zeros to month
RLdata500[, bd:=sprintf("%02d", bd)] ## add leading zeros to day
RLdata500[, txt:=tolower(paste0(fname_c1,fname_c2,lname_c1,lname_c2,by,bm,bd))]
head(RLdata500)

| fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id | id_count | txt |
|---|---|---|---|---|---|---|---|---|---|---|
| CARSTEN | | MEIER | | 1949 | 07 | 22 | 1 | 34 | 1 | carstenmeier19490722 |
| GERD | | BAUER | | 1968 | 07 | 27 | 2 | 51 | 2 | gerdbauer19680727 |
| ROBERT | | HARTMANN | | 1930 | 04 | 30 | 3 | 115 | 1 | roberthartmann19300430 |
| STEFAN | | WOLFF | | 1957 | 09 | 02 | 4 | 189 | 1 | stefanwolff19570902 |
| RALF | | KRUEGER | | 1966 | 01 | 13 | 5 | 72 | 1 | ralfkrueger19660113 |
| JUERGEN | | FRANKE | | 1929 | 07 | 04 | 6 | 142 | 1 | juergenfranke19290704 |
In the next step we pass the newly created column to the
blocking function. Setting verbose = 1 prints
information about the progress.
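The object printed further down reports its representation as "shingles", with two-character column names such as "ap". As a rough illustration of that idea, the base-R sketch below splits a key into overlapping two-character shingles and compares keys by the overlap of their shingle sets. The helpers `shingles()` and `jaccard()` are hypothetical names written for this example, not part of the blocking package, and Jaccard similarity is used only for illustration; it is not necessarily the distance the package computes.

```r
# Split a string into overlapping character n-grams ("shingles").
# Illustrative helper only; not the blocking package's internal code.
shingles <- function(s, n = 2L) {
  s <- tolower(s)
  n_starts <- max(nchar(s) - n + 1L, 0L)
  starts <- seq_len(n_starts)
  unique(substring(s, starts, starts + n - 1L))
}

# Jaccard similarity between the shingle sets of two strings.
jaccard <- function(a, b) {
  sa <- shingles(a)
  sb <- shingles(b)
  length(intersect(sa, sb)) / length(union(sa, sb))
}

key      <- "gerdbauer19680727"    # key of record 2 in the table above
key_typo <- "gerdbauer19680721"    # hypothetical variant with one changed digit
key_far  <- "stefanwolff19570902"  # key of a different person (record 4)

shingles(key)
jaccard(key, key_typo)  # high: the two sets differ in a single shingle each
jaccard(key, key_far)   # low: the keys share very few shingles
```

Keys that differ by a small typo keep most of their shingles, which is why a nearest-neighbour search over this representation tends to place near-duplicates close together.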
df_blocks <- blocking(x = RLdata500$txt, ann = "nnd", verbose = 1, graph = TRUE, seed = 2024)
#> ===== creating tokens =====
#> ===== starting search (nnd, x, y: 500, 500, t: 429) =====
#> ===== creating graph =====
Results are as follows: based on rnndescent, we have created 133 blocks.
df_blocks
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 133.
#> Number of columns used for blocking: 429.
#> Reduction ratio: 0.9917.
#> ========================================================
#> Distribution of the size of the blocks:
#> 2 3 4 5 6 7 8 9 10 11 17
#> 47 34 18 12 8 5 3 3 1 1 1
Structure of the object is as follows:

- result – a data.table with identifiers and block IDs,
- method – the method used,
- deduplication – whether deduplication was applied,
- representation – whether shingles or vectors were used,
- metrics – standard metrics and metrics based on the igraph::compare methods for comparing graphs (here NULL),
- confusion – confusion matrix (here NULL),
- colnames – column names used for the comparison,
- graph – an igraph object mainly for visualisation.

str(df_blocks,1)
#> List of 8
#> $ result :Classes 'data.table' and 'data.frame': 367 obs. of 4 variables:
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> $ method : chr "nnd"
#> $ deduplication : logi TRUE
#> $ representation: chr "shingles"
#> $ metrics : NULL
#> $ confusion : NULL
#> $ colnames : chr [1:429] "86" "ap" "av" "bf" ...
#> $ graph :Class 'igraph' hidden list of 10
#> - attr(*, "class")= chr "blocking"
Plot connections.
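A minimal sketch of such a plot, assuming the igraph package is installed. Because the example has to run on its own, it rebuilds a tiny graph from the first six (x, y) pairs of the result table shown in this document rather than using `df_blocks$graph`. It also assumes that blocks correspond to connected components of the graph, which the source does not state explicitly.

```r
library(igraph)

# First six pairs from df_blocks$result as shown in this document
# (x = reference record, y = query record).
pairs <- data.frame(
  x = c(1, 2, 2, 3, 4, 5),
  y = c(64, 43, 486, 450, 234, 128)
)

# Undirected graph with one edge per pair.
g <- graph_from_data_frame(pairs, directed = FALSE)

# Connected components; assumed here to play the role of blocks.
comp <- components(g)
comp$no

# Small vertices without labels keep a dense graph readable.
# With the real object one would pass df_blocks$graph instead of g.
plot(g, vertex.size = 1, vertex.label = NA)
```

In this toy graph, records 2, 43 and 486 form one component because record 2 is linked to both of the others.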
The resulting data.table has four columns:

- x – reference dataset (i.e. RLdata500) – this may not contain all units of RLdata500,
- y – query (each row of RLdata500) – this may not contain all units of RLdata500,
- block – the block ID,
- dist – distance between objects.

| x | y | block | dist |
|---|---|---|---|
| 1 | 64 | 33 | 0.4737987 |
| 2 | 43 | 1 | 0.0807453 |
| 2 | 486 | 1 | 0.4102322 |
| 3 | 450 | 88 | 0.4326335 |
| 4 | 234 | 12 | 0.5256584 |
| 5 | 128 | 2 | 0.5133357 |
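The reduction ratio in the printed summary (0.9917) can be reproduced from the block-size distribution in the same summary. The sketch below assumes the ratio is defined as one minus the share of all record pairs that still fall inside a common block; that definition is an assumption, although it reproduces the printed value.

```r
# Block-size distribution copied from the printed summary.
block_size <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 17)
n_blocks   <- c(47, 34, 18, 12, 8, 5, 3, 3, 1, 1, 1)

n_records    <- sum(block_size * n_blocks)             # records covered by blocks
pairs_all    <- choose(n_records, 2)                   # all possible record pairs
pairs_within <- sum(n_blocks * choose(block_size, 2))  # pairs sharing a block

# Assumed definition: share of candidate pairs removed by blocking.
reduction_ratio <- 1 - pairs_within / pairs_all
round(reduction_ratio, 4)
```

The distribution accounts for all 500 records in 133 blocks, leaving 1,030 within-block pairs out of 124,750 possible pairs.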
Create a long data.table with information on the blocks and
units from the original dataset.
df_block_melted <- melt(df_blocks$result, id.vars = c("block", "dist"))
df_block_melted_rec_block <- unique(df_block_melted[, .(rec_id=value, block)])
head(df_block_melted_rec_block)

| rec_id | block |
|---|---|
| 1 | 33 |
| 2 | 1 |
| 3 | 88 |
| 4 | 12 |
| 5 | 2 |
| 6 | 33 |
We add block information to the final dataset.
| fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id | id_count | txt | block_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CARSTEN | | MEIER | | 1949 | 07 | 22 | 1 | 34 | 1 | carstenmeier19490722 | 33 |
| GERD | | BAUER | | 1968 | 07 | 27 | 2 | 51 | 2 | gerdbauer19680727 | 1 |
| ROBERT | | HARTMANN | | 1930 | 04 | 30 | 3 | 115 | 1 | roberthartmann19300430 | 88 |
| STEFAN | | WOLFF | | 1957 | 09 | 02 | 4 | 189 | 1 | stefanwolff19570902 | 12 |
| RALF | | KRUEGER | | 1966 | 01 | 13 | 5 | 72 | 1 | ralfkrueger19660113 | 2 |
| JUERGEN | | FRANKE | | 1929 | 07 | 04 | 6 | 142 | 1 | juergenfranke19290704 | 33 |
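One way to attach the block IDs is a data.table update join on rec_id. The sketch below is an assumption about how this step could be written, not necessarily the code that produced the table above. It uses small stand-ins (`records`, `rec_block`) for RLdata500 and df_block_melted_rec_block, filled with the values shown in this document.

```r
library(data.table)

# Toy stand-ins for RLdata500 and df_block_melted_rec_block
# (values taken from the tables shown in this document).
records <- data.table(
  rec_id = 1:6,
  ent_id = c(34, 51, 115, 189, 72, 142)
)
rec_block <- data.table(
  rec_id = 1:6,
  block  = c(33, 1, 88, 12, 2, 33)
)

# Update join: copy `block` from rec_block into a new column block_id,
# matching rows on rec_id. Records without a match would get NA.
records[rec_block, on = "rec_id", block_id := i.block]
records
```

The update join modifies `records` by reference, so no reassignment is needed, which matches the `:=` style used elsewhere in this document.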
We can check how many blocks the records of each entity
(ent_id) fall into. In our example, the records of every
entity lie in a single block.
| uniq_blocks | N |
|---|---|
| 1 | 450 |
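A check of this kind can be sketched in two grouping steps with data.table: count the distinct blocks per entity, then tabulate those counts. The rows below are made-up toy data (the pairing of record 43 with entity 51 is hypothetical), and the variable names are illustrative.

```r
library(data.table)

# Hypothetical toy rows: entity 51 has two records, both in block 1.
records <- data.table(
  rec_id   = c(1, 2, 43, 3),
  ent_id   = c(34, 51, 51, 115),
  block_id = c(33, 1, 1, 88)
)

# Step 1: number of distinct blocks per entity.
blocks_per_entity <- records[, .(uniq_blocks = uniqueN(block_id)), by = ent_id]

# Step 2: how many entities have each count.
check <- blocks_per_entity[, .N, by = uniq_blocks]
check
```

If any entity had been split across blocks, the result would contain an additional row with uniq_blocks greater than 1.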
We can visualise the distances between units stored in the
df_blocks$result data set. Clearly we have a mixture of two
groups: matches (close to 0) and non-matches (close to 1).
hist(df_blocks$result$dist, xlab = "Distances", ylab = "Frequency", breaks = "fd",
main = "Distances calculated between units")
Finally, we can visualise the distances according to whether a block contains matches or not.
df_for_density <- copy(df_block_melted[block %in% RLdata500$block_id])
df_for_density[, match:= block %in% RLdata500[id_count == 2]$block_id]
plot(density(df_for_density[match==FALSE]$dist), col = "blue", xlim = c(0, 0.8),
main = "Distribution of distances between\nclusters type (match=red, non-match=blue)")
lines(density(df_for_density[match==TRUE]$dist), col = "red") ## x-axis limits are already set by plot()