30 September 2019 to 4 October 2019
Montenegro, Budva, Becici
Europe/Podgorica timezone

Blocking strategies to accelerate record matching for Big Data integration

3 Oct 2019, 12:15
15m
Splendid Conference & SPA Resort, Conference Hall Petroviċa

Sectional Machine Learning Algorithms and Big Data Analytics

Speaker

Mr Ivan Kadochnikov (JINR, PRUE)

Description

Record matching is a key step in Big Data analysis and is especially important for combining disparate large data sources. Probabilistic record linkage methods provide a good framework for estimating and interpreting partial record matches. However, they require computing and combining string distances for every compared pair of records, so applying probabilistic record linkage directly means processing the Cartesian product of the record sets. A "blocking" step is therefore often used: candidate record pairs are required to match exactly on a categorical column, which greatly reduces the number of record comparisons and the computational cost. This method, however, requires a certain level of data quality and agreement between the sources on the categorical column. We propose a more flexible approach for situations where no good blocking column can be chosen. The key idea is to use approximate nearest neighbor search as the blocking filter. One possible method is to vectorize one string column into term frequency vectors with TF or TF/IDF, then use Locality-Sensitive Hashing to quickly search for approximate nearest neighbors in this vector space. Apache Spark libraries were used to show the effectiveness of this approach for linking open company registration datasets.
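
To illustrate the idea of using approximate nearest neighbor search as the blocking filter, here is a minimal sketch in plain Python. It is not the implementation described in the talk, which used Apache Spark libraries and TF or TF/IDF vectors. As a stand-in, this sketch assumes character trigrams as the terms and MinHash signatures with banding as the Locality-Sensitive Hashing scheme. The function names (shingles, minhash_signature, candidate_pairs) and the parameters (num_hashes, bands) are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Return the set of character n-grams of a case- and space-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash of the token set under each seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def candidate_pairs(left, right, num_hashes=32, bands=16):
    """Blocking filter: return index pairs (i, j) whose signatures agree on
    at least one band. Only these pairs go on to the full string comparison."""
    rows = num_hashes // bands
    buckets = defaultdict(lambda: ([], []))  # band key -> (left ids, right ids)
    for side, records in enumerate((left, right)):
        for idx, name in enumerate(records):
            sig = minhash_signature(shingles(name), num_hashes)
            for b in range(bands):
                key = (b, sig[b * rows:(b + 1) * rows])
                buckets[key][side].append(idx)
    pairs = set()
    for left_ids, right_ids in buckets.values():
        pairs.update((i, j) for i in left_ids for j in right_ids)
    return pairs
```

Calling candidate_pairs(left_names, right_names) on two lists of company names returns the set of index pairs that would be passed to the probabilistic linkage step, instead of the full Cartesian product. More bands with fewer rows per band admit less similar pairs, trading extra comparisons for higher recall; the values 32 and 16 above are arbitrary assumptions, not tuned settings.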

Primary author

Mr Ivan Kadochnikov (JINR, PRUE)
