Symposium on Nuclear Electronics and Computing - NEC'2019

Name: Symposium on Nuclear Electronics and Computing - NEC'2019
Start: 2019-09-30T08:30:00+02:00
End: 2019-10-04T19:00:00+02:00
Location: Montenegro, Budva, Becici

Europe/Podgorica timezone

Support

nec2019@jinr.ru

Blocking strategies to accelerate record matching for Big Data integration

3 Oct 2019, 12:15

15m

Splendid Conference & SPA Resort, Conference Hall Petroviċa

Sectional: Machine Learning Algorithms and Big Data Analytics

Speaker: Mr Ivan Kadochnikov (JINR, PRUE)

Record matching is a key step in Big Data analysis and is especially important for combining disparate large data sources. Probabilistic record linkage methods provide a good framework for estimating and interpreting partial record matches. However, they require computing and combining string distances for every pair of compared records, so applying them directly means processing the Cartesian product of the record sets. A "blocking" step is therefore often used: candidate record pairs are required to match exactly on a categorical column, which greatly limits the number of record comparisons and the computational cost. This method, however, depends on a certain level of data quality and on agreement between the sources on the categorical column. We propose a more flexible approach for situations where no good blocking column can be chosen. The key idea is to use approximate nearest neighbor search as the blocking filter. One possible method is to vectorize a string column into term frequency vectors with TF or TF-IDF, then use Locality-Sensitive Hashing (LSH) to quickly search for approximate nearest neighbors in this vector space. Apache Spark libraries were used to show the effectiveness of this approach for linking open company registration datasets.
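The blocking idea described above can be illustrated with a minimal, pure-Python sketch. It is not the authors' Spark implementation: it assumes character trigrams as the terms, MinHash signatures as the hash family, and a simple banding scheme, and all function names and parameters (`shingles`, `minhash`, `candidate_pairs`, `num_hashes`, `bands`) are hypothetical choices made for illustration only.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Character n-grams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}


def _hash(seed, term):
    """Deterministic 64-bit hash of a term, parameterized by an integer seed."""
    digest = hashlib.blake2b(
        term.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(terms, num_hashes=32):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return tuple(min(_hash(seed, t) for t in terms) for seed in range(num_hashes))


def candidate_pairs(left, right, num_hashes=32, bands=16):
    """Return (left_index, right_index) pairs that share at least one LSH bucket."""
    if num_hashes % bands != 0:
        raise ValueError("num_hashes must be divisible by bands")
    rows = num_hashes // bands
    # Each bucket keeps the left-side and right-side record indices separately.
    buckets = defaultdict(lambda: ([], []))
    for side, records in enumerate((left, right)):
        for idx, name in enumerate(records):
            sig = minhash(shingles(name), num_hashes)
            for b in range(bands):
                key = (b, sig[b * rows:(b + 1) * rows])
                buckets[key][side].append(idx)
    pairs = set()
    for left_ids, right_ids in buckets.values():
        pairs.update((i, j) for i in left_ids for j in right_ids)
    return pairs


# Made-up company names, used only to exercise the sketch.
left = ["Acme Trading LLC", "Zenith Power GmbH"]
right = ["ACME Trading L.L.C.", "Zenith Power GmbH", "Nordic Timber AS"]
pairs = candidate_pairs(left, right)
print(f"{len(pairs)} candidate pairs instead of {len(left) * len(right)}")
```

Only the pairs returned by `candidate_pairs` would then be passed to the more expensive probabilistic comparison. Records with identical normalized names always share every bucket, while the chance that a near-duplicate is missed depends on the hypothetical `num_hashes` and `bands` settings.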

Primary author: Mr Ivan Kadochnikov (JINR, PRUE)

Co-author: Papoyan Vladimir (JINR)

Slides

4Kadochnikov_-_Blocking_strategies.pdf
