NFDI4DS | UHH-SEMS - Publication Details

An Efficient Parallel Top-k Similarity Join for Massive Multidimensional Data Using Spark

0202 electrical engineering, electronic engineering, information engineering

02 engineering and technology

DOI: 10.14257/ijdta.2015.8.3.06

Publication Date: 2016-02-25T02:22:13Z

AUTHORS (4)

Dehua Chen

Changgan Shen

Jieying Feng

Jiajin Le

ABSTRACT

Top-k similarity join is used in a wide range of applications that need to find the k most similar pairs of data records in a given database. Time performance becomes a challenging problem, however, as a growing number of applications must process massive data, and finding the top-k pairs in such volumes with traditional methods is impractical. In this paper, we propose an RDD-based algorithm that performs the top-k similarity join for massive multidimensional data on a large cluster of commodity machines using Spark. The algorithm loads a set of multidimensional records stored in HDFS and finally outputs an ordered list of the top-k closest pairs to HDFS. It consists of four steps. First, we develop an efficient distance function based on LSH (Locality Sensitive Hashing) to speed up pairwise similarity comparison. Second, to minimize the amount of data held during RDD run time, we conceptually split all pairs of LSH signatures into partitions. Third, we exploit a serial computation strategy to calculate all top-k closest pairs in parallel. Finally, all the local top-k pairs, sorted by their Hamming distances, contribute to the global top-k pairs. A performance evaluation comparing Spark and Hadoop confirms the effectiveness and scalability of our RDD-based algorithm.
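
The four steps described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation and does not use Spark: plain Python lists stand in for RDD partitions. The abstract does not say which LSH family is used, so random-hyperplane (sign) hashing is assumed here, and every function name and parameter (`num_bits`, `num_partitions`, `seed`) is hypothetical.

```python
import heapq
import random
from itertools import combinations


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes (assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(record, hyperplanes):
    """Pack one sign bit per hyperplane into an integer signature."""
    sig = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, record))
        sig = (sig << 1) | (1 if dot >= 0 else 0)
    return sig


def hamming(a, b):
    """Hamming distance between two integer bit signatures."""
    return bin(a ^ b).count("1")


def partition_pairs(n, num_partitions):
    """Split all index pairs (i, j), i < j, round-robin into partitions."""
    parts = [[] for _ in range(num_partitions)]
    for idx, pair in enumerate(combinations(range(n), 2)):
        parts[idx % num_partitions].append(pair)
    return parts


def local_top_k(pairs, signatures, k):
    """Scan one partition serially and keep its k closest pairs."""
    scored = ((hamming(signatures[i], signatures[j]), i, j) for i, j in pairs)
    return heapq.nsmallest(k, scored)


def top_k_similarity_join(records, k, num_bits=16, num_partitions=4, seed=0):
    """Return the k closest pairs as (hamming_distance, i, j) tuples."""
    planes = make_hyperplanes(num_bits, len(records[0]), seed)
    signatures = [lsh_signature(r, planes) for r in records]
    partitions = partition_pairs(len(records), num_partitions)
    # In a cluster setting each partition would be processed by a separate task.
    local_results = [local_top_k(p, signatures, k) for p in partitions]
    # Merge the local winners into the global top-k.
    return heapq.nsmallest(k, (t for local in local_results for t in local))
```

Calling `top_k_similarity_join(records, k=3)` on a list of equal-length numeric vectors returns up to three `(distance, i, j)` tuples in ascending Hamming distance. Merging local results is safe because any pair in the global top-k must also rank within the top-k of its own partition.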

CITATIONS (9)
