Large Scale Text Data Similarity

This project solves the problem of large-scale text data similarity computation using Apache Spark and the LSH (Locality-Sensitive Hashing) algorithm.

This repository was created by Stephane Miguel KAKANAKOU. For any questions or suggestions, you can contact me at Skakanakou@gmail.com

What is the goal of this project?

The goal of this project is, given a large set of documents, to find the similar pairs of documents efficiently. For that, I use Apache Spark with LSH. As data, I use Wikipedia articles, available at the following link: wikispeedia_articles_plaintext.tar.gz

Implementation Details

The implementation of the project divides into two main parts. In the first part, I prepare the data, and in the second part, I perform LSH to find candidate pairs and then use the Jaccard similarity to find the similar documents.
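For reference, the Jaccard similarity used in the second part is the size of the intersection of the two documents' token sets divided by the size of their union. A minimal sketch (the class and method names here are illustrative, not the repository's actual code):

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {
    /** Jaccard similarity |A ∩ B| / |A ∪ B| over two token sets. */
    static double similarity(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        int unionSize = a.size() + b.size() - inter.size();
        return (double) inter.size() / unionSize;
    }
}
```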

Preparation Part

The preparation part consists of the following steps:

  • STEP 1: First, I wrote code that finds and saves all the prime numbers smaller than 100000. Here is the link to the Java file FindPrimesNumbers.
  • STEP 2: I created a class that contains all the hashing methods I need in the project. I use the universal hashing technique to hash integers, and the DJB2 hashing technique to hash Strings as well as lists of integers; a sketch of both follows this list. Here is the link to the Java code Hasher.
  • STEP 3: I converted each document of the dataset into a set of k-shingles and hashed each k-shingle into a token, so each document becomes a set of tokens, which fits well in memory; the shingling sketch below illustrates this step. Here is the link to the Java file DataPreparation.
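As a rough illustration of STEPS 1 and 2, the sketch below combines a hard-coded prime (standing in for the output of FindPrimesNumbers) with a universal hash for integers and the DJB2 hash for Strings and lists of integers. All names and signatures are illustrative and may differ from the repository's Hasher class:

```java
import java.util.List;
import java.util.Random;

public class HasherSketch {
    // A prime, as produced by the sieve in STEP 1; here hard-coded
    // for illustration (99991 is the largest prime below 100000).
    private static final long P = 99991L;

    private final long a, b;

    public HasherSketch(Random rnd) {
        this.a = 1 + rnd.nextInt((int) (P - 1)); // a in [1, P-1]
        this.b = rnd.nextInt((int) P);           // b in [0, P-1]
    }

    /** Universal hashing for integers: h(x) = ((a*x + b) mod P) mod m. */
    public int hashInt(int x, int m) {
        return (int) (((a * (x & 0xffffffffL) + b) % P) % m);
    }

    /** DJB2 for Strings: start at 5381, then h = h*33 + c per character. */
    public static int djb2(String s) {
        int h = 5381;
        for (int i = 0; i < s.length(); i++) {
            h = ((h << 5) + h) + s.charAt(i);    // h * 33 + c
        }
        return h;
    }

    /** The same DJB2 recurrence applied to a list of integers. */
    public static int djb2(List<Integer> xs) {
        int h = 5381;
        for (int x : xs) {
            h = ((h << 5) + h) + x;
        }
        return h;
    }
}
```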
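And a sketch of STEP 3. Whether the project shingles on characters or words, and with which value of k, follows the DataPreparation code; this illustration uses character shingles:

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of STEP 3: turn one document into a set of hashed k-shingles. */
public class ShinglerSketch {

    /** Slide a k-character window over the text and DJB2-hash each shingle. */
    static Set<Integer> shingle(String text, int k) {
        Set<Integer> tokens = new HashSet<>();
        for (int i = 0; i + k <= text.length(); i++) {
            tokens.add(djb2(text.substring(i, i + k)));
        }
        return tokens;
    }

    /** DJB2, as in the Hasher sketch above. */
    static int djb2(String s) {
        int h = 5381;
        for (int i = 0; i < s.length(); i++) {
            h = ((h << 5) + h) + s.charAt(i);
        }
        return h;
    }
}
```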

Perform LSH

This part consists of the following steps:

  • STEP 1: Load the tokens obtained from the previous part.
  • STEP 2: Compute the signature matrix (a sketch of steps 2 to 4 follows this list).
  • STEP 3: Divide the signature matrix into bands and compute the hash value of each document in each band.
  • STEP 4: Identify the candidate pairs.
  • STEP 5: Find the similar documents by computing the Jaccard similarity between candidate pair documents.
  • STEP 6: Save the result to a file.
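To make steps 2 to 4 concrete, here is a minimal plain-Java sketch of the MinHash signature and banding logic. In the project these steps run as Spark transformations over the token sets; the class, method, and parameter names below are illustrative, and Arrays.hashCode stands in for the project's band hashing:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LshSketch {
    static final long P = 99991L; // a prime, as in the preparation part

    /** STEP 2: one MinHash signature per document, one row per hash function.
     *  as[i] and bs[i] are the random coefficients of the i-th universal hash. */
    static int[] signature(Set<Integer> tokens, long[] as, long[] bs) {
        int n = as.length;
        int[] sig = new int[n];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int t : tokens) {
            long x = t & 0xffffffffL;            // treat the token as non-negative
            for (int i = 0; i < n; i++) {
                int h = (int) ((as[i] * x + bs[i]) % P);
                if (h < sig[i]) sig[i] = h;      // keep the row-wise minimum
            }
        }
        return sig;
    }

    /** STEPS 3-4: hash each band of each signature; documents that share a
     *  bucket in the same band are candidate pairs. Assumes the number of
     *  hash functions is divisible by the number of bands. */
    static Map<String, List<Integer>> bandBuckets(List<int[]> sigs, int bands) {
        Map<String, List<Integer>> buckets = new HashMap<>();
        int rowsPerBand = sigs.get(0).length / bands;
        for (int doc = 0; doc < sigs.size(); doc++) {
            for (int b = 0; b < bands; b++) {
                int[] band = Arrays.copyOfRange(
                        sigs.get(doc), b * rowsPerBand, (b + 1) * rowsPerBand);
                String key = b + ":" + Arrays.hashCode(band);
                buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(doc);
            }
        }
        return buckets;  // every bucket with more than one doc yields candidates
    }
}
```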
