Skip to content

Latest commit

 

History

History
37 lines (27 loc) · 2.32 KB

File metadata and controls

37 lines (27 loc) · 2.32 KB

Quora-Duplicate-Question-Pairs-Identification

Quora Duplicate Questions Detection Quora Duplicate Questions Detection Quora Duplicate Questions Detection

Problem Statement

Quora is a fantastic site where queries are posed, addressed, followed, and altered by online businesses. As a result, people are better able to grasp the world and benefit from the other's knowledge. It's not surprising that many questions have similar wording since there are about 100 million visitors to Quora each month. Asking its users to write an answer to the same question is not a better strategy for Quora. Therefore, if there is a system that can recognize whether a new question is similar to ones that have already been addressed, that would be ideal.

So, determining if a pair of questions is duplicated or not is our problem statement.

Dataset Description

The goal of this project is to predict which of the provided pairs of questions contain two questions with the same meaning. The ground truth is the set of labels that have been supplied by human experts. The ground truth labels are inherently subjective, as the true meaning of sentences can never be known with certainty. Human labeling is also a 'noisy' process, and reasonable people will disagree. As a result, the ground truth labels on this dataset should be taken to be 'informed' but not 100% accurate, and may include incorrect labeling. We believe the labels, on the whole, to represent a reasonable consensus, but this may often not be true on a case by case basis for individual items in the dataset.

Data fields

id - the id of a training set question pair
qid1, qid2 - unique ids of each question (only available in train.csv)
question1, question2 - the full text of each question
is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

Python Libraries Used

  • Numpy
  • Pandas
  • Seaborn
  • Matplotlib
  • NLTK
  • FuzzyWuzzy
  • Scikit-learn
  • Keras
  • re
  • bs4