- OVERVIEW
- MOTIVATION
- SOURCES/USEFUL LINKS
- PROBLEM STATEMENT
- SOLUTION
- WHICH TYPE OF ML PROBLEM IS THIS?
- WHAT IS THE BEST PERFORMANCE METRIC FOR THIS PROBLEM?
- BUSINESS OBJECTIVES AND CONSTRAINTS
- DATA OVERVIEW
- TRAIN AND TEST RATIO
- AGENDA
- TECHNICAL ASPECT
- INSTALLATION
Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.
Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.
Credits: Kaggle
What could be a good way to use the unfortunate lockdown period? Using the time productively is one of the smarter ways to handle the situation. Like most of you, I spend my weekends on YouTube, Netflix, coding, and reading research papers. The idea of classifying “Is that a duplicate Quora question?” struck me while I was browsing through some research papers, especially when I found a YouTube video by Kaggle Grandmaster Abhishek Thakur on this topic. I found a relevant research paper associated with it, and that led me to collect the “Is that a duplicate Quora question?” dataset to train a machine learning model.
- Video Link : https://www.youtube.com/watch?v=vA1V8A69e9c
- SlideShare Link : https://www.slideshare.net/abhishekkrthakur/is-that-a-duplicate-quora-question
- Blog 1 : https://www.dropbox.com/sh/93968nfnrzh8bp5/AACZdtsApc1QSTQc7X0H3QZ5a?dl=0&preview=2761178.pdf
- Blog 2 : https://www.dropbox.com/sh/93968nfnrzh8bp5/AACZdtsApc1QSTQc7X0H3QZ5a?dl=0&preview=NilssonTiman.pdf
- Identify which questions asked on Quora are duplicates of questions that have already been asked.
- This could be useful to instantly provide answers to questions that have already been answered.
- We are tasked with predicting whether a pair of questions are duplicates or not.
Suppose we have a fairly large data set of question-pairs that has been labeled (by humans) as “duplicate” or “not duplicate.” We could then use natural language processing (NLP) techniques to extract the difference in meaning or intent of each question-pair, use machine learning (ML) to learn from the human-labeled data, and predict whether a new pair of questions is duplicate or not.
It is a binary classification problem: for a given pair of questions, we need to predict whether they are duplicates or not.
- log-loss: https://www.kaggle.com/wiki/LogarithmicLoss
- Qns: Why is log-loss the right metric for this problem?
Ans: This is a binary classification problem, but that does not mean we only want an output of “0” or “1”. We want “p(q1 ≈ q2)”, a probability between 0 and 1. When a model predicts probabilities for a binary classification problem, log-loss is one of the best metrics, because it penalizes confident wrong predictions heavily.
- Binary Confusion Matrix
- The cost of a mis-classification can be very high.
- You would want a probability of a pair of questions to be duplicates so that you can choose any threshold of choice.
- Qsn: Why do we want to be able to choose any threshold?
Ans: We want “p(q1 ≈ q2)”, a probability between 0 and 1, so we can choose a threshold above which we treat “q1 ≈ q2” as confirmed.
- Example: If we choose a threshold of 0.95, we declare q1 ≈ q2 only when p > 0.95.
- Benefit of choosing a threshold: Suppose we set the threshold to 0.95 and human readers report that the answer shown is wrong for the question. We can then change the threshold.
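The metric and the threshold idea above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the project's code; the function names `binary_log_loss` and `apply_threshold` and the sample probabilities are made up for the example.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def apply_threshold(p_pred, threshold):
    """Turn probabilities into 0/1 decisions: duplicate only when p > threshold."""
    return [int(p > threshold) for p in p_pred]

# Hypothetical predictions for four question pairs.
y_true = [1, 0, 1, 0]
p_pred = [0.97, 0.10, 0.60, 0.40]
print(binary_log_loss(y_true, p_pred))
print(apply_threshold(p_pred, 0.95))  # [1, 0, 0, 0]
```

Because the model outputs probabilities, the threshold can be changed later without retraining.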
- No strict latency concerns.
- Interpretability is partially important.
- Data will be in a file Train.csv
- Train.csv contains 6 columns: id, qid1, qid2, question1, question2, is_duplicate
- Number of rows in Train.csv = 404,290
id | qid1 | qid2 | question1 | question2 | is_duplicate |
---|---|---|---|---|---|
0 | 1 | 2 | What is the step by step guide to invest in share market in India? | What is the step by step guide to invest in share market? | 0 |
1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? | 0 |
We build the train and test sets by randomly splitting the data in a 60:40 or 70:30 ratio. Either works, because we have sufficient data points to work with.
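A random split like this can be sketched without any library. Note one assumption that goes beyond the text: this sketch splits each class separately (a stratified split) so that the train and test sets keep the same share of duplicates. The function name `stratified_split` and the toy labels are hypothetical.

```python
import random

def stratified_split(labels, test_ratio=0.3, seed=42):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within the class
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with roughly the class balance of the train data.
labels = [0] * 63 + [1] * 37
train_idx, test_idx = stratified_split(labels, test_ratio=0.3)
print(len(train_idx), len(test_idx))  # 70 30
```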
-
Some analysis of the train data set:
-
A closer look at the data set (question pairs):
-
Output:
(1). Total number of question pairs for training: 404290
(2). Question pairs that are not similar (is_duplicate = 0): 63.08%
(3). Question pairs that are similar (is_duplicate = 1): 36.92%
-
The class distribution above, plotted as a bar chart:
df_train.groupby("is_duplicate")["id"].count().plot.bar()
The graph shows that the negative class (is_duplicate=0) has more question pairs than the positive class (is_duplicate=1), so the data set is imbalanced.
-
-
Next, a closer look at the number of unique questions:
-
Output:
(1). Total number of Unique Questions are: - 537933
(2). Number of unique questions that appear more than one time: - 111780 (20.7%)
(3). Max number of times a single question is repeated:- 157
-
Plotting Number of occurrences of each question:
Most questions appear only a few times, and very few questions appear many times. The most frequently repeated question appears 157 times.
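The counts reported above can be reproduced from the qid1 and qid2 columns alone. The following is a small sketch on toy data; the function name `question_stats` and the sample pairs are hypothetical, not taken from the notebook.

```python
from collections import Counter

def question_stats(qid_pairs):
    """Count how often each question id occurs across both columns."""
    counts = Counter()
    for qid1, qid2 in qid_pairs:
        counts[qid1] += 1
        counts[qid2] += 1
    n_unique = len(counts)
    n_repeated = sum(1 for c in counts.values() if c > 1)
    return {
        "unique_questions": n_unique,
        "repeated_questions": n_repeated,
        "repeated_pct": round(100.0 * n_repeated / n_unique, 1),
        "max_occurrences": max(counts.values()),
    }

# Toy (qid1, qid2) pairs: question 1 occurs three times, question 2 twice.
pairs = [(1, 2), (1, 3), (1, 4), (2, 5)]
print(question_stats(pairs))
```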
-
- Basic Features - Extracted some simple features before cleaning the data as below.
- freq_qid1 = Frequency of qid1's
- freq_qid2 = Frequency of qid2's
- q1len = Length of q1
- q2len = Length of q2
- q1_n_words = Number of words in Question 1
- q2_n_words = Number of words in Question 2
- word_Common = (Number of common unique words in Question 1 and Question 2)
- word_Total = (Total num of words in Question 1 + Total num of words in Question 2)
- word_share = (word_common)/(word_Total)
- freq_q1+freq_q2 = sum total of frequency of qid1 and qid2
- freq_q1-freq_q2 = absolute difference of frequency of qid1 and qid2
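The basic features listed above could be computed roughly as follows. This is a sketch under assumptions: words are obtained by lowercasing and splitting on whitespace, word_Total counts all words (as the list states), and the helper names `word_features` and `frequency_features` are made up for the example.

```python
from collections import Counter

def word_features(q1, q2):
    """Length and word-overlap features for one question pair."""
    w1 = q1.lower().split()
    w2 = q2.lower().split()
    common = len(set(w1) & set(w2))  # common unique words
    total = len(w1) + len(w2)        # total words in both questions
    return {
        "q1len": len(q1),
        "q2len": len(q2),
        "q1_n_words": len(w1),
        "q2_n_words": len(w2),
        "word_Common": common,
        "word_Total": total,
        "word_share": common / total if total else 0.0,
    }

def frequency_features(qid1, qid2, qid_counts):
    """Frequency features from a precomputed Counter of question ids."""
    f1, f2 = qid_counts[qid1], qid_counts[qid2]
    return {
        "freq_qid1": f1,
        "freq_qid2": f2,
        "freq_q1+freq_q2": f1 + f2,
        "freq_q1-freq_q2": abs(f1 - f2),
    }

print(word_features("how do I learn python", "how do I learn java"))
print(frequency_features(1, 2, Counter({1: 3, 2: 1})))
```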
- Before creating the advanced features, I did some preprocessing on the text data.
- Function to compute the features: it takes two parameters, Question 1 and Question 2.
- Before looking at the advanced features, we need to define some terms used in the feature set below.
- Definition of terms:
- Token: You get a token by splitting a sentence on spaces
- Stop_Word : stop words as per NLTK.
- Word : A token that is not a stop_word
- Features:
-
cwc_min : Ratio of common_word_count to the smaller word count of Q1 and Q2
cwc_min = common_word_count / min(len(q1_words), len(q2_words))
-
cwc_max : Ratio of common_word_count to the larger word count of Q1 and Q2
cwc_max = common_word_count / max(len(q1_words), len(q2_words))
-
csc_min : Ratio of common_stop_count to the smaller stop-word count of Q1 and Q2
csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))
-
csc_max : Ratio of common_stop_count to the larger stop-word count of Q1 and Q2
csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))
-
ctc_min : Ratio of common_token_count to the smaller token count of Q1 and Q2
ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))
-
ctc_max : Ratio of common_token_count to the larger token count of Q1 and Q2
ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))
-
last_word_eq : Check if last word of both questions is equal or not
last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])
-
first_word_eq : Check if First word of both questions is equal or not
first_word_eq = int(q1_tokens[0] == q2_tokens[0])
-
abs_len_diff : Abs. length difference
abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
-
mean_len : Average Token Length of both Questions
mean_len = (len(q1_tokens) + len(q2_tokens))/2
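The token-based features above can be put together in one function. This is a sketch, not the notebook's code. It assumes whitespace tokenization, a tiny stand-in stop-word set instead of NLTK's list, and a small constant `SAFE_DIV` added to each denominator to avoid division by zero. All three are assumptions made for this example.

```python
SAFE_DIV = 1e-4  # assumption: small constant that prevents division by zero

# Assumption: a tiny stand-in stop-word set; the project uses NLTK's list.
STOP_WORDS = {"what", "is", "the", "of", "a", "an", "in", "to", "how", "do", "i"}

def token_features(q1, q2):
    """Compute the token-based features for one question pair."""
    q1_tokens = q1.lower().split()
    q2_tokens = q2.lower().split()
    if not q1_tokens or not q2_tokens:
        return None  # nothing to compare
    q1_words = {t for t in q1_tokens if t not in STOP_WORDS}
    q2_words = {t for t in q2_tokens if t not in STOP_WORDS}
    q1_stops = {t for t in q1_tokens if t in STOP_WORDS}
    q2_stops = {t for t in q2_tokens if t in STOP_WORDS}
    common_word_count = len(q1_words & q2_words)
    common_stop_count = len(q1_stops & q2_stops)
    common_token_count = len(set(q1_tokens) & set(q2_tokens))
    return {
        "cwc_min": common_word_count / (min(len(q1_words), len(q2_words)) + SAFE_DIV),
        "cwc_max": common_word_count / (max(len(q1_words), len(q2_words)) + SAFE_DIV),
        "csc_min": common_stop_count / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV),
        "csc_max": common_stop_count / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV),
        "ctc_min": common_token_count / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),
        "ctc_max": common_token_count / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),
        "last_word_eq": int(q1_tokens[-1] == q2_tokens[-1]),
        "first_word_eq": int(q1_tokens[0] == q2_tokens[0]),
        "abs_len_diff": abs(len(q1_tokens) - len(q2_tokens)),
        "mean_len": (len(q1_tokens) + len(q2_tokens)) / 2,
    }

print(token_features("what is the capital of france", "what is the capital of germany"))
```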
-
Levenshtein Distance: Levenshtein distance measures the difference between two text sequences as the number of single-character edits (insertions, deletions, and substitutions) it takes to change one sequence into the other. It is also known as “edit distance”. The Python library fuzzywuzzy can be used to compute the following:
-
fuzz_ratio : This computes the similarity between two word-sequences (in this case, the two questions) using the simple edit distance between them.
fuzz.ratio("YANKEES", "NEW YORK YANKEES") ⇒ 60
fuzz.ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 75
Reference: https://github.com/seatgeek/fuzzywuzzy#usage http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
-
fuzz_partial_ratio : This improves on the simple ratio method above using a heuristic called “best partial,” which is useful when the two sequences are of noticeably different lengths. If the shorter sequence is length m, the simple ratio score of the best matching substring of length m is taken into account.
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 69
Reference: https://github.com/seatgeek/fuzzywuzzy#usage http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
-
token_sort_ratio : This involves tokenizing each sequence, sorting the tokens alphabetically, and then joining them back. These new sequences are then compared using the simple ratio method.
fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") ⇒100
Reference: https://github.com/seatgeek/fuzzywuzzy#usage http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
-
token_set_ratio : This involves tokenizing both the sequences and splitting the tokens into three groups: the intersection component common to both sequences and the two remaining components from each sequence. The scores increase when the intersection component makes up a larger percentage of the full sequence. The score also increases with the similarity of the two remaining components.
t0 = "angels mariners"
t1 = "angels mariners vs"
t2 = "angels mariners anaheim angeles at los of seattle"
fuzz.ratio(t0, t1) ⇒ 90
fuzz.ratio(t0, t2) ⇒ 46
fuzz.ratio(t1, t2) ⇒ 50
fuzz.token_set_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners") ⇒ 90
Reference: https://github.com/seatgeek/fuzzywuzzy#usage http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
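To make the idea of edit distance concrete, here is a plain dynamic-programming sketch of Levenshtein distance, together with the token-sorting step that token_sort_ratio applies before comparing. This is an illustration only and does not reproduce fuzzywuzzy's scoring; the function names `levenshtein` and `sort_tokens` are made up for the example.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, and substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

def sort_tokens(text):
    """Lowercase, split on whitespace, sort the tokens, and join them back."""
    return " ".join(sorted(text.lower().split()))

print(levenshtein("kitten", "sitting"))  # 3
print(sort_tokens("New York Mets vs Atlanta Braves"))
```

After sorting, "New York Mets vs Atlanta Braves" and "Atlanta Braves vs New York Mets" become the same string, which is why token_sort_ratio scores them as identical.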
-
-
longest_substr_ratio : Ratio of the length of the longest common substring to the smaller token count of Q1 and Q2
longest_substr_ratio = len(longest common substring) / min(len(q1_tokens), len(q2_tokens))
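A possible way to compute this ratio is sketched below. The text does not say whether the common substring is measured in characters or in tokens; this sketch assumes tokens (the longest run of consecutive tokens shared by both questions), so the ratio stays between 0 and 1. The helper name `longest_common_run` is hypothetical.

```python
def longest_common_run(q1_tokens, q2_tokens):
    """Length of the longest run of consecutive tokens shared by both lists."""
    best = 0
    previous = [0] * (len(q2_tokens) + 1)
    for t1 in q1_tokens:
        current = [0] * (len(q2_tokens) + 1)
        for j, t2 in enumerate(q2_tokens, start=1):
            if t1 == t2:
                # Extend the run that ended at the previous token of both lists.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best

def longest_substr_ratio(q1, q2):
    """Longest common token run divided by the smaller token count."""
    t1, t2 = q1.lower().split(), q2.lower().split()
    if not t1 or not t2:
        return 0.0
    return longest_common_run(t1, t2) / min(len(t1), len(t2))

print(longest_substr_ratio("what is the capital of france", "tell me what is the capital"))
```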
-
4. Featuring text data with tf-idf weighted word-vectors (With 2 parameters of Question1 and Question2)
- Extracted TF-IDF features from the combined question1 and question2 text of the train data.
- After computing the TF-IDF scores, we convert each question to a weighted average of its word vectors, using these scores as weights.
- Here I use a pre-trained GloVe model that comes free with spaCy. https://spacy.io/usage/vectors-similarity
- It is trained on Wikipedia and is therefore stronger in terms of word semantics.
- Note: When reviewing this part of the code, you may wonder why I pass the directory of the pre-trained GloVe embedding model directly to the spacy.load function. This is because, due to some issue, I was unable to load the downloaded model by name.
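The weighting step can be illustrated with toy data. In this sketch, `word_vectors` and `idf` are hypothetical dictionaries standing in for the spaCy vectors and the fitted IDF scores; the function name and the two-dimensional vectors are made up for the example.

```python
def weighted_question_vector(question, word_vectors, idf, dim):
    """IDF-weighted average of the vectors of the words in a question."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word in question.lower().split():
        if word not in word_vectors or word not in idf:
            continue  # skip words without a vector or an IDF score
        weight = idf[word]
        for k, value in enumerate(word_vectors[word]):
            total[k] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    return [x / weight_sum for x in total]

# Toy 2-dimensional vectors and IDF scores (not real embeddings).
word_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
idf = {"cat": 1.0, "dog": 3.0}
print(weighted_question_vector("cat dog", word_vectors, idf, dim=2))  # [0.25, 0.75]
```

Rare words have a higher IDF, so they pull the question vector toward their own direction more than common words do.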
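- Performing simple TF-IDF vectorization on the columns 'question1' and 'question2'.
from sklearn.feature_extraction.text import TfidfVectorizer
q1 = data['question1'].values.astype('U')
q2 = data['question2'].values.astype('U')
# Fit once on both columns so that ques1 and ques2 share one vocabulary.
vectorizer = TfidfVectorizer().fit(list(q1) + list(q2))
ques1, ques2 = vectorizer.transform(q1), vectorizer.transform(q2)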
6. Word2Vec Feature: Distance Features and Gensim’s WmdSimilarity Features (To use WMD, we first need some word embeddings. Download the GoogleNews-vectors-negative300.bin.gz pre-trained embeddings (warning: 1.5 GB))
-
Word embeddings such as Word2Vec are a key AI method that bridges human understanding of language and that of a machine, and they are essential to solving many NLP problems. Here we discuss applications of Word2Vec to question analysis.
-
Word2Vec feature:
-
Multi-dimensional vector for all the words in any dictionary
-
Always great insights
-
Very popular in natural language processing tasks
-
Google news vectors 300d (Pre trained embedding)
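import numpy as np
import gensim
from nltk import word_tokenize
from nltk.corpus import stopwords

# Assumed definition: NLTK's English stop-word list.
stop_words = set(stopwords.words('english'))

# Load the pre-trained Google News vectors (300 dimensions).
model = gensim.models.KeyedVectors.load_word2vec_format('Data/GoogleNews-vectors-negative300.bin.gz', binary=True)

def sent2vec(s):
    """Return the L2-normalized sum of the word vectors of sentence s."""
    words = word_tokenize(str(s).lower())
    # Keep alphabetic tokens that are not stop words.
    words = [w for w in words if w not in stop_words and w.isalpha()]
    M = []
    for w in words:
        try:
            M.append(model[w])
        except KeyError:
            # Skip words that are missing from the embedding vocabulary.
            continue
    M = np.array(M)
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())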
-
-
Having computed the Word2Vec vectors, it is now time to create the distance features.
-
The similarity between questions can be computed using word-to-word (pairwise) distances, which are weighted with Word2Vec.
-
Pairwise Distances: We can compute the pairwise distance for each pair of words by taking one word from question 1 and the other word from question 2. Several pairwise distance metrics can be used as features, including WMD_distance, norm_wmd distance, Cityblock_distance, Bray-Curtis_distance, Cosine_distance, Canberra_distance, Euclidean_distance, Minkowski_distance, and jaccard_distance.
- Reference Link to Understand WMD and Norm_Wmd distance:- https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html
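Several of the vector distances named above are simple enough to write out by hand. The sketch below uses plain Python lists and is meant only to show what each metric measures; it is not the project's implementation, and it leaves out WMD, which needs the word embeddings. Treating a zero vector as maximally distant in `cosine_distance` is a convention chosen for this example.

```python
import math

def cityblock(u, v):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def minkowski(u, v, p):
    """Generalization of cityblock (p=1) and euclidean (p=2)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def canberra(u, v):
    """Sum of |a - b| / (|a| + |b|), skipping terms where both are zero."""
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a != 0 or b != 0)

def braycurtis(u, v):
    """Sum of |a - b| divided by sum of |a + b|."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(abs(a + b) for a, b in zip(u, v))

u, v = [1.0, 2.0], [4.0, 6.0]
print(cityblock(u, v), euclidean(u, v), cosine_distance(u, v))
```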
- Logistic regression is a linear model for classification. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. The logistic function is a sigmoid function, which takes any real input and outputs a value between 0 and 1, and hence is ideal for classification.
When a model learns the training data too closely, it fails to fit new data or predict unseen observations reliably. This condition is called overfitting and is countered, in one of many ways, with ridge (L2) regularization. Ridge regularization penalizes model coefficients that are too large, forcing them to stay small. This reduces model variance and helps avoid overfitting.
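The logistic function and the ridge penalty described above can be written down directly. This is a minimal sketch of the objective only (no training loop); the function names and the parameter `lam` for the regularization strength are chosen for this example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # numerically stable branch for negative z
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    """Probability of the positive class for one feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def l2_regularized_log_loss(weights, bias, X, y, lam, eps=1e-15):
    """Mean log-loss plus lam times the sum of squared weights (ridge penalty)."""
    data_loss = 0.0
    for x, label in zip(X, y):
        p = min(max(predict_proba(weights, bias, x), eps), 1 - eps)
        data_loss -= label * math.log(p) + (1 - label) * math.log(1 - p)
    data_loss /= len(y)
    penalty = lam * sum(w * w for w in weights)
    return data_loss + penalty

X, y = [[1.0, 2.0], [2.0, 0.5]], [1, 0]
print(l2_regularized_log_loss([0.0, 0.0], 0.0, X, y, lam=0.1))
print(l2_regularized_log_loss([1.0, 1.0], 0.0, X, y, lam=0.1))
```

Larger weights raise the penalty term, so the optimizer has to trade a better fit against smaller coefficients.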
Hyperparameter Tuning:
Cross-validation is a good technique for tuning model parameters such as the regularization factor and the tolerance for the stopping criterion (which determines when to stop training). In each run (called a fold), a validation set is held out from the training data, the model is trained on the remaining training data, and it is then evaluated on the validation set. This is repeated for the total number of folds (say 5 or 10), the scores are averaged over the folds, and the parameter values with the best average score are used as the optimum parameters for the model.
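The selection loop can be sketched as follows. The callback `fit_and_score` is hypothetical: it stands for "train a model with this parameter on the training indices and return its log-loss on the validation indices". The folds here are contiguous blocks without shuffling, which is a simplification.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of (train, validation) indices."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, val))
        start += size
    return folds

def select_by_cross_validation(candidates, n_samples, k, fit_and_score):
    """Return the candidate with the lowest mean validation score over k folds."""
    best_param, best_mean = None, float("inf")
    for param in candidates:
        scores = [fit_and_score(param, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score < best_mean:
            best_param, best_mean = param, mean_score
    return best_param, best_mean

# Toy scorer that pretends a regularization factor of 0.1 is ideal.
toy_scorer = lambda param, train, val: (param - 0.1) ** 2
print(select_by_cross_validation([0.01, 0.1, 1.0], n_samples=10, k=3, fit_and_score=toy_scorer))
```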
- Linear SVM is a support vector machine with a linear kernel. It is an extremely fast machine learning (data mining) algorithm for solving multiclass classification problems on very large data sets, and some implementations use a cutting-plane algorithm to train it. Linear SVM scales linearly, meaning that it creates an SVM model in a CPU time that grows linearly with the size of the training data set.
- Features
- Efficiency in dealing with extra-large data sets (say, several millions training data pairs)
- Solution of multiclass classification problems with any number of classes
- Working with high dimensional data (thousands of features, attributes) in both sparse and dense format
- No need for expensive computing resources (personal computer is a standard platform)
- XGBoost stands for eXtreme Gradient Boosting. Gradient boosting is an approach in which new models are trained to predict the errors made by the existing models, and models are added until no further improvement can be made.
- There are two main reasons for using XGBoost:
- Execution speed
- Model performance
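The idea of "predicting the errors of existing models" can be shown with a toy example. The sketch below is not XGBoost: it is a bare-bones gradient boosting loop for squared error on a single feature, using one-split decision stumps as the weak models. All names and the toy data are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:
            continue  # splitting at the largest value leaves the right side empty
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [p + learning_rate * (left_value if x <= threshold else right_value)
                       for x, p in zip(xs, predictions)]
    return base, stumps, predictions

def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 2.0, 2.0, 5.0, 5.0]
base, stumps, predictions = boost(xs, ys)
print(mean_squared_error(ys, [base] * len(ys)), mean_squared_error(ys, predictions))
```

Each round corrects part of what the previous rounds got wrong, so the training error shrinks as models are added.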
- The table below compares the test log-loss scores of all the ML models.
- I did not use the full train data to train my algorithms. Because of the RAM constraints of my PC, I sampled some data and trained my models on it. Below are the models and their test log-loss scores.
- In the table below, Sim Fs = simple (basic) feature set, and Adv Fs = advanced feature set.
DataSet Size | Model Name | Features | Hyperparameter Tuning | Test Log Loss |
---|---|---|---|---|
~ 404K | Random | Sim Fs+Adv Fs+TFIDF Weighted W2V | NA | 0.88 |
~ 404K | Logistic Regression | Sim Fs+Adv Fs+TFIDF Weighted W2V | Done | 0.42 |
~ 404K | Linear SVM | Sim Fs+Adv Fs+TFIDF Weighted W2V | Done | 0.45 |
~ 404K | XGBoost | Sim Fs+Adv Fs+TFIDF Weighted W2V | NA | 0.35 |
~ 100K | XGBoost | Sim Fs+Adv Fs+TFIDF Weighted W2V | Done | 0.33 |
---------- | ---------- | -------- | ------ | -------- |
~ 202K | Random | Sim Fs+Adv Fs+TFIDF Simple | NA | 0.88 |
~ 202K | Logistic Regression | Sim Fs+Adv Fs+TFIDF Simple | Done | 0.39 |
~ 202K | Linear SVM | Sim Fs+Adv Fs+TFIDF Simple | Done | 0.43 |
~ 202K | XGBoost | Sim Fs+Adv Fs+TFIDF Simple | Done | 0.31 |
---------- | ---------- | -------- | ------ | -------- |
~ 202K | Random | Sim Fs+Adv Fs+Word2Vec Features | NA | 0.88 |
~ 202K | Logistic Regression | Sim Fs+Adv Fs+Word2Vec Features | Done | 0.40 |
~ 202K | Linear SVM | Sim Fs+Adv Fs+Word2Vec Features | Done | 0.41 |
~ 202K | XGBoost | Sim Fs+Adv Fs+Word2Vec Features | Done | 0.33 |
- We can see that as the dimension increases (it increases with TFIDF Simple), Logistic Regression and XGBoost start to perform well, whereas Linear SVM produces its best results with Sim Fs + Adv Fs + Word2Vec Features.
This project is divided into five parts:
-
Part one: EDA, the basic feature set (FS1), preprocessing of the text data, the advanced feature set with fuzzy features (FS2), text featurization with TF-IDF weighted word vectors (FS3), and ML models (random model, Logistic Regression with hyperparameter tuning, and Linear SVM with hyperparameter tuning).
-
Part two: Training XGBoost with hyperparameter tuning using FS1 + FS2 + FS3.
-
Part three: A simple TF-IDF vectorizer (FS4) and training ML models (Logistic Regression with hyperparameter tuning, Linear SVM with hyperparameter tuning, and XGBoost with hyperparameter tuning) using FS1 + FS2 + FS4.
-
Part four: Distance features and Gensim’s WmdSimilarity features (FS5) and training ML models (Logistic Regression with hyperparameter tuning, Linear SVM with hyperparameter tuning, and XGBoost with hyperparameter tuning) using FS1 + FS2 + FS5.
-
Part five: Model comparison and conclusion.
The code is written in Python 3.7. If you don't have Python installed, you can find it here. If you are using an older version of Python, upgrade it first, and make sure you have the latest version of pip.
-
All the code for the article is available in this repository.
How To
-
Install Required Libraries
pip3 install pandas
pip3 install numpy
pip3 install scikit-learn
pip3 install nltk
pip3 install tqdm
pip3 install pyemd
pip3 install fuzzywuzzy
pip3 install python-levenshtein
pip3 install --upgrade gensim
-
Download the required pre-trained word vectors
wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
-