I think that for Chinese text, where words are not separated by spaces as they are in English, it is safe to use a smaller n-gram size such as 2 or 3. (Modify the n-gram code if necessary; for example, you could tokenize with jieba instead.)
With the following test code:
def jaccard_similarity_ngrams(str1, str2, n):
    """
    Calculate Jaccard similarity between two strings based on n-grams.

    Args:
        str1 (str): First input string
        str2 (str): Second input string
        n (int): Size of n-grams

    Returns:
        float: Jaccard similarity score between 0 and 1
    """
    # Convert strings to lowercase and remove non-alphanumeric characters
    str1 = [char.lower() for char in str1 if char.isalnum()]
    str2 = [char.lower() for char in str2 if char.isalnum()]

    # Generate n-grams for both strings
    ngrams1 = set(''.join(str1[i:i+n]) for i in range(len(str1) - n + 1))
    ngrams2 = set(''.join(str2[i:i+n]) for i in range(len(str2) - n + 1))
    print(ngrams1)
    print(ngrams2)

    # Calculate intersection and union of n-grams
    intersection = ngrams1.intersection(ngrams2)
    union = ngrams1.union(ngrams2)

    # Calculate Jaccard similarity
    if len(union) == 0:
        return 0.0  # Handle empty sets
    similarity = len(intersection) / len(union)
    return similarity

# Example usage
text_1 = "世界经济史是一部基于假象和谎言的连续剧。要获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。\nby某个名字都不能说的人。"
text_2 = "世界经济史是一部基于假象和谎言的连续剧。要想获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。——索罗斯"
n = 2 # Using bigrams
similarity = jaccard_similarity_ngrams(text_1, text_2, n)
print(f"Jaccard similarity: {similarity:.4f}")
This gives 0.7313 with n = 2 and 0.7101 with n = 3. Short text has always been tricky to process, so it is better to tune the parameters and settings on a sample set before running de-duplication on the entire dataset.
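One way to do that tuning, as a rough sketch: score a small hand-labeled sample of pairs at several n-gram sizes and check which size and threshold separate duplicates from non-duplicates. The helper names (`char_ngrams`, `jaccard`, `sweep_ngram_sizes`) and the sample pairs are hypothetical and not part of any library.

```python
def char_ngrams(text, n):
    """Return the set of character n-grams, ignoring non-alphanumeric characters."""
    chars = [c.lower() for c in text if c.isalnum()]
    return {"".join(chars[i:i + n]) for i in range(len(chars) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sweep_ngram_sizes(pairs, sizes=(1, 2, 3, 5)):
    """For each n, return a list of (similarity, is_duplicate) for the labeled pairs.

    `pairs` is a list of (text_a, text_b, is_duplicate) tuples labeled by hand.
    """
    results = {}
    for n in sizes:
        results[n] = [
            (jaccard(char_ngrams(a, n), char_ngrams(b, n)), is_dup)
            for a, b, is_dup in pairs
        ]
    return results


# Hypothetical labeled sample; replace with pairs drawn from your own dataset.
sample_pairs = [
    ("今天天气很好，适合出门散步。", "今天天气很好，很适合出门散步。", True),
    ("今天天气很好，适合出门散步。", "明天我要去图书馆看书。", False),
]

for n, scored in sweep_ngram_sizes(sample_pairs).items():
    for score, is_dup in scored:
        print(f"n={n} duplicate={is_dup} similarity={score:.4f}")
```

If the scores of true duplicates and non-duplicates overlap at every n, the n-gram size alone is probably not enough and the threshold or tokenization needs adjusting as well.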
Hi there,
When I use MinHash with LSH, or SimHash, it is hard to deduplicate short text. Could anybody suggest a useful method to solve this problem? Thanks a ton!
Take the example below and walk through the process:
With ngram_size set to 5, the Jaccard similarity is 0.53; with an n-gram size of 13, the similarity is 0.21.
With a threshold of 0.7, both texts below will be kept.
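To see why such pairs slip through, it may help to look at the standard banding analysis for MinHash LSH: with b bands of r rows each, two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b. The sketch below evaluates that curve at the similarities mentioned in this thread. The function name and the band/row settings are assumptions chosen for illustration, not values taken from any particular tool.

```python
def lsh_candidate_probability(s, bands, rows):
    """Probability that two items with Jaccard similarity `s` share at least one LSH band."""
    return 1.0 - (1.0 - s ** rows) ** bands


# Assumed settings for illustration: 256 permutations split into 32 bands of 8 rows.
BANDS, ROWS = 32, 8

for s in (0.21, 0.53, 0.73):
    p = lsh_candidate_probability(s, BANDS, ROWS)
    print(f"similarity={s:.2f} -> candidate probability={p:.4f}")
```

Under these assumed settings the candidate probability rises steeply with similarity, so anything that raises the measured similarity of short near-duplicates (such as a smaller n-gram size) has a large effect on whether they are caught.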