
how to dedup short text? #103

Open

varuy322 opened this issue Oct 14, 2024 · 1 comment

@varuy322

hi there,

when I use MinHash with LSH, or SimHash, it's hard to remove near-duplicate short texts. Could anybody provide a useful method to solve this problem? Thanks a ton!

Take the example below and dive into the process:

  1. With ngram_size set to 5, the Jaccard similarity is 0.53; with an n-gram size of 13, the similarity is 0.21.

  2. With a threshold of 0.7, both texts below will be kept.

text_1 = "世界经济史是一部基于假象和谎言的连续剧。要获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。\nby某个名字都不能说的人。"
text_2 = "世界经济史是一部基于假象和谎言的连续剧。要想获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。——索罗斯"

@ChenghaoMou
Owner

For Chinese text, unlike English where words are separated by spaces, I think it is safe to use a lower n-gram size such as 2 or 3. (Modify the n-gram code if necessary; e.g. you could also segment words with jieba instead.)
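As an illustrative sketch of the jieba option (the `tokenize` and `word_ngrams` helpers below are hypothetical, not part of this project; it assumes the jieba package is installed and falls back to character tokens if it is not):

```python
# Sketch: word-level n-grams for Chinese text via jieba segmentation.
# jieba.lcut returns a list of segmented words; if jieba is not
# installed, fall back to treating each character as a token.
try:
    import jieba

    def tokenize(text):
        return jieba.lcut(text)
except ImportError:
    def tokenize(text):
        return list(text)

def word_ngrams(text, n=2):
    """Build the set of word-level n-grams (as tuples) for one text."""
    tokens = [t for t in tokenize(text) if t.strip()]
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
```

These word-level shingles can then be fed into MinHash in place of character n-grams; with real words as units, a smaller `n` usually carries enough signal.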

With the following test code:

def jaccard_similarity_ngrams(str1, str2, n):
    """
    Calculate Jaccard similarity between two strings based on n-grams.
    
    Args:
    str1 (str): First input string
    str2 (str): Second input string
    n (int): Size of n-grams
    
    Returns:
    float: Jaccard similarity score between 0 and 1
    """
    # Convert strings to lowercase and remove non-alphanumeric characters
    str1 = [char.lower() for char in str1 if char.isalnum()]
    str2 = [char.lower() for char in str2 if char.isalnum()]
    
    # Generate n-grams for both strings
    ngrams1 = set(''.join(str1[i:i+n]) for i in range(len(str1) - n + 1))
    ngrams2 = set(''.join(str2[i:i+n]) for i in range(len(str2) - n + 1))
    
    print(ngrams1)
    print(ngrams2)
    # Calculate intersection and union of n-grams
    intersection = ngrams1.intersection(ngrams2)
    union = ngrams1.union(ngrams2)
    
    # Calculate Jaccard similarity
    if len(union) == 0:
        return 0.0  # Handle empty sets
    
    similarity = len(intersection) / len(union)
    return similarity

# Example usage
text_1 = "世界经济史是一部基于假象和谎言的连续剧。要获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。\nby某个名字都不能说的人。"
text_2 = "世界经济史是一部基于假象和谎言的连续剧。要想获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。——索罗斯"
n = 2  # Using bigrams

similarity = jaccard_similarity_ngrams(text_1, text_2, n)
print(f"Jaccard similarity: {similarity:.4f}")

You will get 0.7313 with bigrams, and 0.7101 if choosing 3. Short text has always been tricky to process; it's better to tune the parameters/settings on a sample set before running the de-duplication on the entire dataset.
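To make that tuning concrete, here is a minimal sketch that sweeps the n-gram size on the pair above and reports whether a 0.7 threshold would flag them as duplicates (it re-implements the character n-gram Jaccard from the snippet above so the script is self-contained):

```python
# Sketch: sweep n-gram sizes to see how Jaccard similarity, and thus the
# keep/drop decision at a fixed threshold, changes for one text pair.
def jaccard_char_ngrams(str1, str2, n):
    # Keep only alphanumeric characters, lowercased, as in the snippet above.
    s1 = [c.lower() for c in str1 if c.isalnum()]
    s2 = [c.lower() for c in str2 if c.isalnum()]
    g1 = {''.join(s1[i:i + n]) for i in range(len(s1) - n + 1)}
    g2 = {''.join(s2[i:i + n]) for i in range(len(s2) - n + 1)}
    union = g1 | g2
    if not union:
        return 0.0
    return len(g1 & g2) / len(union)

text_1 = "世界经济史是一部基于假象和谎言的连续剧。要获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。\nby某个名字都不能说的人。"
text_2 = "世界经济史是一部基于假象和谎言的连续剧。要想获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。——索罗斯"

THRESHOLD = 0.7
for n in (2, 3, 5, 13):
    sim = jaccard_char_ngrams(text_1, text_2, n)
    verdict = "flagged as duplicate" if sim >= THRESHOLD else "both kept"
    print(f"n={n:2d}  similarity={sim:.4f}  -> {verdict}")
```

The similarity falls as `n` grows, so the same 0.7 threshold catches this pair with small n-grams but misses it with large ones; running such a sweep on a labeled sample is one way to pick `n` and the threshold together.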
