"Diagnosing Hate Speech Classification: Where Do Humans and Machines Disagree, and Why?" [Abstract] This study uses the cosine similarity ratio, embedding regression, and manual re-annotation to diagnose hate speech classification. We use a dataset Measuring Hate Speech that contains 135,556 annotated comments on social media. We begin with exploring the inconsistency of human annotation from the dataset. Using embedding regression as a basic diagnostic, we found that female annotators are more sensitive to racial slurs that target the black population. Examples like this illustrate that human evaluations of hate speech content vary based on their identity. We perform a more complicated diagnostic by training a hate speech classifier using a the best-performing pre-trained large language model(LLM) NV-Embed-v2 on the MTEB benchmark (ranked No.1 as of Aug 30, 2024).We convert texts to embeddings and run a logistic regression, and our classifier achieves a testing accuracy of 94%. Using manual re-annotation, we find that machines make fewer mistakes than humans despite the fact that human annotations are treated as ground truth in the training set. Pairing manual annotation and cosine similarity ratio, we find that machines perform better than humans in correctly labeling long statements but perform worse in labeling short and obvious instances of racial slurs. We hypothesize that Aristotle's virtue as a mean and lesser evil as good may play a role in restricting the latest models’ ability to detect obvious malicious content. Humans may, for their own moral restrictions and fear of risks, curate data or align the model as their means to choose the "lesser evil." This consideration is important for future studies on the risks of AI.
xiliny/research_presentation