syn gives antonyms as well. #190

Closed
trinker opened this issue Jun 30, 2014 · 3 comments

Comments

Owner

trinker commented Jun 30, 2014

I have a question about the function "syn". It sometimes returns antonyms of a word as the last element of the result list, and there is no way to tell when that happens without inspecting the results. I am trying to automate a mapping between two lists of words using synonyms, and this might introduce bias. Is there a way to work around this? -Jingjing Zou-

syn(c("outstanding", "memorable", "hilarious", "relish", "excellent", 
   "fantastic", "brisk", "perfectly", "offbeat"))   

This call gives some examples of what I mean: the last result for each word appears to be antonyms rather than synonyms.

Owner Author

trinker commented Jun 30, 2014

The problem lies in the qdapDictionaries synonyms data frame that syn uses: qdapDictionaries::key.syn. Fixing it will require a re-scrape and reformatting.

Here's the result for the word "outstanding" in the Reverso Online Dictionary:

http://dictionary.reverso.net/english-synonyms/outstanding

and the source for scraping purposes:

view-source:http://dictionary.reverso.net/english-synonyms/outstanding

Owner Author

trinker commented Jun 30, 2014

Here's the scraping script I used previously. It likely needs to be modified to exclude the antonym entries:

library(RCurl)
library(XML)
library(parallel)
library(qdap)
load("C:/Users/trinker/Dropbox/Public/LIST.RData") #the seed list
head(LIST)

#Parsing and counting functions:
term.count <- qdap:::term.count

#Scraping function:
FUN <- function(x){

    url1 <- "http://dictionary.reverso.net/english-synonyms/"
    url2 <- x
    doc <- htmlTreeParse(paste0(url1, url2), useInternalNodes = TRUE)
    ncontent2 <- getNodeSet(doc, "//span[@direction='']//text()")[[1]]
    if(xmlToList(ncontent2) != x) {
        return("***XX")
    }

    content <- getNodeSet(doc, "//span[@direction='target']//text()")
    ncontent <- getNodeSet(doc, "//span[@class='ellipsis_text']//text()")
    content <- content[!unlist(content) %in% unlist(ncontent)]

    #getNodeSet returns an empty list (not NULL) when nothing matches
    if (length(content) == 0) return(NA)

    x <- lapply(content, function(x) Trim(xmlToList(x)))
    x <- x[!sapply(x, function(y) y=="")]
    words <- unlist(lapply(x, function(x) length(unlist(strsplit(x, "\\s+")))))
    commas <- sapply(x, function(x) term.count(x, ","), USE.NAMES=FALSE)
    ctw <- commas/words
    ctw[words < 3] <- 1
    if (sum(ctw > .25) == 0) return("***XX")

    y <- x[ctw > .25]
    if (length(y) == 1 && y[[1]] == "") return("***XX")

    paste(paste("[", seq_len(length(y)), "]", y, sep = "") , collapse = " @@@@ ")
}


#parallel processing the scrape
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
clusterExport(cl=cl, varlist=c("LIST", "Trim", "FUN", "term.count", "htmlTreeParse",
    "getNodeSet", "xmlToList"), envir=environment())

L1 <- parLapply(cl, LIST, function(x) {
    Sys.sleep(.75)
    try(FUN(x))
})

stopCluster(cl) #stop the cluster

names(L1) <- LIST

Owner Author

trinker commented Jun 30, 2014

The antonym marker doesn't appear to be a tag of its own but header text inside a span:

...d="ID0ETD" style="color:#0;" direction="">Antonyms<span...

This makes parsing more difficult. One possibility is to split the raw page source on Antonyms<span right away, keep only the first piece, and parse that.
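
A minimal sketch of that split in base R, not the fix that was eventually applied. The helper name drop_antonyms and the toy input are hypothetical; the toy string only imitates the header shown above and is not real Reverso markup. It assumes the page source has already been read into a single character string.

```r
# Hypothetical helper: keep only the page source that precedes the
# "Antonyms<span" header, so later parsing never sees the antonym entries.
drop_antonyms <- function(page_source, marker = "Antonyms<span") {
    # fixed = TRUE treats the marker as a literal string, not a regex
    pieces <- strsplit(page_source, marker, fixed = TRUE)[[1]]
    # The first piece is everything before the first marker; if the marker
    # is absent, the whole string comes back unchanged
    pieces[1]
}

# Toy input standing in for a downloaded page (an assumption, not real markup)
toy <- paste0(
    "<span direction=\"target\">exceptional, superb</span>",
    "<span direction=\"\">Antonyms<span direction=\"target\">ordinary, dull</span>"
)

kept <- drop_antonyms(toy)
grepl("superb", kept, fixed = TRUE)    # synonym text is still present
grepl("ordinary", kept, fixed = TRUE)  # antonym text has been cut off
```

The truncated string could then be handed to htmlTreeParse(kept, asText = TRUE, useInternalNodes = TRUE) in place of the URL. The cut leaves an unclosed span at the end, so this relies on the HTML parser tolerating a malformed tail.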
