`syn` gives antonyms as well. #190
The problem lies in the
Here's the result for the word `outstanding` in the Reverso Online Dictionary: http://dictionary.reverso.net/english-synonyms/outstanding and the page source for scraping purposes: view-source:http://dictionary.reverso.net/english-synonyms/outstanding
Here's the scraping script I used previously; it likely needs to be modified to eliminate the antonym tag:
library(RCurl)
library(XML)
library(parallel)
library(qdap)
load("C:/Users/trinker/Dropbox/Public/LIST.RData") # the seed list
head(LIST)
# Parsing and counting functions:
term.count <- qdap:::term.count
# Scraping function:
FUN <- function(x){
    url1 <- "http://dictionary.reverso.net/english-synonyms/"
    url2 <- x
    doc <- htmlTreeParse(paste0(url1, url2), useInternalNodes = TRUE)
    # Bail out if the first matched text node is not the requested word
    ncontent2 <- getNodeSet(doc, "//span[@direction='']//text()")[[1]]
    if (xmlToList(ncontent2) != x) {
        return("***XX")
    }
    # Collect the target-side text nodes, minus the ellipsis text
    content <- getNodeSet(doc, "//span[@direction='target']//text()")
    ncontent <- getNodeSet(doc, "//span[@class='ellipsis_text']//text()")
    content <- content[!unlist(content) %in% unlist(ncontent)]
    # getNodeSet returns a list, so test for emptiness rather than NULL
    if (length(content) == 0) return(NA)
    x <- lapply(content, function(x) Trim(xmlToList(x)))
    x <- x[!sapply(x, function(y) y == "")]
    # Keep entries that look like comma-separated word lists
    words <- unlist(lapply(x, function(x) length(unlist(strsplit(x, "\\s+")))))
    commas <- sapply(x, function(x) term.count(x, ","), USE.NAMES = FALSE)
    ctw <- commas/words
    ctw[words < 3] <- 1
    if (sum(ctw > .25) == 0) return("***XX")
    y <- x[ctw > .25]
    if (length(y) == 1 && y[[1]] == "") return("***XX")
    # Number each group and join the groups with " @@@@ "
    paste(paste("[", seq_along(y), "]", y, sep = ""), collapse = " @@@@ ")
}
# Parallel processing the scrape
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
clusterExport(cl = cl, varlist = c("LIST", "Trim", "FUN", "term.count", "htmlTreeParse",
    "getNodeSet", "xmlToList"), envir = environment())
L1 <- parLapply(cl, LIST, function(x) {
    Sys.sleep(.75)
    try(FUN(x))
})
stopCluster(cl) # stop the cluster
names(L1) <- LIST
The
This makes parsing more difficult. A possibility is to split right away on
can give some examples of what I meant - the last result for each word seems to be the antonyms instead of synonyms.
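If the observation holds that the last result for each word is the antonym group, one possible stopgap is to drop that group after scraping. The sketch below is only an illustration under that assumption: `drop_last_group` is a hypothetical helper (not part of qdap), the sample string is made up rather than real scraper output, and the `" @@@@ "` separator and `"***XX"` failure marker are taken from the scraping script above.

```r
# Hypothetical helper (not part of qdap): drop the last numbered group from
# one scraped entry, ASSUMING that the last group holds the antonyms.
drop_last_group <- function(entry, sep = " @@@@ ") {
    # Leave missing values and the script's failure marker untouched
    if (is.na(entry) || identical(entry, "***XX")) return(entry)
    groups <- strsplit(entry, sep, fixed = TRUE)[[1]]
    # With a single group there is nothing to tell apart, so keep it
    if (length(groups) < 2) return(entry)
    paste(head(groups, -1), collapse = sep)
}

# Made-up sample in the format produced by the scraping script
example <- "[1]excellent, superb @@@@ [2]unpaid, due @@@@ [3]ordinary, mediocre"
cleaned <- drop_last_group(example)
```

This would also strip a legitimate synonym group from any word whose page has no antonyms, so filtering on the antonym tag in the page source is probably the more reliable fix.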