Error when searching for a word that doesn't exist in the corpus #75

Closed
tra38 opened this issue Aug 12, 2016 · 9 comments

@tra38
Contributor

tra38 commented Aug 12, 2016

require 'classifier-reborn'
lsi = ClassifierReborn::LSI.new

# add strings to lsi that have nothing to do with dogs

lsi.search("dogs", 8)

/Users/tariqali/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/classifier-reborn-2.0.4/lib/classifier-reborn/lsi.rb:213:in `sort_by': comparison of Float with NaN failed (ArgumentError)
    from /Users/tariqali/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/classifier-reborn-2.0.4/lib/classifier-reborn/lsi.rb:213:in `proximity_norms_for_content'
    from /Users/tariqali/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/classifier-reborn-2.0.4/lib/classifier-reborn/lsi.rb:225:in `search'
    from chat.rb:24:in `<main>'

I'm assuming that somewhere in the code, we have a "0/0" that is being converted into a NaN. This error is avoidable so long as you don't search for terms that are absent from the corpus you're training the LSI on, but...well...what happens if I'm using a very large corpus? How am I supposed to know what words are (or are not) present?
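The exception in the backtrace can be reproduced without the gem. The following is a minimal plain-Ruby sketch, not the library's code: the `scores` array is made up for illustration, and it only assumes that a NaN ends up among the values being sorted.

```ruby
# Hypothetical similarity scores; one of them is NaN, as 0.0 / 0.0 would produce.
scores = [0.42, 0.0 / 0.0, 0.17]

# NaN is unordered: <=> returns nil instead of -1, 0, or 1.
comparison = (0.0 / 0.0) <=> 0.42

# sort_by needs every pair of keys to be comparable, so a NaN key makes it raise.
sort_error =
  begin
    scores.sort_by { |score| -score }
    nil
  rescue ArgumentError => e
    e
  end
```

Here `comparison` is `nil` and `sort_error` holds the rescued `ArgumentError`, which matches the class of error reported from `sort_by` in the backtrace.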

@Ch4s3
Member

Ch4s3 commented Aug 15, 2016

This is a known issue when using lsi without rb-gsl. There is a bug in our matrix SVD that I haven't been able to track down. In the meantime, using rb-gsl should fix the issue.

@tra38
Contributor Author

tra38 commented Aug 18, 2016

I was already using rb-gsl beforehand (which I needed to do since the pure-Ruby version was so slow). Using binding.pry, I determined that $GSL is set to true after I required the classifier-reborn library, yet I'm still getting the error that I pointed out above. I'll see if I can try to debug the issue...

@tra38
Contributor Author

tra38 commented Aug 19, 2016

So here's Day 1 of me trying to solve this problem (and writing down notes so that I remember when I come back to this problem again). When I call LSI#search, I first construct a ContentNode (using my search term). Here's an example of a ContentNode:

=> #<ClassifierReborn::ContentNode:0x007fafe415b468
 @categories=[],
 @lsi_norm=nil,
 @lsi_vector=nil,
 @raw_norm=GSL::Vector
[   nan   nan   nan   nan   nan   nan   nan ... ],
 @raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @word_hash={:cat=>1}>

After I did so, I then call ContentNode#proximity_norms_for_content, and that method checks the content node's @raw_norm, which is a GSL::Vector that is filled with NaNs. Obviously, doing operations on that vector is bound to cause an error.

The @raw_vector itself seems to look alright though (although it only gets used if I call ContentNode#proximity_array_for_content), and won't error out when I try to do Matrix operations with that vector. I guess a workaround is to just use ContentNode#proximity_array_for_content and accept the probably worse results. But that seems like a bad workaround. I'd be better off trying to see what is causing the @raw_norm to be corrupted. It's not using Ruby's standard library Matrix though...so at least we can confirm that GSL is being used on my machine. Ugh. I wish I had paid more attention to Calculus in college.

EDIT: I see what's corrupting the raw_norm. Normalize. After calculating the raw_vector, we then call normalize on that vector and save that result into @raw_norm. Except...

content_node.raw_vector
=> GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ]
content_node.raw_vector.normalize
=> GSL::Vector
[   nan   nan   nan   nan   nan   nan   nan ... ]

It seems that GSL::Vector#normalize is failing to normalize the raw_vector properly. I'm going to have to look at the GSL code to see why that might be the case. This rabbit hole is much deeper than I expected.

EDIT2: So I found out how to normalize a vector.

To normalize a vector
v1=(x0,y0,z0)
d=sqrt(x0 * x0 + y0 * y0 + z0 * z0)
x1=x0/d
y1=y0/d
z1=z0/d
this is your new normalized vector (x1,y1,z1)

Well, sqrt(0 * 0 + ... + 0 * 0) is going to be sqrt(0)...which is 0. And so, when we normalize a vector where each coordinate is 0...we'll be seeing a lot of 0/0 errors. So GSL::Vector's normalize isn't failing, it's Math itself that is betraying us.
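The same arithmetic can be checked in plain Ruby, without GSL. This is only a sketch of the normalization formula quoted above, applied to an ordinary Array of Floats; the `normalize` method here is written for illustration and is not GSL::Vector#normalize.

```ruby
# Euclidean normalization as described above, on a plain Array of Floats.
def normalize(vector)
  magnitude = Math.sqrt(vector.inject(0.0) { |sum, x| sum + x * x })
  vector.map { |x| x / magnitude }
end

unit   = normalize([3.0, 4.0])      # magnitude is 5.0
broken = normalize([0.0, 0.0, 0.0]) # magnitude is 0.0, so every entry is 0.0 / 0.0
```

`unit` comes out as a vector of length 1, while every entry of `broken` is NaN, because floating-point 0.0 / 0.0 is NaN rather than an exception.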

Would it be okay to write some code to check if a normalized vector has any "NaNs", and then replace all instances of NaNs with 0s, so that vector multiplication can still occur properly? Or would doing so be seen as too hacky?
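One way to picture that idea is to check the magnitude before dividing, rather than scrubbing NaNs afterwards. The helper below is hypothetical (the name `normalize_or_zero` is made up, and it works on plain Arrays, not GSL::Vector); it is a sketch of the approach, not a claim about what the library does.

```ruby
# Hypothetical helper: normalize a vector, but return an all-zero vector
# when the magnitude is zero instead of dividing 0.0 by 0.0.
def normalize_or_zero(vector)
  magnitude = Math.sqrt(vector.inject(0.0) { |sum, x| sum + x * x })
  return vector.map { 0.0 } if magnitude.zero?

  vector.map { |x| x / magnitude }
end

safe_zero = normalize_or_zero([0.0, 0.0, 0.0])
safe_unit = normalize_or_zero([3.0, 4.0])
```

With this choice a zero vector has a dot product of 0 with every other vector, so it would rank as "not similar to anything" instead of raising. Whether that is the right semantics is a judgment call.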

EDIT3:
StackOverflow Link: How do you normalize a zero-vector?, with some possible yet unappealing answers to this dilemma. Fairly depressing stuff.

@Ch4s3
Member

Ch4s3 commented Aug 19, 2016

I wish I had paid more attention to Calculus in college
same.

Yeah, some form of this problem has been causing issues since the first versions. If you're willing to try and put together a PR with some normalization and NaN handling, that would be amazing!

@Ch4s3
Member

Ch4s3 commented Aug 19, 2016

I'd be curious to know more about the input that's causing that error.

@tra38
Contributor Author

tra38 commented Sep 15, 2016

Here is the source code of the input that was causing the error.

@Ch4s3
Member

Ch4s3 commented Sep 30, 2016

@tra38 sorry for the delay.

So I was able to reproduce the issue with the following searches:

array = lsi.search("we",9)
array = lsi.search("we can p",9)

But, array = lsi.search("we can predict", 9) works. :(

Maybe we can catch this error, and respond with a sensible message.
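As a caller-side stopgap, catching the error could look roughly like the sketch below. The `search_or_empty` wrapper is hypothetical and not part of classifier-reborn; it only assumes that `index` responds to `search(query, max_results)` and that the failure surfaces as an ArgumentError, as in the backtrace above. `FakeIndex` stands in for an LSI instance so the sketch runs on its own.

```ruby
# Hypothetical wrapper: return an empty result list instead of crashing
# when the underlying search raises ArgumentError.
def search_or_empty(index, query, max_results)
  index.search(query, max_results)
rescue ArgumentError => e
  warn "search for #{query.inspect} failed: #{e.message}"
  []
end

# Stand-in for an LSI index whose search fails the way the issue describes.
class FakeIndex
  def search(_query, _max_results)
    raise ArgumentError, "comparison of Float with NaN failed"
  end
end

results = search_or_empty(FakeIndex.new, "we can p", 9)
```

Note that rescuing ArgumentError this broadly would also hide unrelated mistakes such as a wrong number of arguments, so it is a workaround rather than a fix.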

@tra38
Contributor Author

tra38 commented Oct 9, 2016

array = lsi.search("we",9)
array = lsi.search("we can p",9)

But, array = lsi.search("we can predict", 9) works. :(

This makes logical sense (for a computer, I mean). Please forgive me if this seems a bit too technical, but I wanted to write the following explanation down to clarify for myself what's going on:

"we" and "can" are stopwords, so obviously the computer will ignore them. "p" isn't a stop word, but ClassifierReborn::Hasher.word_hash_for_words only stores words that have more than 2 characters. "p" has 2 or fewer characters, so...we throw it away. The end result is that we are asking LSI to find a document that is similar to an empty hash, and obviously none of the documents we have are empty hashes. So we have NaN vectors and errors galore.
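The filtering described above can be sketched with a toy stand-in. This is not the real ClassifierReborn::Hasher: the `STOPWORDS` list and the `toy_word_hash` method are made up for illustration, and the sketch only covers the two rules mentioned here (stopword removal and dropping words of 2 or fewer characters).

```ruby
require 'set'

# Hypothetical, tiny stopword list; the real list is much longer.
STOPWORDS = Set.new(%w[we can the and])

# Toy stand-in for the word-hash step: count words that are neither
# stopwords nor 2 characters or shorter.
def toy_word_hash(text)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z]+/).each do |word|
    next if STOPWORDS.include?(word) || word.length <= 2

    counts[word.to_sym] += 1
  end
  counts
end

empty_query = toy_word_hash("we can p")
valid_query = toy_word_hash("we can predict")
```

Under these assumptions "we can p" reduces to an empty hash, while "we can predict" keeps a single entry for `:predict`.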

Interestingly, if I throw in the sentence "we can p" into my array of strings, LSI searching breaks entirely and you will be unable to search for anything. This is how the computer views the sentence "we can p":

=> #<ClassifierReborn::ContentNode:0x007f9aaa208248
 @categories=[],
 @lsi_norm=GSL::Vector
[   nan   nan   nan   nan   nan   nan   nan ... ],
 @lsi_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @raw_norm=GSL::Vector
[   nan   nan   nan   nan   nan   nan   nan ... ],
 @raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @word_hash={}>

(This is the same ContentNode that is constructed if we use "we can p" as a search term instead of as a document.)

ContentNode#proximity_norms_for_content, the method we use for searching, will multiply the search term's content node's lsi_norm with all existing content nodes' lsi_norms, to determine how similar each content node is to the search term. And since the lsi_norm in this ContentNode is NaN, we can be assured that an error will occur. (This, by the way, is an in-depth explanation of how #64 can occur.)
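To see why a single such document breaks every search, here is a small plain-Ruby sketch. It assumes the "multiplication" is a dot product between normalized vectors; the vectors, the document names, and the `dot` helper are invented for illustration.

```ruby
# Dot product of two equal-length Arrays of Floats.
def dot(a, b)
  a.zip(b).inject(0.0) { |sum, (x, y)| sum + x * y }
end

nan = 0.0 / 0.0

# A perfectly valid, normalized query vector.
query_norm = [0.6, 0.8]

# Hypothetical corpus: one ordinary document and one made only of stopwords.
document_norms = {
  "normal document" => [0.8, 0.6],
  "stopwords only"  => [nan, nan]
}

scores = document_norms.map { |name, norm| [name, dot(query_norm, norm)] }.to_h
```

The score for "stopwords only" is NaN no matter what the query is, because any arithmetic involving NaN yields NaN. Once one score is NaN, sorting the scores fails for every search.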

On the other hand...

array = lsi.search("we can predict", 9)

Well, "predict" isn't a stopword. It's a word that has more than 2 characters. So a new ContentNode can be created, which only includes the word "predict":

#<ClassifierReborn::ContentNode:0x007f8b131d8df0
 @categories=[],
 @lsi_norm=GSL::Vector
[ 5.018e-03 5.836e-03 9.885e-04 -2.808e-02 5.018e-03 5.044e-02 -2.720e-03 ... ],
 @lsi_vector=GSL::Vector
[ 3.550e-03 4.129e-03 6.994e-04 -1.987e-02 3.550e-03 3.569e-02 -1.924e-03 ... ],
 @raw_norm=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @raw_vector=GSL::Vector
[ 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 ... ],
 @word_hash={:predict=>1}>

Since the lsi_norm here is not full of NaNs, searching works as normal.

...However, keep in mind that to replicate the error that led to me posting this issue, you need to type this out...

array = lsi.search("dogs", 9)

All we have done so far is discover more bugs in the system.

So there are two major issues to worry about then.

  1. Dealing with invalid search terms...where the word doesn't appear in the corpus at all
  2. Dealing with "invalid" documents...where the document is composed of all stop words.

Number 2 is more of an edge case, since the larger the document, the more likely it is that there will be words that aren't stop words. So I'll probably focus on dealing with Number 1 (normalization and NaN handling).

@Ch4s3
Member

Ch4s3 commented Nov 29, 2016

fixed by #77

@Ch4s3 Ch4s3 closed this as completed Nov 29, 2016