Add Numo/LAPACK SVD Option
**Background:**
The slow step of LSI is the SVD (singular value decomposition) of a
matrix. With even a relatively small collection of documents (say, about
20 blog posts), the native Ruby implementation is too slow to be usable.

To work around this problem, classifier-reborn can optionally use the
`gsl` gem, which wraps the [GNU Scientific
Library](https://www.gnu.org/software/gsl/), for matrix calculations.
This is at least an order of magnitude faster than the Ruby-only matrix
decomposition, and fast enough that using LSI with Jekyll finishes in a
reasonable amount of time.

Unfortunately, [rb-gsl](https://github.com/SciRuby/rb-gsl) is
unmaintained. There is a commit on main that makes it compatible with
Ruby 3, but nobody has released the gem, so the only way to use rb-gsl
with Ruby 3 right now is to specify the git hash in your Gemfile.
See SciRuby/rb-gsl#67.

**Changes:**
In this PR, my goal is to provide an alternative matrix implementation
that can perform the singular value decomposition quickly and works with
Ruby 3. Doing so will allow classifier-reborn to be used with Ruby 3
without depending on the unmaintained/unreleased GSL gem. Options for
Ruby matrix libraries are somewhat limited, but
[Numo](https://github.com/ruby-numo) seems to be more actively
maintained than rb-gsl, and Numo has a working Ruby 3 implementation
that can perform a singular value decomposition. This requires
[numo-narray](https://github.com/ruby-numo/numo-narray) and
[numo-linalg](https://github.com/ruby-numo/numo-linalg).

My goal is to allow users to (optionally) use classifier-reborn the same
way they would use it with GSL. That is, the user should install the
`numo-narray` and `numo-linalg` gems, and classifier-reborn will detect
and use them if they are found.
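
The detect-and-fall-back behavior described above can be sketched in isolation. This is a minimal illustration, not the code in this commit: `detect_svd_backend` is a hypothetical helper, and the only names taken from the PR are the library paths and the `:numo`, `:gsl`, and `:ruby` symbols.

```ruby
# Hypothetical helper, not part of classifier-reborn: returns the name of the
# first backend whose libraries can all be required, or :ruby if none can.
def detect_svd_backend(candidates)
  candidates.each do |name, libraries|
    begin
      libraries.each { |library| require library }
      return name
    rescue LoadError
      next # this backend is not installed; try the next one
    end
  end
  :ruby
end

# Preference order as described in the PR: Numo first, then GSL.
backends = {
  numo: ['numo/narray', 'numo/linalg'],
  gsl: ['gsl']
}

puts "SVD backend: #{detect_svd_backend(backends)}"
```

On a machine with neither set of gems installed, the helper falls through to `:ruby`, which matches the intent that the faster libraries stay optional.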
mkasberg committed May 28, 2022
1 parent fb5da8e commit 3dad285
Showing 8 changed files with 97 additions and 29 deletions.
18 changes: 12 additions & 6 deletions .github/workflows/ci.yml
@@ -14,17 +14,17 @@ on:

 jobs:
   ci:
-    name: "Run Tests (Ruby ${{ matrix.ruby_version }}, GSL: ${{ matrix.gsl }})"
+    name: "Run Tests (Ruby ${{ matrix.ruby_version }}, Lib: ${{ matrix.matrix_lib }})"
     runs-on: "ubuntu-latest"
     env:
       # See https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby#matrix-of-gemfiles
       BUNDLE_GEMFILE: ${{ matrix.gemfile }}
-      LOAD_GSL: ${{ matrix.gsl }}
+      MATRIX_LIB: ${{ matrix.matrix_lib }}
     strategy:
       fail-fast: false
       matrix:
         ruby_version: ["2.7", "3.0", "3.1", "jruby-9.3.4.0"]
-        gsl: [true, false]
+        matrix_lib: ["none", "gsl", "lapack"]
         # We use `include` to assign the correct Gemfile for each ruby_version
         include:
           - ruby_version: "2.7"
@@ -39,17 +39,23 @@ jobs:
           # Ruby 3.0 does not work with the latest released gsl gem
           # https://github.com/SciRuby/rb-gsl/issues/67
           - ruby_version: "3.0"
-            gsl: true
+            matrix_lib: "gsl"
           # Ruby 3.1 does not work with the latest released gsl gem
           # https://github.com/SciRuby/rb-gsl/issues/67
           - ruby_version: "3.1"
-            gsl: true
+            matrix_lib: "gsl"
           # jruby-9.3.4.0 doesn't easily build the gsl gem on a GitHub worker. Skipping for now.
           - ruby_version: "jruby-9.3.4.0"
-            gsl: true
+            matrix_lib: "gsl"
+          # jruby-9.3.4.0 doesn't easily build the numo gems on a GitHub worker. Skipping for now.
+          - ruby_version: "jruby-9.3.4.0"
+            matrix_lib: "lapack"
     steps:
       - name: Checkout Repository
         uses: actions/checkout@v3
+      - name: Install Lapack
+        if: ${{ matrix.matrix_lib == 'lapack' }}
+        run: sudo apt-get install -y liblapacke-dev libopenblas-dev
       - name: "Set up ${{ matrix.label }}"
         uses: ruby/setup-ruby@v1
         with:
2 changes: 1 addition & 1 deletion .rubocop.yml
@@ -1,7 +1,7 @@
 inherit_from: .rubocop_todo.yml

 Style/GlobalVars:
-  AllowedVariables: [$GSL]
+  AllowedVariables: [$SVD]

 Naming/MethodName:
   Exclude:
7 changes: 6 additions & 1 deletion Gemfile
@@ -4,4 +4,9 @@ source 'https://rubygems.org'
 gemspec name: 'classifier-reborn'

 # For testing with GSL support & bundle exec
-gem 'gsl' if ENV['LOAD_GSL'] == 'true'
+gem 'gsl' if ENV['MATRIX_LIB'] == 'gsl'
+
+if ENV['MATRIX_LIB'] == 'lapack'
+  gem 'numo-narray'
+  gem 'numo-linalg'
+end
65 changes: 54 additions & 11 deletions lib/classifier-reborn/lsi.rb
@@ -4,16 +4,31 @@
 # Copyright:: Copyright (c) 2005 David Fayram II
 # License:: LGPL

+# Try to load Numo first - it's the most current and the most well-supported.
+# Fall back to GSL.
+# Fall back to native vector.
 begin
   raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`
+  raise LoadError if ENV['GSL'] == 'true'

-  require 'gsl' # requires https://github.com/SciRuby/rb-gsl
-  require_relative 'extensions/vector_serialize'
-  $GSL = true
+  require 'numo/narray'
+  require 'numo/linalg'
+  $SVD = :numo
+  puts 'Using Numo!'
 rescue LoadError
-  $GSL = false
-  require_relative 'extensions/vector'
-  require_relative 'extensions/zero_vector'
+  begin
+    raise LoadError if ENV['NATIVE_VECTOR'] == 'true' # to test the native vector class, try `rake test NATIVE_VECTOR=true`
+
+    require 'gsl' # requires https://github.com/SciRuby/rb-gsl
+    require_relative 'extensions/vector_serialize'
+    $SVD = :gsl
+    puts 'Using GSL!'
+  rescue LoadError
+    puts 'Using Ruby!'
+    $SVD = :ruby
+    require_relative 'extensions/vector'
+    require_relative 'extensions/zero_vector'
+  end
 end

 require_relative 'lsi/word_list'
@@ -140,7 +155,15 @@ def build_index(cutoff = 0.75)
       doc_list = @items.values
       tda = doc_list.collect { |node| node.raw_vector_with(@word_list) }

-      if $GSL
+      if $SVD == :numo
+        tdm = Numo::NArray.asarray(tda.map(&:to_a)).transpose
+        ntdm = numo_build_reduced_matrix(tdm, cutoff)
+
+        ntdm.each_over_axis(1).with_index do |col_vec, i|
+          doc_list[i].lsi_vector = col_vec
+          doc_list[i].lsi_norm = col_vec / Numo::Linalg.norm(col_vec)
+        end
+      elsif $SVD == :gsl
         tdm = GSL::Matrix.alloc(*tda).trans
         ntdm = build_reduced_matrix(tdm, cutoff)

@@ -201,7 +224,9 @@ def proximity_array_for_content(doc, &block)
       content_node = node_for_content(doc, &block)
       result =
         @items.keys.collect do |item|
-          val = if $GSL
+          val = if $SVD == :numo
+                  content_node.search_vector.dot(@items[item].transposed_search_vector)
+                elsif $SVD == :gsl
                   content_node.search_vector * @items[item].transposed_search_vector
                 else
                   (Matrix[content_node.search_vector] * @items[item].search_vector)[0]
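
The three branches in this hunk appear intended to compute the same quantity, the inner product of two search vectors. A minimal sketch of why the pure-Ruby form ends in `[0]`, using only the standard `matrix` library; the helper names and sample arrays are made up for illustration and do not come from classifier-reborn.

```ruby
require 'matrix'

# Pure-Ruby style: a 1xN Matrix times an N-element Vector yields a
# one-element Vector, so [0] extracts the scalar.
def dot_via_matrix(left, right)
  (Matrix[left] * Vector[*right])[0]
end

# The same value written as a plain inner product.
def dot_via_inner_product(left, right)
  Vector[*left].inner_product(Vector[*right])
end

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

puts dot_via_matrix(a, b)        # 1*4 + 2*5 + 3*6
puts dot_via_inner_product(a, b)
```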
@@ -220,7 +245,8 @@ def proximity_norms_for_content(doc, &block)
       return [] if needs_rebuild?

       content_node = node_for_content(doc, &block)
-      if $GSL && content_node.raw_norm.isnan?.all?
+      # TODO handle numo?
+      if $SVD == :gsl && content_node.raw_norm.isnan?.all?
         puts "There are no documents that are similar to #{doc}"
       else
         content_node_norms(content_node)
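
The GSL guard in this hunk checks whether every component of the normalized query vector is NaN, which is what dividing an all-zero vector by its zero norm produces under floating-point arithmetic. A library-independent sketch of that check on plain arrays follows; `naive_normalize` and `all_nan?` are hypothetical names, and whether the Numo branch should adopt something similar is left open by the TODO.

```ruby
# Hypothetical helpers for illustration; they operate on plain Ruby arrays.

# Divide every component by the Euclidean norm, with no zero check.
def naive_normalize(values)
  norm = Math.sqrt(values.sum { |v| v * v })
  values.map { |v| v / norm } # 0 / 0.0 evaluates to NaN rather than raising
end

# True when every component is NaN, mirroring `raw_norm.isnan?.all?`.
def all_nan?(values)
  values.all? { |v| v.respond_to?(:nan?) && v.nan? }
end

puts all_nan?(naive_normalize([0.0, 0.0, 0.0])) # all-zero vector
puts all_nan?(naive_normalize([3.0, 4.0]))      # ordinary vector
```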
@@ -230,7 +256,9 @@ def proximity_norms_for_content(doc, &block)
     def content_node_norms(content_node)
       result =
         @items.keys.collect do |item|
-          val = if $GSL
+          val = if $SVD == :numo
+                  content_node.search_norm.dot(@items[item].search_norm)
+                elsif $SVD == :gsl
                   content_node.search_norm * @items[item].search_norm.col
                 else
                   (Matrix[content_node.search_norm] * @items[item].search_norm)[0]
@@ -332,7 +360,22 @@ def build_reduced_matrix(matrix, cutoff = 0.75)
         s[ord] = 0.0 if s[ord] < s_cutoff
       end
       # Reconstruct the term document matrix, only with reduced rank
-      u * ($GSL ? GSL::Matrix : ::Matrix).diag(s) * v.trans
+      # TODO handle numo
+      u * ($SVD == :gsl ? GSL::Matrix : ::Matrix).diag(s) * v.trans
     end

+    def numo_build_reduced_matrix(matrix, cutoff = 0.75)
+      # OPTIMIZE ME: Consider other drivers/options like sdd.
+      s, u, vt = Numo::Linalg.svd(matrix, driver: 'svd', job: 'S')
+
+      # TODO: Better than 75% term, please. :\
+      s_cutoff = s.sort.reverse[(s.size * cutoff).round - 1]
+      s.size.times do |ord|
+        s[ord] = 0.0 if s[ord] < s_cutoff
+      end
+
+      # Reconstruct the term document matrix, only with reduced rank
+      u.dot(::Numo::DFloat.eye(s.size) * s).dot(vt)
+    end
+
     def node_for_content(item, &block)
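
The cutoff step in `numo_build_reduced_matrix` keeps roughly the largest `cutoff` fraction of singular values and zeroes the rest before the matrix is reconstructed. A minimal sketch of just that step on a plain Ruby array; `truncate_singular_values` is a hypothetical helper, the sample values are made up, and the sketch returns a new array where the committed code mutates `s` in place.

```ruby
# Hypothetical helper mirroring the cutoff logic on a plain Array.
def truncate_singular_values(singular_values, cutoff = 0.75)
  # Pick the threshold at the cutoff position of the descending-sorted values.
  threshold = singular_values.sort.reverse[(singular_values.size * cutoff).round - 1]
  # Zero every value strictly below the threshold; keep the others unchanged.
  singular_values.map { |value| value < threshold ? 0.0 : value }
end

p truncate_singular_values([5.0, 3.0, 2.0, 1.0]) # => [5.0, 3.0, 2.0, 0.0]
```

With four values and the default cutoff of 0.75, the threshold index is `(4 * 0.75).round - 1 = 2`, so the third-largest value (2.0) is the threshold and only 1.0 is zeroed.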
24 changes: 18 additions & 6 deletions lib/classifier-reborn/lsi/content_node.rb
@@ -29,7 +29,12 @@ def search_vector

     # Method to access the transposed search vector
     def transposed_search_vector
-      search_vector.col
+      if $SVD == :numo
+        # TODO is this OK?
+        search_vector
+      else
+        search_vector.col
+      end
     end

     # Use this to fetch the appropriate search vector in normalized form.
@@ -40,7 +45,9 @@ def search_norm
     # Creates the raw vector out of word_hash using word_list as the
     # key for mapping the vector space.
     def raw_vector_with(word_list)
-      vec = if $GSL
+      vec = if $SVD == :numo
+              Numo::DFloat.zeros(word_list.size)
+            elsif $SVD == :gsl
               GSL::Vector.alloc(word_list.size)
             else
               Array.new(word_list.size, 0)
@@ -51,7 +58,9 @@ def raw_vector_with(word_list)
       end

       # Perform the scaling transform and force floating point arithmetic
-      if $GSL
+      if $SVD == :numo
+        total_words = vec.sum.to_f
+      elsif $SVD == :gsl
         sum = 0.0
         vec.each { |v| sum += v }
         total_words = sum
@@ -61,7 +70,7 @@ def raw_vector_with(word_list)

       total_unique_words = 0

-      if $GSL
+      if [:numo, :gsl].include?($SVD)
         vec.each { |word| total_unique_words += 1 if word != 0.0 }
       else
         total_unique_words = vec.count { |word| word != 0 }
@@ -85,12 +94,15 @@ def raw_vector_with(word_list)
           hash[val] = Math.log(val + 1) / -weighted_total
         end

-        vec.collect! do |val|
+        vec = vec.map do |val|
           cached_calcs[val]
         end
       end

-      if $GSL
+      if $SVD == :numo
+        @raw_norm = vec / Numo::Linalg.norm(vec)
+        @raw_vector = vec
+      elsif $SVD == :gsl
         @raw_norm = vec.normalize
         @raw_vector = vec
       else
2 changes: 1 addition & 1 deletion test/extensions/matrix_test.rb
@@ -2,7 +2,7 @@

 class MatrixTest < Minitest::Test
   def test_zero_division
-    skip "extensions/vector is only used by non-GSL implementation" if $GSL
+    skip "extensions/vector is only used by non-GSL implementation" if $SVD != :ruby

     matrix = Matrix[[1, 0], [0, 1]]
     matrix.SV_decomp
2 changes: 1 addition & 1 deletion test/extensions/zero_vector_test.rb
@@ -2,7 +2,7 @@

 class ZeroVectorTest < Minitest::Test
   def test_zero?
-    skip "extensions/zero_vector is only used by non-GSL implementation" if $GSL
+    skip "extensions/zero_vector is only used by non-GSL implementation" if $SVD != :ruby

     vec0 = Vector[]
     vec1 = Vector[0]
6 changes: 4 additions & 2 deletions test/lsi/lsi_test.rb
@@ -1,6 +1,8 @@
 # frozen_string_literal: true

 require File.dirname(__FILE__) + '/../test_helper'
+# require_relative '../test_helper'
+# require 'debug'

 class LSITest < Minitest::Test
   def setup
@@ -163,7 +165,7 @@ def test_cached_content_node_option
   end

   def test_clears_cached_content_node_cache
-    skip "transposed_search_vector is only used by GSL implementation" unless $GSL
+    skip "transposed_search_vector is only used by GSL implementation" unless $SVD == :gsl

     lsi = ClassifierReborn::LSI.new(cache_node_vectors: true)
     lsi.add_item @str1, 'Dog'
@@ -192,7 +194,7 @@ def test_keyword_search
   end

   def test_invalid_searching_when_using_gsl
-    skip "Only GSL currently raises invalid search error" unless $GSL
+    skip "Only GSL currently raises invalid search error" unless $SVD == :gsl

     lsi = ClassifierReborn::LSI.new
     lsi.add_item @str1, 'Dog'
