
A Lucene/Solr filter and filter factory that folds certain CJK characters to improve recall. For example, it converts some modern Japanese Kanji characters to their traditional equivalents (when the modern Kanji doesn't map to the simplified Han character). Used by SearchWorks at index and query time.


sul-dlss/CJKFilterUtils


CJKFilterUtils


This is a Lucene filter and filter factory (see http://lucene.apache.org) that folds certain CJK characters to improve recall. Put it in your analysis chain BEFORE any ICU transform from Traditional to Simplified Han: it converts modern Japanese Kanji to their traditional equivalents, so the later transform sees the traditional forms.

Usage

  • clone the project

git clone https://github.com/sul-dlss/CJKFilterUtils.git

  • run the maven installation

mvn clean install

  • put the CJKFilterUtils*.jar file found in the target directory into your Solr lib directory
  • use the CJKFoldingFilterFactory in a field type's analyzer chain in your schema.xml file
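
The last step might look like the schema.xml fragment below. This is a minimal sketch, not the project's shipped configuration. The field type name text_cjk is made up for illustration, and the fully qualified class name edu.stanford.lucene.analysis.CJKFoldingFilterFactory is an assumption that should be checked against the classes in the jar you built. The ICU tokenizer and transform factories are standard Solr classes, but they need Solr's ICU analysis jars on the classpath.

```xml
<!-- Hypothetical field type; the name "text_cjk" is illustrative only. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Assumed package name; verify it against the built CJKFilterUtils jar. -->
    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
    <!-- Placed AFTER the folding filter, as described at the top of this README. -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```

A single analyzer element with no type attribute applies the same chain at index and query time, which is how the filter is described as being used by SearchWorks.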

Checking the example locally

(Uses Ruby)

Install Ruby dependencies

$ bundle install

Setup Solr with CJKFilterUtils and config/schema

$ bundle exec rake setup_server

Run solr_wrapper

$ solr_wrapper

In another shell, index fixtures

$ bundle exec rake fixtures

Run some queries (these should return results):

$ curl -G 'http://127.0.0.1:8983/solr/test/select' -d 'debugQuery=on' -d 'indent=on' -d 'wt=json' --data-urlencode 'q=cjk_test:呂思勉两晋南北朝'

$ curl -G 'http://127.0.0.1:8983/solr/test/select' -d 'debugQuery=on' -d 'indent=on' -d 'wt=json' --data-urlencode 'q=cjk_test:俞平伯红楼梦'

$ curl -G 'http://127.0.0.1:8983/solr/test/select' -d 'debugQuery=on' -d 'indent=on' -d 'wt=json' --data-urlencode 'q=cjk_test:南洋'

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Added some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request
