Lucene search (#11542)

* Use pattern matching for cast
  Co-authored-by: Christoph <siedlerkiller@gmail.com>
* Fix pattern matching
* Fix merge
* Speed up switches between sorting/filtering modes
* Fixed merge errors
* Fixed small issues
* Removed obsolete tests, fixed some tests
* Fixed merge error in CHANGELOG.md
* Fixed checkstyle
* Fixed more tests
* Removed obsolete tests
* Fixes "Fixed merge error in CHANGELOG.md" by removing duplicate entries
  This reverts commit 536ecfa.
* WiP on tests
* Checkstyle
* Checkstyle
* Update Java version
* Refine logging
* Fix compile error
* Add LuceneTest
* Update CHANGELOG.md
* Move search classes to pdf package
* Move search classes to search package
* rewriteRun
* Remove bibEntry from DocumentReader
* Rewrite LuceneIndexer
* Remove IndexingTaskManager
* Separate Bib fields index and LinkedFiles index
* Fix null LuceneManager in ExternalFilesEntryLinker
* Save as action
* Clear linkedFiles indexer when fullText indexing is disabled in preferences
* remove comments
* get indexed files on update
* Add LUCENE_MANAGERS map for accessing managers by databaseContext.getUid
* Move LuceneManager from search.indexing package to search
* Fix wrong order for import
* Move SearchQuery to model package
* Fix issue with opening multiple unsaved libraries
* Pass LuceneManager down to the entry editor
* Improve searching performance
* Change SearchFieldConstants to enum
* More performance improvements for searching
  - Read document only one time
  - getHighlighterFragments only when the search results tab is opened
* Update FulltextSearchResultsTab.java
* Fix group union, intersection
* Fix backgroundtask
* Fix subscriptions
* Remove lastSearchQueryLogic
* Fix possible NPE
* Fix searchTask check
* Remove sort by score flag
* Fix score column sorting
* Fix modifier buttons listener
* Add search rank column
  In floating mode entries will be ranked and sorted by it.
  Rank: (1= entry matches group and search, 2= matches group but not search, 3= matches search but not group, 4= matches nothing)
* hide search rank column from preferences
* Add search_rank column to sort order by default
* Update CHANGELOG.md
* fix typo
* Change the order of the rank
  1= entry matches group and search, 2= matches search but not group, 3= matches group but not search, 4= matches nothing
* Use NGramAnalyzer for indexing
* Resolve conflicts
* update search matches with lucene
* PreviewViewer highlighting with Lucene
* Delete IndexingTaskManager.java
* SourceTab highlighting with Lucene
* Fix non-ASCII characters
* Extract query terms from search query
* Highlight regex queries
* return js highlight function
* Fix invalid search query throw exception
* Refactor Lucene indexer classes
* Refactor linked files indexer
* Update search matches when entries are added or updated
* Remove preferences from ActionHelper
* checkstyle
* comment out search tests
* OpenRewrite
* Fix Groups Parser/Serializer
* Localization
* Search groups
* Release `IndexSearcher` after completing search task
* Checkstyle
* Correct typo
* Remove GroupSearchQuery
* Remove EventBus from LuceneManager and use BibDatabase eventBus
* Fix number of matched entries in groups
* Fix search groups
* Localization
* Remove bib fields highlighter
* Pass LuceneManager to search groups
* Fix performance issues by caching matched entries
* Update GroupDialogViewModelTest.java
* Update main table matches
* Fix groups icon
* Restore Search.g4 and GrammarBasedSearchRule
* First version of search group migration
  Co-authored-by: Loay Ghreeb <52158423+LoayGhreeb@users.noreply.github.com>
* Add groups field to the index
* Remove search rules
* Localization
* Add test cases
* Fix names
  Co-authored-by: Loay Ghreeb <52158423+LoayGhreeb@users.noreply.github.com>
* Add some more functionality
  Co-authored-by: Loay Ghreeb <52158423+LoayGhreeb@users.noreply.github.com>
* Always add "all" prefix
  Co-authored-by: Loay Ghreeb <52158423+LoayGhreeb@users.noreply.github.com>
* Add comment for alternative implementation
  Co-authored-by: Loay Ghreeb <52158423+LoayGhreeb@users.noreply.github.com>
* Mark library tab changed after migration
  Co-authored-by: Loay Ghreeb <52158423+LoayGhreeb@users.noreply.github.com>
* Add another test for regular expression
  Co-authored-by: Loay Ghreeb <52158423+LoayGhreeb@users.noreply.github.com>
* Small fixes
* Fix markBaseChanged
* Fix adding new entries did not update MatchCategory
* Fix searching for Non-ASCII characters
* Fix escaping special characters
  Use WhitespaceTokenizer instead of StandardTokenizer
  https://stackoverflow.com/a/6119584/21694752
* Fix tests
  Co-Authored-By: Oliver Kopp <kopp.dev@gmail.com>
* Add first draft of LatexToUnicodeFoldingFilter
  Co-authored-by: Loay Ghreeb <52158423+LoayGhreeb@users.noreply.github.com>
* Fix LatexToUnicodeFoldingFilter
  Co-Authored-By: Oliver Kopp <kopp.dev@gmail.com>
* Remove LatexToUnicode from SearchQuery
* Localization
* AllowedToUseLogic
* Update CHANGELOG.md
* Use sentence case for search result heading
* Add CHANGELOG for change in JabRefFrameViewModel
* Add more changes to CHANGELOG.md
* Add ADR-0038
* Rename "SCORE" to "MATCH_SCORE"
* Add link to ADR-0038
* Add another CHANGELOG.md entry
* Add CHANGELOG.md entry
* Revert change of filename
* Add JavaDoc comment
* Trying to find better names
* Discard changes to src/main/resources/tinylog.properties
* Remove commented out code
* Remove obsolete testing class
* Remove obsolete test
* Discard changes to src/test/resources/tinylog-test.properties
* Remove completely disabled code
* Rename "all" to "any"
* Catch thrown exception
  Invalid regex queries throws an exception
* Remove groups field from the default field
  #7996
* Remove SearchGroupsListener
* Update Benchmarks.java
* Update module-info.java
* Fixes from code review on LibraryTab
* Remove regex button from search bar
* Use BibEntry.getId instead of System.identityHashCode
* Add BibEntry index map
* Readd option
* Add `@ADR` annotation
* Add some comment
* One more annotation
* Add CHANGELOG.md entry
* One more annotation
* Add CHANGELOG.md entry
* Revert "Add BibEntry index map"
  This reverts commit 27ed105.
* Use binary search to find the index of the entry
* openrewrite
* Tests for LinkedFilesIndexer
* Fix DatabaseSearcher
* LocalizationConsistencyTest
* DatabaseSearcherWithBibFilesTest
* Fix typo in CHANGELOG.md
* Fix typo
* Use parameterized test for DatabaseSearcherTest
* Fix DatabaseSearcherWithBibFiles tests
* Fix exportMatches test
* Remove regex check box from search groups dialog
* JavaDoc
* Fix SearchGroups test
* Remove closeAndWait methods and use CurrentThreadTaskExecutor
* Fix architecture test
* Allow to use logic
* Add debug logging for search
* Add more logging
* Assert with containsInAnyOrder
* Fix DatabaseSearcher test
* Global search dialog
* Rename method
* Improve code quality
  - Maintain a map of BibEntryId to BibEntry.
  - Store search results within SearchQuery instead of using the map in StateManager.
  - Remove LuceneManager from SearchGroups.
  - Use a different Analyzer for PDFs.
* Use non-static preferences variables
* Update CHANGELOG.md
* Delete SearchGroupTest.java
* fix typo
* fix indentation
* Update matchedEntries on the UI thread
  matchedEntries should be updated on the UI thread because the size binding of matchedEntries will be reflected in the UI.
* Discard changes to src/main/java/org/jabref/gui/importer/actions/GUIPostOpenAction.java
* Fix LoayGhreeb#12
* Sync search flags between search bar and global search bar
* Move VERSION_6_0_ALPHA const to SearchGroupsMigrationAction
* Refactor LuceneSearcher
* Use linked files analyzer for highlighting full-text results
* Fix line break
* Fix tests
* Use EnglishAnalyzer for indexing/searching linked files
  https://github.com/apache/lucene/blob/68cc8734ca28a9db800e4192a636d3b490cfd41a/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L101-L110
* Ask to wait for linked files indexing on shutdown
  When closing JabRef, only ask users to wait for the linked files indexer to finish. The bib fields indexer is recalculated on startup, so it doesn't need to be completed before shutdown.
* Use EdgeNGram instead of NGram
* Return comment
* Update CHANGELOG.md

---------

Co-authored-by: Benedikt Tutzer <btut@users.noreply.github.com>
Co-authored-by: Christoph <siedlerkiller@gmail.com>
Co-authored-by: Carl Christian Snethlage <50491877+calixtus@users.noreply.github.com>
Co-authored-by: Oliver Kopp <kopp.dev@gmail.com>
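
The commit message above describes a search rank column whose final order is 1 = matches group and search, 2 = matches search but not group, 3 = matches group but not search, 4 = matches nothing. A minimal sketch of that ordering follows. It is not JabRef's code: `SearchRankSketch`, `Rank`, and `Row` are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; the names below are illustrative and not taken from JabRef.
final class SearchRankSketch {

    /** The four ranks in their final order; a lower value sorts first. */
    enum Rank {
        GROUP_AND_SEARCH(1),
        SEARCH_ONLY(2),
        GROUP_ONLY(3),
        NOTHING(4);

        final int value;

        Rank(int value) {
            this.value = value;
        }

        static Rank of(boolean matchesSearch, boolean matchesGroup) {
            if (matchesSearch && matchesGroup) {
                return GROUP_AND_SEARCH;
            }
            if (matchesSearch) {
                return SEARCH_ONLY;
            }
            return matchesGroup ? GROUP_ONLY : NOTHING;
        }
    }

    /** Minimal stand-in for a table row. */
    record Row(String citationKey, boolean matchesSearch, boolean matchesGroup) {
        Rank rank() {
            return Rank.of(matchesSearch, matchesGroup);
        }
    }

    /** Returns citation keys ordered by rank; the sort is stable, so ties keep their input order. */
    static List<String> sortedKeys(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(Comparator.comparingInt((Row row) -> row.rank().value));
        return copy.stream().map(Row::citationKey).toList();
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("nothing", false, false),
                new Row("groupOnly", false, true),
                new Row("both", true, true),
                new Row("searchOnly", true, false));
        System.out.println(sortedKeys(rows)); // prints [both, searchOnly, groupOnly, nothing]
    }
}
```

Ranking search matches ahead of group-only matches means that, in floating mode, entries relevant to the current query rise above entries that are merely in the selected group.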
JabRef · Sep 5, 2024 · 6af91b9
1 parent 059ec47
commit 6af91b9
Showing 138 changed files with 3,114 additions and 3,894 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,6 +11,9 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv
 
 ### Added
 
+- We added probable search hits instead of exact matches. Sorting by hit score can be done by the new score table column. [#11542](https://github.com/JabRef/jabref/pull/11542)
+- We added support for finding LaTeX-encoded special characters based on plain Unicode and vice versa. [#11542](https://github.com/JabRef/jabref/pull/11542)
+- When a search hits a file, the file icon of that entry is changed accordingly. [#11542](https://github.com/JabRef/jabref/pull/11542)
 - We added an AI-based chat for entries with linked PDF files. [#11430](https://github.com/JabRef/jabref/pull/11430)
 - We added an AI-based summarization possibility for entries with linked PDF files. [#11430](https://github.com/JabRef/jabref/pull/11430)
 - We added support for selecting and using CSL Styles in JabRef's OpenOffice/LibreOffice integration for inserting bibliographic and in-text citations into a document. [#2146](https://github.com/JabRef/jabref/issues/2146), [#8893](https://github.com/JabRef/jabref/issues/8893)
@@ -28,6 +31,9 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv
 
 ### Changed
 
+- The search syntax is changed to [Apache Lucene syntax](https://lucene.apache.org/core/9_11_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Overview) (also to be similar to the [online search syntax](https://docs.jabref.org/collect/import-using-online-bibliographic-database#search-syntax)). [#11542](https://github.com/JabRef/jabref/pull/11542/)
+- When searching using a regular expression, one needs to enclose the search string in `/`. [#11542](https://github.com/JabRef/jabref/pull/11542/)
+- A search in "any" fields ignores the [groups](https://docs.jabref.org/finding-sorting-and-cleaning-entries/groups). [#7996](https://github.com/JabRef/jabref/issues/7996)
 - When a communication error with an [online service](https://docs.jabref.org/collect/import-using-online-bibliographic-database) occurs, JabRef displays the HTTP error. [#11223](https://github.com/JabRef/jabref/issues/11223)
 - The Pubmed/Medline Plain importer now imports the PMID field as well. [#11488](https://github.com/JabRef/jabref/issues/11488)
 - The 'Check for updates' menu bar button is now always enabled. [#11485](https://github.com/JabRef/jabref/pull/11485)
@@ -52,9 +58,17 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv
 - We fixed an issue where text in Dark mode inside "Citation information" was not readable. [#11512](https://github.com/JabRef/jabref/issues/11512)
 - We fixed an issue where the selection of an entry in the table was lost after searching for a group. [#3176](https://github.com/JabRef/jabref/issues/3176)
 - We fixed the non-functionality of the option "Automatically sync bibliography when inserting citations" in the OpenOffice panel, when enabled in case of JStyles. [#11684](https://github.com/JabRef/jabref/issues/11684)
+- We fixed an issue where the library was not marked changed after a migration. [#11542](https://github.com/JabRef/jabref/pull/11542)
+- We fixed an issue where rebuilding the full-text search index was not working. [#11374](https://github.com/JabRef/jabref/issues/11374)
+- We fixed an issue where the progress of indexing linked files showed an incorrect number of files. [#11378](https://github.com/JabRef/jabref/issues/11378)
+- We fixed an issue where the full-text search results were incomplete. [#8626](https://github.com/JabRef/jabref/issues/8626)
+- We fixed an issue where search result highlighting was incorrectly highlighting the boolean operators. [#11595](https://github.com/JabRef/jabref/issues/11595)
+- We fixed an issue where search result highlighting was broken for complex searches. [#8067](https://github.com/JabRef/jabref/issues/8067)
 
 ### Removed
 
+- We removed support for case-sensitive and exact search. [#11542](https://github.com/JabRef/jabref/pull/11542)
+- We removed the description of search strings. [#11542](https://github.com/JabRef/jabref/pull/11542)
 - We removed support for importing using the SilverPlatterImporter (`Record INSPEC`). [#11576](https://github.com/JabRef/jabref/pull/11576)
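
The "Changed" entries above state that a regular-expression search must now be enclosed in `/`. The sketch below only illustrates that delimiter convention. It does not call Lucene, it uses `java.util.regex` as a stand-in (Lucene's regular-expression dialect differs), and it assumes a regular expression has to match a whole term. `SlashRegexSketch` and its methods are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch of the "/regex/" convention; this is not Lucene's query parser.
final class SlashRegexSketch {

    /** Returns the text between the enclosing slashes, or empty if the query is not slash-delimited. */
    static Optional<String> regexBody(String query) {
        String trimmed = query.strip();
        if (trimmed.length() >= 3 && trimmed.startsWith("/") && trimmed.endsWith("/")) {
            return Optional.of(trimmed.substring(1, trimmed.length() - 1));
        }
        return Optional.empty();
    }

    /**
     * Assumption: a slash-delimited query must match the whole term as a regular expression;
     * any other query is compared as a plain, case-insensitive term.
     */
    static boolean matches(String query, String term) {
        return regexBody(query)
                .map(body -> Pattern.compile(body).matcher(term).matches())
                .orElseGet(() -> term.equalsIgnoreCase(query.strip()));
    }

    public static void main(String[] args) {
        System.out.println(matches("/jab.*f/", "jabref")); // prints true: treated as a regular expression
        System.out.println(matches("jab.*f", "jabref"));   // prints false: treated as a literal term
    }
}
```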
 
 

diff --git a/build.gradle b/build.gradle
@@ -344,6 +344,9 @@ dependencies {
 
     implementation 'commons-io:commons-io:2.16.1'
 
+    // Even if "compileOnly" is used, IntelliJ always adds to module-info.java. To avoid issues during committing, we use "implementation" instead of "compileOnly"
+    implementation 'io.github.adr:e-adr:2.0.0-SNAPSHOT'
+
     testImplementation 'io.github.classgraph:classgraph:4.8.175'
     testImplementation 'org.junit.jupiter:junit-jupiter:5.11.0'
     testImplementation 'org.junit.platform:junit-platform-launcher:1.10.3'

diff --git a/docs/decisions/0038-use-entryId-for-bibentries.md b/docs/decisions/0038-use-entryId-for-bibentries.md
@@ -0,0 +1,31 @@
+---
+title: Use BibEntry.getId for BibEntry at indexing
+nav_order: 38
+parent: Decision Records
+---
+
+<!-- markdownlint-disable-next-line MD025 -->
+# Use `BibEntry.getId` for BibEntries at Indexing
+
+## Context and Problem Statement
+
+The `BibEntry` class has `equals` and `hashCode` implemented on the content of the bib entry.
+Thus, if two bib entries have the same type, the same fields, and the same content, they are equal.
+
+This, however, is not useful in the UI, where equal entries are not the same entries.
+
+## Decision Drivers
+
+* Simple code
+* Not changing much other JabRef code
+* Working Lucene
+
+## Considered Options
+
+* Use `BibEntry.getId` for indexing `BibEntry`
+* Use `System.identityHashCode` for indexing `BibEntry`
+* Rewrite `BibEntry` logic
+
+## Decision Outcome
+
+Chosen option: "Use `BibEntry.getId` for indexing `BibEntry`", because it is the "natural" thing to ensure distinction between two instances of a `BibEntry` object - regardless of equality.
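
The problem ADR-0038 describes can be reproduced with a small stand-in class. This is a sketch under stated assumptions, not JabRef's `BibEntry`: the `Entry` class below is hypothetical, and generating its id with `UUID` is an assumption made only so that each instance gets a unique id. It shows that a map keyed by content-equal entries collapses them into one key, while a map keyed by id keeps them apart.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.UUID;

// Hypothetical stand-in for BibEntry; only the equality behaviour described in the ADR is modelled.
final class EntryIdSketch {

    static final class Entry {
        // Assumption: any per-instance unique id serves for this illustration.
        private final String id = UUID.randomUUID().toString();
        private final String type;
        private final Map<String, String> fields;

        Entry(String type, Map<String, String> fields) {
            this.type = type;
            this.fields = Map.copyOf(fields);
        }

        String getId() {
            return id;
        }

        /** Equality is based on content (type and fields), never on the id. */
        @Override
        public boolean equals(Object other) {
            return other instanceof Entry that
                    && type.equals(that.type)
                    && fields.equals(that.fields);
        }

        @Override
        public int hashCode() {
            return Objects.hash(type, fields);
        }
    }

    /** Keys the map by the entry itself: content-equal entries overwrite each other. */
    static int distinctWhenKeyedByEntry(Entry... entries) {
        Map<Entry, Integer> positions = new HashMap<>();
        for (int i = 0; i < entries.length; i++) {
            positions.put(entries[i], i);
        }
        return positions.size();
    }

    /** Keys the map by id: every instance keeps its own slot. */
    static int distinctWhenKeyedById(Entry... entries) {
        Map<String, Integer> positions = new HashMap<>();
        for (int i = 0; i < entries.length; i++) {
            positions.put(entries[i].getId(), i);
        }
        return positions.size();
    }

    public static void main(String[] args) {
        Entry first = new Entry("article", Map.of("title", "Lucene search"));
        Entry duplicate = new Entry("article", Map.of("title", "Lucene search"));
        System.out.println(distinctWhenKeyedByEntry(first, duplicate)); // prints 1
        System.out.println(distinctWhenKeyedById(first, duplicate));    // prints 2
    }
}
```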
diff --git a/external-libraries.md b/external-libraries.md
@@ -342,6 +342,13 @@ URL:     https://github.com/tdebatty/java-string-similarity
 License: MIT
 ```
 
+```yaml
+Id:      io.github.adr:e-adr
+Project: EmbeddedArchitecturalDecisionRecords
+URL:     https://github.com/adr/e-adr/
+License: EPL-2.0
+```
+
 ```yaml
 Id:      io.github.java-diff-utils:java-diff-utils
 Project: java-diff-utils

diff --git a/src/jmh/java/org/jabref/benchmarks/Benchmarks.java b/src/jmh/java/org/jabref/benchmarks/Benchmarks.java
@@ -3,10 +3,8 @@
 import java.io.IOException;
 import java.io.StringReader;
 import java.io.StringWriter;
-import java.util.EnumSet;
 import java.util.List;
 import java.util.Random;
-import java.util.stream.Collectors;
 
 import org.jabref.logic.bibtex.FieldPreferences;
 import org.jabref.logic.citationkeypattern.CitationKeyPatternPreferences;
@@ -18,7 +16,6 @@
 import org.jabref.logic.importer.fileformat.BibtexParser;
 import org.jabref.logic.layout.format.HTMLChars;
 import org.jabref.logic.layout.format.LatexToUnicodeFormatter;
-import org.jabref.logic.search.SearchQuery;
 import org.jabref.logic.util.OS;
 import org.jabref.model.database.BibDatabase;
 import org.jabref.model.database.BibDatabaseContext;
@@ -32,7 +29,6 @@
 import org.jabref.model.groups.KeywordGroup;
 import org.jabref.model.groups.WordKeywordGroup;
 import org.jabref.model.metadata.MetaData;
-import org.jabref.model.search.rules.SearchRules.SearchFlags;
 import org.jabref.preferences.JabRefPreferences;
 import org.jabref.preferences.PreferencesService;
 
@@ -105,16 +101,14 @@ public String write() throws Exception {
 
     @Benchmark
     public List<BibEntry> search() {
-        // FIXME: Reuse SearchWorker here
-        SearchQuery searchQuery = new SearchQuery("Journal Title 500", EnumSet.noneOf(SearchFlags.class));
-        return database.getEntries().stream().filter(searchQuery::isMatch).collect(Collectors.toList());
+        // TODO: Create Benchmark for LuceneSearch
+        return List.of();
     }
 
     @Benchmark
-    public List<BibEntry> parallelSearch() {
-        // FIXME: Reuse SearchWorker here
-        SearchQuery searchQuery = new SearchQuery("Journal Title 500", EnumSet.noneOf(SearchFlags.class));
-        return database.getEntries().parallelStream().filter(searchQuery::isMatch).collect(Collectors.toList());
+    public List<BibEntry> index() {
+        // TODO: Create Benchmark for LuceneIndexer
+        return List.of();
     }
 
     @Benchmark

diff --git a/src/main/java/module-info.java b/src/main/java/module-info.java
@@ -153,14 +153,15 @@
     requires langchain4j.open.ai;
     // endregion
 
-    // region: fulltext search
-    requires org.apache.lucene.core;
-    // In case the version is updated, please also adapt SearchFieldConstants#VERSION to the newly used version
+    // region: Lucene
+    /**
+     * In case the version is updated, please also adapt {@link org.jabref.model.search.SearchFieldConstants#VERSION} to the newly used version.
+     */
     uses org.apache.lucene.codecs.lucene99.Lucene99Codec;
-    requires org.apache.lucene.queryparser;
-    uses org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
     requires org.apache.lucene.analysis.common;
+    requires org.apache.lucene.core;
     requires org.apache.lucene.highlighter;
+    requires org.apache.lucene.queryparser;
     // endregion
 
     requires net.harawata.appdirs;
@@ -176,6 +177,7 @@
     // region: other libraries (alphabetically)
     requires cuid;
     requires dd.plist;
+    requires io.github.adr;
     // required by okhttp and some AI library
     requires kotlin.stdlib;
     requires mslinks;

diff --git a/src/main/java/org/jabref/cli/ArgumentProcessor.java b/src/main/java/org/jabref/cli/ArgumentProcessor.java
@@ -15,6 +15,7 @@
 
 import org.jabref.gui.externalfiles.AutoSetFileLinksUtil;
 import org.jabref.gui.undo.NamedCompound;
+import org.jabref.gui.util.CurrentThreadTaskExecutor;
 import org.jabref.logic.JabRefException;
 import org.jabref.logic.UiCommand;
 import org.jabref.logic.bibtex.FieldPreferences;
@@ -42,7 +43,6 @@
 import org.jabref.logic.l10n.Localization;
 import org.jabref.logic.net.URLDownload;
 import org.jabref.logic.search.DatabaseSearcher;
-import org.jabref.logic.search.SearchQuery;
 import org.jabref.logic.shared.prefs.SharedDatabasePreferences;
 import org.jabref.logic.util.OS;
 import org.jabref.logic.util.io.FileUtil;
@@ -52,6 +52,7 @@
 import org.jabref.model.database.BibDatabaseMode;
 import org.jabref.model.entry.BibEntry;
 import org.jabref.model.entry.BibEntryTypesManager;
+import org.jabref.model.search.SearchQuery;
 import org.jabref.model.strings.StringUtil;
 import org.jabref.model.util.DummyFileUpdateMonitor;
 import org.jabref.model.util.FileUpdateMonitor;
@@ -454,11 +455,17 @@ private boolean exportMatches(List<ParserResult> loaded) {
         // $ stands for a blank
         ParserResult pr = loaded.getLast();
         BibDatabaseContext databaseContext = pr.getDatabaseContext();
-        BibDatabase dataBase = pr.getDatabase();
 
         SearchPreferences searchPreferences = preferencesService.getSearchPreferences();
         SearchQuery query = new SearchQuery(searchTerm, searchPreferences.getSearchFlags());
-        List<BibEntry> matches = new DatabaseSearcher(query, dataBase).getMatches();
+
+        List<BibEntry> matches;
+        try {
+            matches = new DatabaseSearcher(query, databaseContext, new CurrentThreadTaskExecutor(), preferencesService.getFilePreferences()).getMatches();
+        } catch (IOException e) {
+            LOGGER.error("Error occurred when searching", e);
+            return false;
+        }
 
         // export matches
         if (!matches.isEmpty()) {