Lucene search #11542

Merged
merged 331 commits into from
Sep 5, 2024

LoayGhreeb
Collaborator

@LoayGhreeb LoayGhreeb commented Jul 28, 2024

Lucene search backend

Follow-up to: #8963, #8206, #11326.

Indexing

  • All bib fields and linked files (PDFs) are indexed separately in two different indexes.
  • Indexing operations (startup, adding, removing, and updating) are performed in the background. Each index operates in a separate thread.
  • Startup:
    • Bib Fields Index: Recalculated for the entire library on startup.
    • Linked Files Index: Only the differences between the current library and previously indexed files are recalculated. Files that have been updated on disk will be reindexed.
  • Storage:
    • Bib Fields Index: Stored in memory rather than on disk, because BibEntry#hashCode is not persistent across sessions.
    • Linked Files Index: Stored in the directory provided by AppDirs.
  • Each bib entry is stored as a Lucene document. Each bib field is tokenized and added to the document. Additionally, all bib fields (except "Groups", due to #7996) are collected into a single field, "any", which is used as the default field during searches.
  • For both the Bib Fields Index and Linked Files Index, the IndexWriter is opened only once at startup and remains open during the runtime.
  • During shutdown, all changes are committed to the index, and the index is optimized by merging all segments into a single segment.
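
The startup rule for the Linked Files Index (only reindex what differs from the previously indexed state) can be sketched in plain Java. This is a minimal illustration, not the code from this PR: the class `LinkedFilesReindexPlan`, its fields, and the use of last-modified timestamps keyed by path are all assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not a JabRef class. */
final class LinkedFilesReindexPlan {
    final Set<String> toAdd = new TreeSet<>();
    final Set<String> toUpdate = new TreeSet<>();
    final Set<String> toRemove = new TreeSet<>();

    /**
     * Compares the previously indexed state with the files linked now.
     *
     * @param indexed path -> last-modified time recorded when the file was indexed
     * @param current path -> last-modified time on disk for currently linked files
     */
    static LinkedFilesReindexPlan diff(Map<String, Long> indexed, Map<String, Long> current) {
        LinkedFilesReindexPlan plan = new LinkedFilesReindexPlan();
        for (Map.Entry<String, Long> file : current.entrySet()) {
            Long indexedAt = indexed.get(file.getKey());
            if (indexedAt == null) {
                // Linked now, but never indexed before.
                plan.toAdd.add(file.getKey());
            } else if (file.getValue() > indexedAt) {
                // Changed on disk since it was indexed.
                plan.toUpdate.add(file.getKey());
            }
        }
        for (String path : indexed.keySet()) {
            if (!current.containsKey(path)) {
                // Indexed earlier, but no longer linked in the library.
                plan.toRemove.add(path);
            }
        }
        return plan;
    }
}
```

Unchanged files appear in none of the three sets, which is why startup stays cheap for large libraries with many PDFs.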

Analyzing

  • Bib Fields Index: A custom analyzer is used to support "contains" searches, LaTeX, and Unicode characters. The analyzer includes:
  • The same analyzer used for indexing bib fields is also used for searching, but without the EdgeNGramTokenFilter.
  • The Linked Files Index uses the EnglishAnalyzer for both indexing and searching. This analyzer converts all strings to lowercase, removes English stop words, and uses PorterStemFilter, which reduces words to their base or root form, known as the "stem". For example, terms like "computer", "compute", "computations", and "computerized" will all be reduced to the stem "comput", to get more relevant search results.
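
To illustrate why edge n-grams are applied at index time but not at search time, here is a plain-Java sketch of the idea. It does not use Lucene and is not the analyzer from this PR; the class `EdgeNGramSketch`, the tokenization rule, and the `minGram`/`maxGram` parameters are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of edge n-gram indexing; not Lucene's implementation. */
final class EdgeNGramSketch {

    /** Index time: normalize, then emit every prefix of length minGram..maxGram per token. */
    static List<String> indexTerms(String text, int minGram, int maxGram) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenize(text)) {
            int longest = Math.min(maxGram, token.length());
            for (int len = minGram; len <= longest; len++) {
                terms.add(token.substring(0, len));
            }
        }
        return terms;
    }

    /** Search time: same normalization, but no n-grams are generated. */
    static List<String> queryTerms(String text) {
        return tokenize(text);
    }

    /** A field matches if every query term equals one of the indexed terms. */
    static boolean matches(String fieldValue, String query, int minGram, int maxGram) {
        return indexTerms(fieldValue, minGram, maxGram).containsAll(queryTerms(query));
    }

    /** Lowercases and splits on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Because the prefixes are already stored in the index, a partial query term such as "luc" can be looked up as an exact term. Expanding the query into n-grams as well would only add spurious matches, which is the reason for dropping the filter on the search side.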

Searching

Search Results

  • Added a new column displaying the search score.
  • The file icon in the table now displays a magnifying glass when search results are found within a linked file.
  • Fixed issues with highlighting search results in the Preview Viewer and the Source Tab.

Search Groups free-search expression

Caution

Before proceeding, create a backup of your library. This is an alpha release, and the search syntax has changed.

  • If the library contains Search Groups, users will be prompted to migrate the search syntax to the new syntax.
  • Search Group matches are now cached, and switching between search groups has been improved.

Removed

  • Case-sensitive and exact match searches are no longer supported.
  • Removed the case-sensitive and regular expression toggles from the search bar and the search groups dialog.
  • Removed the description of search strings.
  • Removed all search rules.

Screenshots

  • Search groups migration. image
  • Full-text search results. image
  • Highlighting search results. image

Closes: #8857
Closes: #11374
Closes: #11378
Closes: #8626
Closes: #11595
Closes: #11246
Closes: #7996
Closes: #8067
Closes: #1975

Mandatory checks

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

btut and others added 30 commits November 6, 2022 19:28
Co-authored-by: Christoph <siedlerkiller@gmail.com>
# Conflicts:
#	CHANGELOG.md
#	src/main/java/org/jabref/gui/LibraryTab.java
#	src/main/java/org/jabref/gui/StateManager.java
#	src/main/java/org/jabref/gui/openoffice/OpenOfficePanel.java
# Conflicts:
#	CHANGELOG.md
#	src/jmh/java/org/jabref/benchmarks/Benchmarks.java
#	src/main/java/org/jabref/gui/JabRefFrame.java
#	src/main/java/org/jabref/gui/LibraryTab.java
#	src/main/java/org/jabref/gui/entryeditor/EntryEditor.java
#	src/main/java/org/jabref/gui/entryeditor/fileannotationtab/FulltextSearchResultsTab.java
#	src/main/java/org/jabref/gui/externalfiles/ExternalFilesEntryLinker.java
#	src/main/java/org/jabref/gui/externalfiles/ImportHandler.java
#	src/main/java/org/jabref/gui/groups/GroupDialogView.java
#	src/main/java/org/jabref/gui/groups/GroupsPreferences.java
#	src/main/java/org/jabref/gui/maintable/MainTable.java
#	src/main/java/org/jabref/gui/maintable/MainTableColumnFactory.java
#	src/main/java/org/jabref/gui/maintable/columns/FileColumn.java
#	src/main/java/org/jabref/gui/preview/PreviewPanel.java
#	src/main/java/org/jabref/gui/search/GlobalSearchBar.java
#	src/main/java/org/jabref/gui/search/RebuildFulltextSearchIndexAction.java
#	src/main/java/org/jabref/gui/search/SearchResultsTableDataModel.java
#	src/main/java/org/jabref/logic/pdf/search/indexing/IndexingTaskManager.java
#	src/main/java/org/jabref/model/database/BibDatabaseContext.java
#	src/main/java/org/jabref/model/pdf/search/SearchFieldConstants.java
#	src/main/java/org/jabref/model/search/rules/SearchRules.java
#	src/main/java/org/jabref/preferences/JabRefPreferences.java
#	src/main/java/org/jabref/preferences/SearchPreferences.java
#	src/test/java/org/jabref/gui/groups/GroupTreeViewModelTest.java
# Conflicts:
#	CHANGELOG.md
#	src/main/java/org/jabref/model/search/rules/ContainsBasedSearchRule.java
#	src/main/java/org/jabref/model/search/rules/GrammarBasedSearchRule.java
#	src/main/java/org/jabref/model/search/rules/RegexBasedSearchRule.java
# Conflicts:
#	CHANGELOG.md
#	src/main/java/org/jabref/model/search/rules/ContainsBasedSearchRule.java
#	src/main/java/org/jabref/model/search/rules/GrammarBasedSearchRule.java
#	src/main/java/org/jabref/model/search/rules/RegexBasedSearchRule.java
LoayGhreeb and others added 2 commits September 4, 2024 09:24
When closing JabRef, only ask users to wait for the linked files indexer to finish. The bib fields indexer is recalculated on startup, so it doesn't need to be completed before shutdown.
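
A minimal sketch of that shutdown policy, assuming each index runs on its own single-threaded executor. The class `BackgroundIndexersSketch` and its method names are hypothetical and only illustrate the idea; they are not taken from the PR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration: one background thread per index. */
final class BackgroundIndexersSketch {
    private final ExecutorService bibFieldsIndexer = Executors.newSingleThreadExecutor();
    private final ExecutorService linkedFilesIndexer = Executors.newSingleThreadExecutor();

    void submitBibFieldsTask(Runnable task) {
        bibFieldsIndexer.execute(task);
    }

    void submitLinkedFilesTask(Runnable task) {
        linkedFilesIndexer.execute(task);
    }

    /**
     * Shuts both indexers down.
     *
     * @return true if the linked files indexer finished its queue within the timeout
     */
    boolean close(long timeout, TimeUnit unit) {
        // The bib fields index is rebuilt on startup, so pending work can be dropped.
        bibFieldsIndexer.shutdownNow();
        // The linked files index is persisted: accept no new tasks, but let queued ones run.
        linkedFilesIndexer.shutdown();
        try {
            return linkedFilesIndexer.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A caller could show a "please wait" dialog while `close` is blocking and only exit once it returns.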
@LoayGhreeb LoayGhreeb marked this pull request as ready for review September 4, 2024 18:23
Contributor

github-actions bot commented Sep 4, 2024

The build for this PR is no longer available. Please visit https://builds.jabref.org/main/ for the latest build.

@@ -58,6 +59,7 @@ public class StateManager {
private final OptionalObjectProperty<LibraryTab> activeTab = OptionalObjectProperty.empty();
private final ObservableList<BibEntry> selectedEntries = FXCollections.observableArrayList();
private final ObservableMap<String, ObservableList<GroupTreeNode>> selectedGroups = FXCollections.observableHashMap();
private final ObservableMap<String, LuceneManager> luceneManagers = FXCollections.observableHashMap();
Member

For later discussion: This is maybe a hint that LuceneManager should be called differently. Maybe "SearchIndex" or something similar? Usually you have one manager for the entire app, not a manager for each file...

for (String resultTextHtml : searchResult.getAnnotationsResultStringsHtml()) {
content.getChildren().addAll(TooltipTextUtil.createTextsFromHtml(resultTextHtml.replace("</b> <b>", " ")));
content.getChildren().addAll(new Text(System.lineSeparator()), lineSeparator(0.8), createPageLink(linkedFile, searchResult.getPageNumber()));
stateManager.activeSearchQuery(SearchType.NORMAL_SEARCH).get().ifPresent(searchQuery -> {
Member

Looks very weird to put that into a lambda expression. Also makes stack traces way longer. Maybe a simple if check is enough or better - fail fast strategy (if (!activeSearchQuery.isPresent()) { return; } )
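
To make the suggestion concrete, here is a small side-by-side sketch of the two styles. It is only an illustration: `FailFastSketch` is a hypothetical class, and a plain `Optional<String>` stands in for the active search query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration of the review suggestion; not code from the PR. */
final class FailFastSketch {

    /** Lambda style: the whole body is nested inside ifPresent. */
    static List<String> withLambda(Optional<String> activeQuery, List<String> hits) {
        List<String> lines = new ArrayList<>();
        activeQuery.ifPresent(query -> {
            for (String hit : hits) {
                lines.add(query + ": " + hit);
            }
        });
        return lines;
    }

    /** Fail-fast style: return early, then continue without extra nesting. */
    static List<String> withGuard(Optional<String> activeQuery, List<String> hits) {
        List<String> lines = new ArrayList<>();
        if (!activeQuery.isPresent()) {
            return lines;
        }
        String query = activeQuery.get();
        for (String hit : hits) {
            lines.add(query + ": " + hit);
        }
        return lines;
    }
}
```

Both methods return the same result; the guard version keeps the main logic at the top indentation level and adds no lambda frame to stack traces.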

import com.tobiasdiez.easybind.EasyBind;
import com.tobiasdiez.easybind.Subscription;
import org.jspecify.annotations.Nullable;
Member

I don't think that @Siedlerchr will like this...

Member

Yeah... by default all is nullable in java

private static final Logger LOGGER = LoggerFactory.getLogger(SearchGroup.class);
private final GroupSearchQuery query;

@ADR(38)
Member

@calixtus calixtus Sep 5, 2024

In DuplicateSearch it uses the comment format:
image

Member

In Java, annotations are limited. We tried to use e-adr wherever possible. Where not, we used Java comments. #research.

@koppor koppor added this pull request to the merge queue Sep 5, 2024
Merged via the queue into main with commit 6af91b9 Sep 5, 2024
31 of 32 checks passed
@koppor koppor deleted the LuceneSearch branch September 5, 2024 20:07
@calixtus
Member

calixtus commented Sep 5, 2024

🎉

@subhramit
Collaborator

Congratulations, Loay!

This was referenced Sep 7, 2024