Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

automatic historycache entry removal corrupts incremental history indexing #1701

Closed
vladak opened this issue Jul 27, 2017 · 2 comments
Closed
Assignees

Comments

@vladak
Copy link
Member

vladak commented Jul 27, 2017

The changes for #767 in cset d3dbeac badly interact with incrementally adding history cache entries. I only noticed in indexer logs that each changed file triggers call to both removeFile() and addFile() in IndexDatabase. After said changes, removeFile() deletes history cache for given file.

Thus, the indexing now works like this for a changed file:

  1. entries are added to history cache incrementally
  2. indexDown() calls removeFile() that removes the history cache for the file
  3. addFile() creates the history cache for the file from scratch in populateDocument()

Obviously, this increases indexing time dramatically for repositories that support per-directory history.

@vladak vladak added the bug label Jul 27, 2017
@vladak vladak self-assigned this Jul 27, 2017
@vladak
Copy link
Member Author

vladak commented Jul 27, 2017

Relevant code in indexDown() looks like this:

871          for (File file : files) {

...

890                      if (uidIter != null) {
891                          String uid = Util.path2uid(path,
892                              DateTools.timeToString(file.lastModified(),
893                              DateTools.Resolution.MILLISECOND)); // construct uid for doc
894                          BytesRef buid = new BytesRef(uid);
895                          while (uidIter != null && uidIter.term() != null
896                                  && uidIter.term().compareTo(emptyBR) !=0
897                                  && uidIter.term().compareTo(buid) < 0) {
898                              removeFile();
899                              BytesRef next = uidIter.next();
900                              if (next==null) {uidIter=null;}
901                          }
902  
903                          if (uidIter != null && uidIter.term() != null
904                                  && uidIter.term().bytesEquals(buid)) {
905                              BytesRef next = uidIter.next(); // keep matching docs
906                              if (next==null) {uidIter=null;}
907                              continue;
908                          }
909                      }
910                      try {
911                          addFile(file, path);
912                      } catch (Exception e) {
913                          LOGGER.log(Level.WARNING,
914                                  "Failed to add file " + file.getAbsolutePath(),
915                                  e);
916                      }

Obviously, changed file is first removed (from the index, xref, history cache) and then re-added.

I believe that the cycle on lines 895-897 makes sure that directory traversal is in sync with term traversal. E.g. if I have a repository with files A.txt and B.txt and A.txt is removed, indexDown() will cycle through the list of files in given directory and will hit file B.txt first (because there is no A.txt file anymore), in this cycle it will perform the "uid" check for A.txt first because it comes first in the ordering (that is based on file paths with appended time value as visible in path2uid()), removes its data and then will continue to the term corresponding to B.txt, removes its data but then it will go to line 911 to re-add its data again.

There is also call to removeFile() in IndexDatabase#update() after indexDown() returns which I believe clears data for the trailing files.

Hopefully there is a way how to distinguish the case of data refresh and file removal.

@vladak vladak added the stopper label Jul 27, 2017
@vladak
Copy link
Member Author

vladak commented Jul 27, 2017

Curiously, IndexChangedListener.java contains fileUpdate() however that is not used anywhere.

vladak added a commit to vladak/OpenGrok that referenced this issue Jul 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant