Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use soft-update to maintain document history #29458

Merged
merged 7 commits into from
Apr 12, 2018

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Apr 10, 2018

Today we can use the soft-update feature from Lucene to maintain a history of document. This change simply replaces hard-update in the Engine by soft-update methods. Stale operations, delete, and no-ops will be handled in subsequent changes. This change is just a cut-over from hard-update to soft-update, no new functionality has been introduced.

Today we can the soft-update feature from Lucene to maintain a history
of document. This change simply replaces hard-update in the Engine by
soft-update methods. Stale operations, delete, and no-ops will be
handled in subsequent changes.
@dnhatn dnhatn added >enhancement :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels Apr 10, 2018
@dnhatn dnhatn requested review from s1monw and bleskes April 10, 2018 20:17
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@dnhatn dnhatn requested review from jasontedor and martijnvg April 10, 2018 20:19
@dnhatn
Copy link
Member Author

dnhatn commented Apr 10, 2018

test this please.

@dnhatn
Copy link
Member Author

dnhatn commented Apr 11, 2018

please test this.

@s1monw
Copy link
Contributor

s1monw commented Apr 11, 2018

@elasticmachine test this please

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, I left some comments

/**
* Returns a numeric docvalues which can be used to soft-delete documents.
*/
public static NumericDocValuesField getSoftDeleteDVMarker() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this newSoftDeletesField?

* Returns <code>true</code> if soft-delete is enabled.
*/
public boolean isSoftDeleteEnabled() {
return getIndexVersionCreated().onOrAfter(Version.V_7_0_0_alpha1) && softDeleteEnabled;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe resolve this in the ctor?

if (softDeleteEnabled) {
iwc.setSoftDeletesField(Lucene.SOFT_DELETE_FIELD);
// TODO: soft-delete retention policy
mergePolicy = new SoftDeletesRetentionMergePolicy(Lucene.SOFT_DELETE_FIELD, () -> new MatchAllDocsQuery(), mergePolicy);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need this here since it's match all. Lets just omit it until we use it

@@ -144,6 +145,11 @@ public boolean useCompoundFile(SegmentInfos segments, SegmentCommitInfo newSegme
return delegate.useCompoundFile(segments, newSegment, writer);
}

@Override
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of doing this I think we should extend MergePolicyWrapper in master and 6.x and then this is handled upstream

@@ -504,7 +507,7 @@ public void testSegmentsWithMergeFlag() throws Exception {

if (flush) {
// we should have had just 1 merge, so last generation should be exact
assertEquals(gen2, store.readLastCommittedSegmentsInfo().getLastGeneration());
assertEquals(gen2 + 1, store.readLastCommittedSegmentsInfo().getLastGeneration());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm why does this need to change?

@Override
public long softUpdateDocument(Term term, Iterable<? extends IndexableField> doc, Field... softDeletes) throws IOException {
maybeThrowFailure();
return super.softUpdateDocument(term, doc, softDeletes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

public LeafReader wrap(LeafReader in) {
return new FilterLeafReader(in) {
@Override
public CacheHelper getCoreCacheHelper() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can return in.getCoreCacheHelper();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just move this entire thing to a util method in Lucene.java then we can reuse it in other places WDYT? @martijnvg you need this too somewhere right?

Copy link
Member Author

@dnhatn dnhatn Apr 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the test and this wrapper in this PR, and will add it back to Lucene.java later. I prefer to keep this change as a cut-over.

@@ -1070,10 +1074,18 @@ private boolean assertDocDoesNotExist(final Index index, final boolean allowDele
}

private void updateDocs(final Term uid, final List<ParseContext.Document> docs, final IndexWriter indexWriter) throws IOException {
if (docs.size() > 1) {
indexWriter.updateDocuments(uid, docs);
if (softDeleteEnabled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the delete case we should also use indexWriter.updateDocValue(uid, Lucene.getSoftDeleteDVMarker()); instead of IndexWriter#deleteDocuments and then override this in the test to maybe throw an exception.

@dnhatn
Copy link
Member Author

dnhatn commented Apr 11, 2018

@s1monw I have addressed your comments. Can you please take a look? Thank you.

Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -1073,9 +1070,9 @@ private boolean assertDocDoesNotExist(final Index index, final boolean allowDele
private void updateDocs(final Term uid, final List<ParseContext.Document> docs, final IndexWriter indexWriter) throws IOException {
if (softDeleteEnabled) {
if (docs.size() > 1) {
indexWriter.softUpdateDocuments(uid, docs, Lucene.getSoftDeleteDVMarker());
indexWriter.softUpdateDocuments(uid, docs, Lucene.newSoftDeleteField());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can cache this in here in a constant so we don't need to re-create a new one every time.

@dnhatn
Copy link
Member Author

dnhatn commented Apr 12, 2018

Thanks @s1monw for reviewing.

@dnhatn dnhatn merged commit 0dd61fc into elastic:ccr Apr 12, 2018
@dnhatn dnhatn deleted the ccr-soft-updates branch April 12, 2018 16:30
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Apr 28, 2018
Since elastic#29458, we use a searcher to calculate the number of documents for
a commit stats. Sadly, that approach is flawed. The searcher might no
longer point to the last commit if it's refreshed.

This commit uses SoftDeletesDirectoryReaderWrapper to exclude the
soft-deleted documents from numDocs in a SegmentInfos. I chose to modify
the method Luence#getNumDocs so that we can read a store metadata
snapshot correctly without opening an engine.

Relates elastic#29458
dnhatn added a commit that referenced this pull request May 10, 2018
Today we can use the soft-update feature from Lucene to maintain a
history of document. This change simply replaces hard-update in the
Engine by soft-update methods. Stale operations, delete, and no-ops will
be handled in subsequent changes. This change is just a cut-over from
hard-update to soft-update, no new functionality has been introduced.

Relates #29530
dnhatn added a commit that referenced this pull request May 15, 2018
Since #29458, we use a searcher to calculate the number of documents for
a commit stats. Sadly, that approach is flawed. The searcher might no
longer point to the last commit if it's refreshed. As synced-flush
requires an exact numDocs to work correctly, we have to exclude all
soft-deleted docs.

This commit makes synced-flush stop using CommitStats but read an exact
numDocs directly from an index commit.

Relates #29458
Relates #29530
dnhatn added a commit that referenced this pull request May 30, 2018
Since #29458, we use a searcher to calculate the number of documents for
a commit stats. Sadly, that approach is flawed. The searcher might no
longer point to the last commit if it's refreshed. As synced-flush
requires an exact numDocs to work correctly, we have to exclude all
soft-deleted docs.

This commit makes synced-flush stop using CommitStats but read an exact
numDocs directly from an index commit.

Relates #29458
Relates #29530
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants