Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TinkerPop tests are failing in master branch (closed after 6 hours) #3747

Closed
porunov opened this issue Apr 26, 2023 · 8 comments · Fixed by #3760
Closed

TinkerPop tests are failing in master branch (closed after 6 hours) #3747

porunov opened this issue Apr 26, 2023 · 8 comments · Fixed by #3760

Comments

@porunov
Copy link
Member

porunov commented Apr 26, 2023

On February 22 we had the last time TinkerPop tests CI executed without any problem (scheduled weekly job).
Starting from March 1 the scheduled job failed and it's failing from that time each week. There were no any successful runs after that.
It either means: We made a bug somewhere from February 22 to March 1 or something changed in the CI execution (less powerful machines or something else.
Previously TP tests job were taking 4:45 hours in average. From March 1 it takes over 6 hours, but GitHub actions kill TP tests after 6 hours.

Scheduled execution jobs can be found here: https://github.com/JanusGraph/janusgraph/actions?query=branch%3Amaster+event%3Aschedule

The problem was noticed in the next PRs: #3418 , #3743

Link to the branch with 2 canceled after 6 hours recent jobs: https://github.com/porunov/janusgraph/actions/runs/4809004095/jobs/8559714661

  • Version: master
  • Storage Backend: cassandra (byteordered)
  • Mixed Index Backend: none
  • Expected Behavior: TP tests should pass
  • Current Behavior: CI for TP tests takes 6 hours after that it's force closed by GitHub
  • Steps to Reproduce:

For Cassandra 3:

mvn clean install --projects janusgraph-cql -DskipTests=true --batch-mode --also-make -Pjava-11
mvn verify --projects janusgraph-cql -Pjava-11 -Pcassandra3-byteordered -Dtest.skip.tp=false -DskipTests=true

For Cassandra 4:

mvn clean install --projects janusgraph-cql -DskipTests=true --batch-mode --also-make -Pjava-11
mvn verify --projects janusgraph-cql -Pjava-11 -Pcassandra4-byteordered -Dtest.skip.tp=false -DskipTests=true

Stack Trace (if you have one)

Error: The operation was canceled.
@porunov
Copy link
Member Author

porunov commented Apr 26, 2023

The last commit for which TP tests supped to passed on February 22 was made on February 17: d30e789 (started TP tests to verify TP tests still pass for that commit here (PASSED 5 times). Branch is here)

Here is a list of commits which we had from February 22 to March 1 (the issue is most likely in one of those commits).
Commits sorted by date (starting from oldest commits and ending the most recent commits):

94b3da5 (started TP tests here (PASSED 5 times). Branch is here)

20e14a7 (started TP tests here . Branch is here) (This commit was earlier than the latest successful TP tests execution commit. Thus skipping this commit)

69e01db (skipping checks for this commit due to doc only updates) (This commit was earlier than the latest successful TP tests execution commit. Thus skipping this commit)

9684494 (Only configuration option description doc was changed here. Thus, skipping this commit) (This commit was earlier than the latest successful TP tests execution commit. Thus skipping this commit)

6725fd3 (started TP tests here . Branch is here) (This commit was earlier than the latest successful TP tests execution commit. Thus skipping this commit)

3429215 (started TP tests here (PASSED 5 times) . Branch is here) (This commit was earlier than the latest successful TP tests execution commit. Thus skipping this commit)

245e79f (This commit was earlier than the latest successful TP tests execution commit. Thus skipping this commit) (started TP tests here . Branch is here) (PASSED 1 time)

fd3b92c (This commit was earlier than the latest successful TP tests execution commit. Thus skipping this commit) (started TP tests here . Branch is here) (PASSED 3 times)

a44647a (This commit was earlier than the latest successful TP tests execution commit. Thus skipping this commit) (started TP tests here . Branch is here) (FAILED 1 time)

ba672b9 (release candidate for which we already executed TP tests successfully. Thus, skipping this commit checks) started TP tests here (FAILED). Branch is here)

49c5f80 (snapshot return commit. Skipping this commit)

cdb6a93 (doc only change. Skipping this commit)

19a8637 (started TP tests here (1st run PASSED, 2nd run FAILED). Branch is here)

76f9a6b (started TP tests here (FAILED 2 times). Branch is here)

671672b (started TP tests here (FAILED 2 times). Branch is here)

fcc8b33 (started TP tests here (FAILED 1st run, PASSED 2nd run). Branch is here)

4475507 (started TP tests here (FAILED 2 times). Branch is here)

2c70076 (doc only change. Skipping this commit)

@porunov porunov changed the title TinkerPop tests are failing for in master branch (closed after 6 hours) TinkerPop tests are failing in master branch (closed after 6 hours) Apr 26, 2023
@porunov
Copy link
Member Author

porunov commented Apr 27, 2023

It’s strange but based on the above research the trouble commit was 76f9a6b
Before that commit TP tests are passing (checked 3 different commits before that one).
After that commit TP tests are failing due to timeout issues (4 commits after that one is tested).

Update: tests for the previous commit 19a8637 Passed the first time, but failed the second time.

porunov added a commit to porunov/janusgraph that referenced this issue Apr 27, 2023
Fixes JanusGraph#3747

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Apr 27, 2023
Fixes JanusGraph#3747

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
@porunov
Copy link
Member Author

porunov commented Apr 27, 2023

It’s strange but based on the above research the trouble commit was 76f9a6b

Before that commit TP tests are passing (checked 3 different commits before that one). After that commit TP tests are failing due to timeout issues (4 commits after that one is tested).

I tried to revert the failing commit here #3749 but looks like it didn't help.
Moreover @li-boxuan commented here that TP passed for some commits in #3418 .

It could be that above investigation of a wrong commit could be influenced by instability of GitHub Actions. That said, it could be that we just have now multiple commits in master branch which prevent TinkerPop tests from finishing in time. It is strange that before (and including) commit 19a8637 TP tests are passing for multiple tries (as seen from the above investigation), but right after that commit TinkerPop are constantly failing both for scheduled TP tests on master and in the above investigation.

Thus, I think the problem is one of the below problems:

  1. GitHub Actions are suddenly become unstable (maybe they reduced resources for GitHub Actions or did something else). (Due to ongoing experiments I think this point is quite unlikely because today I already got successful pass of one of the previous commits).
  2. The problem is in this commit 76f9a6b and some other commits after that one.

It's a little bit tricky to find our the root cause of this issue and why TinkerPop tests are not passing for cassandra3-byteordered and cassandra4-byteordered in time anymore.

@JanusGraph/committers in case anyone can investigate this issue as well or have some thoughts about how to fix TP tests for GitHub Actions that would be great.
The problem is that we can't make new JanusGraph releases without TP tests passing. Finding a troubling commit is more challenging due to many commits after the last successful commit with TP tests and also due to the big time to test each commit.

@porunov porunov added this to the Release v1.0.0 milestone Apr 27, 2023
porunov added a commit to porunov/janusgraph that referenced this issue Apr 27, 2023
…sts] [tp-tests]

Investigation of JanusGraph#3747

Related 3749#

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Apr 28, 2023
…sts] [tp-tests]

Investigation of JanusGraph#3747

Related 3749#

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
@porunov
Copy link
Member Author

porunov commented Apr 28, 2023

More tests:

Finished (Failed):

  1. Revert checker-qual to 3.30.0. (FAILED). PR: Revert 76f9a6b6a01881d39764b0af17b800162d7e17fb to allow TP tests CI to pass [cql-tests] [tp-tests] #3749
  2. Revert all dependencies but leave other changes. (FAILED). PR: Revert all bumped versions prior last successful TP tests run [cql-tests] [tp-tests] #3752
  3. Reset branch to cc0843e and revert all dependencies. Branch https://github.com/porunov/janusgraph/tree/bugfix/tp-tests-2 . Commit porunov@bcfd80f . (FAILED).
  4. Reset branch to d6b883b and revert all dependencies. Branch https://github.com/porunov/janusgraph/tree/bugfix/tp-tests-3 . Commit porunov@6e77a94 . (FAILED).
  5. Reset branch to 671672b and revert all dependencies. Branch https://github.com/porunov/janusgraph/tree/bugfix/tp-tests-5 . Commit porunov@95fccf7 . (FAILED).
  6. Reset branch to 671672b and revert all dependencies and ensure org.jacoco.ant is aligned to 0.8.8 . Branch https://github.com/porunov/janusgraph/tree/bugfix/tp-tests-6 . Commit porunov@9dd6b56 . (FAILED).
  7. Use latest commit 5159135 and update org.jacoco.ant to 0.8.10 . Branch https://github.com/porunov/janusgraph/tree/bugfix/inline-org-jacoco-ant . Commit porunov@6a78580 . (FAILED).
  8. Use latest commit 5159135 and revert org.jacoco.ant to 0.8.8 . Branch https://github.com/porunov/janusgraph/tree/bugfix/revert-jacoco-0.8.8 . Commit porunov@5870d2a . (FAILED).
  9. Reset branch to 292f4cb and use jacoco 0.8.8 and checker-qual 3.30.0. Branch https://github.com/porunov/janusgraph/tree/bugfix/tp-tests-10 . Commit porunov@755723f . (FAILED).
  10. Reset branch to the latest master commit 5159135 and use jacoco 0.8.8 and checker-qual 3.30.0. Branch https://github.com/porunov/janusgraph/tree/bugfix/tp-tests-11 . Commit porunov@8d5e066 . (FAILED).

Finished (Passed):

  1. Reset branch to 292f4cb and revert all dependencies. Branch https://github.com/porunov/janusgraph/tree/bugfix/tp-tests-4 . Commit porunov@0af891a . (PASSED)

Wasn't executed?

porunov added a commit to porunov/janusgraph that referenced this issue Apr 28, 2023
…sts] [tp-tests]

Investigation of JanusGraph#3747

Related 3749#

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Apr 28, 2023
…sts] [tp-tests]

Investigation of JanusGraph#3747

Related 3749#

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Apr 28, 2023
…sts] [tp-tests]

Investigation of JanusGraph#3747

Related 3749#

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue Apr 28, 2023
Investigation of JanusGraph#3747

Related JanusGraph#3749

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
@porunov
Copy link
Member Author

porunov commented May 1, 2023

Executing more tests shows that the last rc-2 release had the same problem, but we were simply lucky that those tests passed. Re-running tp-tests for that commit - failed.
Multiple re-runs of previous commits showed that most likely the problem is in one of the followings commits:

245e79f (started TP tests here . Branch is here)

fd3b92c (started TP tests here . Branch is here) (PASSED 1 time)

a44647a (started TP tests here . Branch is here) (FAILED 1 time)

Started tests executions for these commits to see which one may fail.
Notice, that sometimes job succeeds, thus if the job seceded only once it doesn't mean it will have the same behavior on a second run. We should test commit multiple times to see if it fails or not. If it fails at least once - there is a problem with the code under that commit (most likely).

@porunov
Copy link
Member Author

porunov commented May 1, 2023

Update:

With the ongoing investigation the scope of the trouble commits is reduced to this commit only: a44647a
It also explains why only TinkerPop tests for cassandra backend were failing.

I noticed that both byteordered and murmur partitioning was failing. Thus, it's probably not related to particular partitioning. That said, I also noticed that the tests hang only for Java 11 and not for Java 8 (could be just coincidence).
Looks like with disabling custom CQL executor pool, some tinkerpop tests just hang sometimes.
That said, the behavior isn't consistent. Sometimes tests are finished in 4:10 hours and sometimes they hang for more than 6 hours and then killed by GitHub Actions.

The strange thing is that those TinkerPop CQL tests are either finishing in the next window [from 4 hours to 4:50 hours] or don't finish at all. From many times over these days when I was running tests I have never seen a time when those tests finish in 5:10, 5:30, 5:55 (i.e. they always finish under 5 hours). This brings me to thoughts that potentially it could be unrelated to CQL requests throughput but to something else (just a theory).

In case anyone knows why would some TP tests hang for over 6 hours for CQL storage backend when CQL executor pool is disabled it would be really great if you could share your thoughts here because currently I'm not getting it.
From my perspective the internal JanusGraph executor pool to handle CQL requests was just unnecessary overhead which would artificially limit the amount of parallel CQL requests (even so we have a separate parameter for that storage.cql.max-requests-per-connection). Thus, the usage of CQL executor pool is unnecessary and it adds just additional overhead in term of threads.
It looks like I'm wrong about CQL executor pool and maybe it should be enabled, but I don't understand why.

The merged PR of that commit is #3607

cc @li-boxuan @cdegroc because we had some related discussions here #3170

porunov added a commit to porunov/janusgraph that referenced this issue May 4, 2023
…tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).

Fixes JanusGraph#2406
Fixes JanusGraph#3747
Fixes JanusGraph#3759

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue May 4, 2023
…tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).

Fixes JanusGraph#2406
Fixes JanusGraph#3747
Fixes JanusGraph#3759

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
@porunov
Copy link
Member Author

porunov commented May 5, 2023

Started TP tests for master branch with enabled CQL executor pool: porunov@9d346b8 (Branch https://github.com/porunov/janusgraph/tree/bugfix/tp-tests-30) (PASSED 2 times)

TP tests for #3760 but with executor service used for mutation queries (don't understand why would we use it for those queries because mutation queries don't have any deserialization work to be done as slice queries. Nevertheless, we do have 2 implementations for some reason (with and without executor service used).
porunov@f4d4b0e (Branch https://github.com/porunov/janusgraph/tree/feature/cql-multi-key-slices-mutate-many-with-executor) (PASSED 1 time)

TP tests in #3760 before ensuring PermanentBackendException is thrown on InterruptedException (FAILED 2 times)
TP tests in #3760 after ensuring PermanentBackendException is thrown on InterruptedException (PASSED 3 times)

porunov added a commit to porunov/janusgraph that referenced this issue May 5, 2023
…tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).

Fixes JanusGraph#2406
Fixes JanusGraph#3747
Fixes JanusGraph#3759
Related to JanusGraph#3170

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue May 5, 2023
…tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).

Fixes JanusGraph#2406
Fixes JanusGraph#3747
Fixes JanusGraph#3759
Related to JanusGraph#3170

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue May 6, 2023
…tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).

Fixes JanusGraph#2406
Fixes JanusGraph#3747
Fixes JanusGraph#3759
Related to JanusGraph#3170

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue May 6, 2023
…tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).

Fixes JanusGraph#2406
Fixes JanusGraph#3747
Fixes JanusGraph#3759
Related to JanusGraph#3170

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue May 6, 2023
…tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).

Fixes JanusGraph#2406
Fixes JanusGraph#3747
Fixes JanusGraph#3759
Related to JanusGraph#3170

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
@porunov
Copy link
Member Author

porunov commented May 6, 2023

Started TP tests for master branch with enabled CQL executor pool: porunov@9d346b8 (Branch https://github.com/porunov/janusgraph/tree/bugfix/tp-tests-30) (PASSED 2 times)

TP tests for #3760 but with executor service used for mutation queries (don't understand why would we use it for those queries because mutation queries don't have any deserialization work to be done as slice queries. Nevertheless, we do have 2 implementations for some reason (with and without executor service used). porunov@f4d4b0e (Branch https://github.com/porunov/janusgraph/tree/feature/cql-multi-key-slices-mutate-many-with-executor) (PASSED 1 time)

TP tests in #3760 before ensuring PermanentBackendException is thrown on InterruptedException (FAILED 2 times) TP tests in #3760 after ensuring PermanentBackendException is thrown on InterruptedException (PASSED 3 times)

As the above research showed. The long running TinkerPop tests were not related to the ExecutorService usage or performance. Instead, it was related to the bug in CQL storage backend which were propagating InterruptedException as TemporaryBackendException instead of PermanentBackendException.
To cancel the query execution we should throw PermanentBackendException on InterruptedException.
This will be fixed as part of #3760

porunov added a commit to porunov/janusgraph that referenced this issue May 6, 2023
…tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).

Fixes JanusGraph#2406
Fixes JanusGraph#3747
Fixes JanusGraph#3759
Related to JanusGraph#3170

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue May 6, 2023
…tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).

Fixes JanusGraph#2406
Fixes JanusGraph#3747
Fixes JanusGraph#3759
Related to JanusGraph#3170

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue May 18, 2023
…tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).

Fixes JanusGraph#2406
Fixes JanusGraph#3747
Fixes JanusGraph#3759
Related to JanusGraph#3170

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit to porunov/janusgraph that referenced this issue May 26, 2023
…tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).

Fixes JanusGraph#2406
Fixes JanusGraph#3747
Fixes JanusGraph#3759
Related to JanusGraph#3170

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
porunov added a commit that referenced this issue May 29, 2023
…tests] [tp-tests] (#3760)

* Implement multi-key slice async queries for CQL storage backend [cql-tests] [tp-tests]

- Adds multiQuery support to CQL storage backend.
- Ensure `storage.parallel-backend-ops` thread pool is created only for storage backends which don't support multiQuery (multi-key slice operations).
- Change purpose of `storage.cql.executor-service` to be used for results deserialization jobs only (not for IO operations).
- Add possibility to specify back pressure class.

Fixes #2406
Fixes #3747
Fixes #3759
Related to #3170

Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment