
fix(discovery): observed containers should be checked with persisted nodes #423

Merged
25 commits merged into cryostatio:main May 3, 2024

Conversation

tthvo
Member

@tthvo tthvo commented Apr 26, 2024

Welcome to Cryostat3! 👋

Before contributing, make sure you have:

  • Read the contributing guidelines
  • Linked a relevant issue which this PR resolves
  • Linked any other relevant issues, PRs, or documentation, if any
  • Resolved all conflicts, if any
  • Rebased your branch PR on top of the latest upstream main branch
  • Attached at least one of the following labels to the PR: [chore, ci, docs, feat, fix, test]
  • Signed all commits using a GPG signature

To recreate commits with a GPG signature: git fetch upstream && git rebase --force --gpg-sign upstream/main


Related to #420

Description of the change:

  • Update container event handler to check with persisted targets.

Motivation for the change:

See #420 (comment). This fixes the problem where, when Cryostat is restarted, observed targets cause a duplicate key violation, and removed targets remain stale in the database.

How to manually test

mvn clean package -DskipTests=true
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -O

@tthvo tthvo force-pushed the target-diff branch 2 times, most recently from b44e7d0 to 5c94bb8 Compare April 26, 2024 21:47
@tthvo tthvo marked this pull request as ready for review April 26, 2024 21:50
@tthvo tthvo requested a review from a team April 26, 2024 21:51
@andrewazores
Member

Does the smoketest.bash -V switch help for testing the behaviour on restart? This prevents volumes from being destroyed on shutdown, so it should mean that on the next restart, the Postgres container reads from the existing database from the previous run. This somewhat emulates the case where the Cryostat container alone would restart within a k8s Pod.

@tthvo
Member Author

tthvo commented Apr 26, 2024

Discovery works fine with podman while checks are done against the database. Though, I have run into these two issues that I need some advice on:

  • There are some conflicts in connectURL between podman and JDP during the compose smoketest. For example, jfr-datasource; I think it happens on main too.
  • The approach I used here (same as KubeAPIDiscovery) is to run a single transaction for all discovered targets. If one fails on startup (i.e. duplicate key), the entire transaction is aborted and the realm remains empty. Is this desirable? Or should there be one transaction per target (though what about performance)?

@tthvo
Member Author

tthvo commented Apr 26, 2024

Does the smoketest.bash -V switch help for testing the behaviour on restart? This prevents volumes from being destroyed on shutdown, so it should mean that on the next restart, the Postgres container reads from the existing database from the previous run. This somewhat emulates the case where the Cryostat container alone would restart within a k8s Pod.

Ahh I was about to ask! Thanks I will try that!

@tthvo
Member Author

tthvo commented Apr 26, 2024

Ahh, I am seeing an odd violation during startup:

ERROR: Thread Thread[executor-thread-2,5,main] threw an uncaught exception
compose-cryostat-1        | org.hibernate.exception.ConstraintViolationException: could not execute statement [ERROR: duplicate key value violates unique constraint "target_connecturl_key"
compose-cryostat-1        |   Detail: Key (connecturl)=(\xaced00057372000c6a6176612e6e65742e555249ac01782e439e49ab0300014c0006737472696e677400124c6a6176612f6c616e672f537472696e673b7870740037736572766963653a6a6d783a726d693a2f2f2f6a6e64692f726d693a2f2f6372796f737461742d706f643a35313432332f6a6d78726d6978) already exists.] [insert into Target (alias,annotations,connectUrl,discoveryNode,jvmId,labels,id) values (?,?,?,?,?,?,?)]
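As an aside, the bytea value in that log is a Java-serialized java.net.URI, so the conflicting connectUrl can be recovered directly. A standalone sketch (plain JDK, hypothetical class name, hex copied from the log above without the \x prefix):

```java
import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.net.URI;

public class DecodeConnectUrl {
    // Hex payload copied from the constraint-violation log (the \x prefix stripped).
    static final String HEX =
            "aced00057372000c6a6176612e6e65742e555249ac01782e439e49ab0300014c"
          + "0006737472696e677400124c6a6176612f6c616e672f537472696e673b787074"
          + "0037736572766963653a6a6d783a726d693a2f2f2f6a6e64692f726d693a2f2f"
          + "6372796f737461742d706f643a35313432332f6a6d78726d6978";

    static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // The column stores the serialized URI object, so deserializing yields the URL.
    static URI decode(String hex) throws Exception {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(hexToBytes(hex)))) {
            return (URI) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // prints: service:jmx:rmi:///jndi/rmi://cryostat-pod:51423/jmxrmi
        System.out.println(decode(HEX));
    }
}
```

This shows the duplicate key is the JMX service URL of a container in the cryostat-pod, i.e. the same target being inserted twice.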

@tthvo
Member Author

tthvo commented Apr 26, 2024

Does the smoketest.bash -V switch help for testing the behaviour on restart? This prevents volumes from being destroyed on shutdown, so it should mean that on the next restart, the Postgres container reads from the existing database from the previous run. This somewhat emulates the case where the Cryostat container alone would restart within a k8s Pod.

Ahh I was about to ask! Thanks I will try that!

Looks good to me!! It was very helpful! I tried these steps to test this PR:

# Initial setup
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV

podman pod create --replace --name cryostat-pod

podman run \
        --name jmxquarkus \
        --pod cryostat-pod \
        --label io.cryostat.discovery="true" \
        --label io.cryostat.jmxPort="51423" \
        --env QUARKUS_HTTP_PORT=10012 \
        --rm -d quay.io/roberttoyonaga/jmx:jmxquarkus@sha256:b067f29faa91312d20d43c55d194a2e076de7d0d094da3d43ee7d2b2b5a6f100

podman run \
        --name vertx-fib-demo-0 \
        --env HTTP_PORT=8079 \
        --env JMX_PORT=9089 \
        --env START_DELAY=60 \
        --pod cryostat-pod \
        --label io.cryostat.discovery="true" \
        --label io.cryostat.jmxHost="vertx-fib-demo-0" \
        --label io.cryostat.jmxPort="9089" \
        --rm -d quay.io/andrewazores/vertx-fib-demo:0.13.1


# Kill containers and restart
# The discovery tree should remain the same
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV

# Now, kill the containers again
# Remove the pod
podman pod rm cryostat-pod -f

# Start again
# The pod and its containers should be pruned
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV

Ahh, I am seeing an odd violation during startup:

ERROR: Thread Thread[executor-thread-2,5,main] threw an uncaught exception
compose-cryostat-1        | org.hibernate.exception.ConstraintViolationException: could not execute statement [ERROR: duplicate key value violates unique constraint "target_connecturl_key"
compose-cryostat-1        |   Detail: Key (connecturl)=(\xaced00057372000c6a6176612e6e65742e555249ac01782e439e49ab0300014c0006737472696e677400124c6a6176612f6c616e672f537472696e673b7870740037736572766963653a6a6d783a726d693a2f2f2f6a6e64692f726d693a2f2f6372796f737461742d706f643a35313432332f6a6d78726d6978) already exists.] [insert into Target (alias,annotations,connectUrl,discoveryNode,jvmId,labels,id) values (?,?,?,?,?,?,?)]

Okay, this should be fixed now thanks to the synchronized keyword. Not sure what happened there, but it was something related to threading during the transaction. Otherwise, I just have the above questions: #423 (comment)

@andrewazores
Member

There are some conflicts in connectURL between podman and JDP during the compose smoketest. For example, jfr-datasource; I think it happens on main too.

Yea, that was semi-intentional back when I first put all that together just so I could test that Cryostat/the database would properly reject the definitions due to the duplicate URLs. I think by now that's already well-proven, so we can change it to do something different - give the containers different hostnames to be referenced by, or just turn off one of the discovery mechanisms in the smoketest by default so they don't overlap/collide.

The approach I used here (same as KubeAPIDiscovery) is to run a single transaction for all discovered targets. If one fails on startup (i.e. duplicate key), the entire transaction is aborted and the realm remains empty. Is this desirable? Or should there be one transaction per target (though what about performance)?

It's a good question. I think from a user perspective it would be much more preferable that it works a bit more slowly but is able to discover as many of the targets as possible, vs being very fast but if it runs into any errors then leaving me with no discovered targets at all.

Anyway, if we do determine there are performance issues due to database accesses and the associated latency, Quarkus/Hibernate/JPA has a lot of powerful caching capabilities:

https://quarkus.io/guides/hibernate-orm#caching

so it seems like some judicious application of the @Cacheable annotation would probably help solve that, if we need it to be solved.
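To make that tradeoff concrete, here is a minimal plain-Java sketch (illustrative names, not Cryostat's actual persistence code) where each target is persisted in its own unit of work, so one duplicate-key failure no longer empties the whole realm:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PerTargetPersist {
    // Hypothetical stand-in for the real persistence layer: throws on a
    // duplicate connectUrl, like the unique-constraint violation in the logs.
    static final Set<URI> database = new HashSet<>();

    static void persist(URI connectUrl) {
        if (!database.add(connectUrl)) {
            throw new IllegalStateException("duplicate key: " + connectUrl);
        }
    }

    // One "transaction" per target: a single failure is caught and logged,
    // and every other discovered target still lands in the database.
    static List<URI> persistEach(List<URI> discovered) {
        List<URI> failed = new ArrayList<>();
        for (URI target : discovered) {
            try {
                persist(target);
            } catch (RuntimeException e) {
                failed.add(target); // record and continue instead of rolling back everything
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<URI> discovered = List.of(
                URI.create("service:jmx:rmi:///jndi/rmi://jfr-datasource:9091/jmxrmi"),
                URI.create("service:jmx:rmi:///jndi/rmi://jfr-datasource:9091/jmxrmi"), // duplicate
                URI.create("service:jmx:rmi:///jndi/rmi://vertx-fib-demo-0:9089/jmxrmi"));
        List<URI> failed = persistEach(discovered);
        System.out.println("persisted=" + database.size() + " failed=" + failed.size());
    }
}
```

The cost is one round trip per target instead of one per batch, which is where the caching mentioned above could help if it ever becomes a bottleneck.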

@andrewazores
Member

andrewazores commented Apr 27, 2024

Okay, this should be fixed now thanks to the synchronized keyword. Not sure what happened there, but it was something related to threading during the transaction. Otherwise, I just have the above questions: #423 (comment)

This means that if there are many worker threads invoked by the informer, all those threads will get blocked waiting for the lock to perform their database transaction, which will be relatively slow because they're waiting for database I/O.

I suggest another approach: have the callback (invoked by worker threads?) just add the event to a blocking queue for processing, and have a dedicated thread that takes items from the queue as they become available and processes them in a transaction. This way there is only one thread that is spending a lot of its time waiting for database I/O before moving on to process the next event and waiting for the database again, and the other threads are free to do other work.

It would be a similar pattern to how I wrote the WebSocket MessagingServer:

"event" (notification to be sent): https://github.com/cryostatio/cryostat3/blob/12b97bf7746a72a965dc168c23c4fae69ec19953/src/main/java/io/cryostat/ws/MessagingServer.java#L144

task queue processor: https://github.com/cryostatio/cryostat3/blob/12b97bf7746a72a965dc168c23c4fae69ec19953/src/main/java/io/cryostat/ws/MessagingServer.java#L95
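A standalone sketch of that queue-plus-dedicated-thread pattern, using only java.util.concurrent (hypothetical event type; the real handler would run a database transaction):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class EventQueueProcessor {
    // Hypothetical event type standing in for a container discovery event.
    record ContainerEvent(String container, boolean found) {}

    private final BlockingQueue<ContainerEvent> queue = new LinkedBlockingQueue<>();
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    final List<String> processed = Collections.synchronizedList(new ArrayList<>());

    void start() {
        worker.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    ContainerEvent evt = queue.take(); // blocks until an event arrives
                    handleInTransaction(evt);          // only this thread waits on DB I/O
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
    }

    // Callback invoked by informer worker threads: just enqueue and return.
    void onContainerEvent(ContainerEvent evt) {
        queue.add(evt);
    }

    // Stand-in for the real transactional add/remove against the database.
    private void handleInTransaction(ContainerEvent evt) {
        processed.add((evt.found() ? "FOUND " : "LOST ") + evt.container());
    }

    void stop() throws InterruptedException {
        worker.shutdownNow();
        worker.awaitTermination(2, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        EventQueueProcessor p = new EventQueueProcessor();
        p.start();
        p.onContainerEvent(new ContainerEvent("vertx-fib-demo-0", true));
        p.onContainerEvent(new ContainerEvent("jmxquarkus", true));
        Thread.sleep(500); // give the worker time to drain the queue
        p.stop();
        System.out.println(p.processed);
    }
}
```

The informer callbacks return immediately after enqueueing, and events are still processed in arrival order because a single worker drains the queue.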


It is probably also possible to achieve this using the EventBus alone without an explicit ExecutorService, with something like this pseudocode:

class ContainerDiscovery {

  onStart() {
    informer.listen(evt -> bus.publish(ContainerDiscovery.class.getName(), evt));
  }

  @ConsumeEvent(blocking = true, ordered = true) // maybe doesn't need to be ordered if it's done this way?
  @Transactional
  void onEvent(ContainerEvent evt) {
    ... // process the event, add/remove from database
  }
}

@tthvo tthvo marked this pull request as draft April 30, 2024 05:59
@tthvo
Member Author

tthvo commented Apr 30, 2024

Sounds good to me! I just need to look into it a little deeper, so I am marking this as a draft for now. I suppose we also want to apply these changes to the KubeAPI discovery?

@andrewazores
Member

I think that makes sense.

@tthvo tthvo force-pushed the target-diff branch 4 times, most recently from 1f58984 to 6d7594e Compare May 2, 2024 00:59
@tthvo
Member Author

tthvo commented May 2, 2024

/build_test


github-actions bot commented May 2, 2024

Workflow started at 5/2/2024, 12:57:34 AM. View Actions Run.


github-actions bot commented May 2, 2024

No OpenAPI schema changes detected.


github-actions bot commented May 2, 2024

No GraphQL schema changes detected.


github-actions bot commented May 2, 2024

CI build and push: All tests pass ✅ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8918817387

@tthvo
Member Author

tthvo commented May 2, 2024

@ConsumeEvent(blocking = true, ordered = true) // maybe doesn't need to be ordered if it's done this way?
@Transactional
void onEvent(ContainerEvent evt) {
  ... // process the event, add/remove from database
}

Unfortunately, for this case, we still need ordered = true, as it ensures the pod node is persisted only once. Otherwise, multiple discovery FOUND event messages will persist the pod node multiple times, which causes the same issue as #420. I guess its performance is now similar to the synchronized keyword?

i.e. if you don't specify value = "pkg.Class" in the @ConsumeEvent, it defaults to the class name. Then you can just publish to the class name.

Phew, it took me a while to realize that the @ConsumeEvent must be specified on the subclass, so I had to do a workaround to allow each subclass to specify its own address. This also helps with separating the podman and docker discovery events.
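That workaround can be sketched in plain Java (a map standing in for the Vert.x EventBus; all names here are illustrative, not the PR's actual classes): each concrete discovery supplies its own address, so podman and docker events stay on separate channels:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class DiscoveryAddresses {
    // Minimal stand-in for the Vert.x EventBus: one consumer per address.
    static final Map<String, Consumer<String>> bus = new HashMap<>();

    // Base discovery publishes to an address chosen by the subclass, mirroring
    // the idea that the effective @ConsumeEvent address lives on the concrete class.
    abstract static class ContainerDiscovery {
        abstract String eventAddress();
        final List<String> handled = new ArrayList<>();

        void register() {
            bus.put(eventAddress(), this::onEvent); // subclass-specific address
        }

        void publish(String evt) {
            bus.get(eventAddress()).accept(evt);
        }

        void onEvent(String evt) {
            handled.add(eventAddress() + ": " + evt);
        }
    }

    static class PodmanDiscovery extends ContainerDiscovery {
        String eventAddress() { return "PodmanDiscovery"; }
    }

    static class DockerDiscovery extends ContainerDiscovery {
        String eventAddress() { return "DockerDiscovery"; }
    }

    public static void main(String[] args) {
        PodmanDiscovery podman = new PodmanDiscovery();
        DockerDiscovery docker = new DockerDiscovery();
        podman.register();
        docker.register();
        podman.publish("FOUND jmxquarkus");
        docker.publish("FOUND vertx-fib-demo-0");
        System.out.println(podman.handled);
        System.out.println(docker.handled);
    }
}
```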

I think this should be working now? The changes are such that:

  • Cryostat will add and prune targets accordingly based on persisted and observed targets.

  • Handler that persists/prunes the target is invoked in a separate transaction, responding to a message received from the event bus. Thus, satisfying the requirement here:

    It's a good question. I think from a user perspective it would be much more preferable that it works a bit more slowly but is able to discover as many of the targets as possible, vs being very fast but if it runs into any errors then leaving me with no discovered targets at all.

But it seems that when using bash smoketest.bash -OV, there are initially no LOST messages emitted from ContainerDiscovery, yet the target list appears empty at first. Not sure what other mechanisms are pruning targets at startup...

@tthvo
Member Author

tthvo commented May 2, 2024

Another thing I noticed with this eventBus approach is that the same message might be emitted multiple times. I think it's because eventBus.publish is called within a transaction, so if the current one is retried, the same message will be sent again. Though, this PR also handles that :D
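One simple guard against that re-delivery, sketched standalone (illustrative names, not the PR's actual code): make the handler idempotent by checking persisted state before inserting, so a republished FOUND message becomes a no-op:

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentHandler {
    static final Set<String> persisted = new HashSet<>();
    static int writes = 0;

    // A FOUND message delivered twice (e.g. republished by a retried
    // transaction) only results in one write, because the handler checks
    // persisted state first.
    static void onFound(String connectUrl) {
        if (persisted.add(connectUrl)) {
            writes++; // stand-in for the actual database insert
        }
    }

    public static void main(String[] args) {
        onFound("service:jmx:rmi:///jndi/rmi://jmxquarkus:51423/jmxrmi");
        onFound("service:jmx:rmi:///jndi/rmi://jmxquarkus:51423/jmxrmi"); // duplicate delivery
        System.out.println("writes=" + writes);
    }
}
```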

@tthvo
Member Author

tthvo commented May 3, 2024

/build_test


github-actions bot commented May 3, 2024

Workflow started at 5/3/2024, 12:26:09 PM. View Actions Run.


github-actions bot commented May 3, 2024

No OpenAPI schema changes detected.


github-actions bot commented May 3, 2024

No GraphQL schema changes detected.


github-actions bot commented May 3, 2024

CI build and push: All tests pass ✅ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8941969954

@andrewazores
Member

/build_test


github-actions bot commented May 3, 2024

Workflow started at 5/3/2024, 2:37:27 PM. View Actions Run.


github-actions bot commented May 3, 2024

No GraphQL schema changes detected.


github-actions bot commented May 3, 2024

No OpenAPI schema changes detected.


github-actions bot commented May 3, 2024

CI build and push: All tests pass ✅ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8943413897

@andrewazores andrewazores merged commit 626604e into cryostatio:main May 3, 2024
8 checks passed
@tthvo tthvo deleted the target-diff branch May 3, 2024 18:48