fix(discovery): observed containers should be checked with persisted nodes #423

tthvo · 2024-04-26T19:00:16Z

Welcome to Cryostat3! 👋

Before contributing, make sure you have:

Read the contributing guidelines
Linked a relevant issue which this PR resolves
Linked any other relevant issues, PR's, or documentation, if any
Resolved all conflicts, if any
Rebased your branch PR on top of the latest upstream main branch
Attached at least one of the following labels to the PR: [chore, ci, docs, feat, fix, test]
Signed all commits using a GPG signature

To recreate commits with GPG signature git fetch upstream && git rebase --force --gpg-sign upstream/main

Related to #420

Description of the change:

Update container event handler to check with persisted targets.

Motivation for the change:

See #420 (comment). This fixes the problem when cryostat is restarted, observed targets will cause duplicate key violation and removed targets remain in database (stale).

How to manually test

mvn clean package -DskipTests=true
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -O

andrewazores · 2024-04-26T21:56:48Z

Does the smoketest.bash -V switch help for testing the behaviour on restart? This prevents volumes from being destroyed on shutdown, so it should mean that on the next restart, the Postgres container reads from the existing database from the previous run. This somewhat emulates the case where the Cryostat container alone would restart within a k8s Pod.

tthvo · 2024-04-26T21:57:12Z

~~Discovery works fine with podman while checks are done against database~~ Though, I have run into these 2 issues that I need some advice on:

There are some conflicts in connectURL between podman and JDP during compose smoketest. For example, jfr-datasource, I think it happens on main too.
The approach I used here (same as KubeAPIDiscovery) is to run a single transaction all discovered targets. If one fails on-startup (i.e. duplicate key), the entire transaction will be aborted and the realm will remain empty. Is this desirable? Or should one transaction per target (wbu performance)?

tthvo · 2024-04-26T21:57:31Z

Does the smoketest.bash -V switch help for testing the behaviour on restart? This prevents volumes from being destroyed on shutdown, so it should mean that on the next restart, the Postgres container reads from the existing database from the previous run. This somewhat emulates the case where the Cryostat container alone would restart within a k8s Pod.

Ahh I was about to ask! Thanks I will try that!

tthvo · 2024-04-26T22:37:02Z

Ahh, I am seeing an odd violation during startup:

ERROR: Thread Thread[executor-thread-2,5,main] threw an uncaught exception
compose-cryostat-1        | org.hibernate.exception.ConstraintViolationException: could not execute statement [ERROR: duplicate key value violates unique constraint "target_connecturl_key"
compose-cryostat-1        |   Detail: Key (connecturl)=(\xaced00057372000c6a6176612e6e65742e555249ac01782e439e49ab0300014c0006737472696e677400124c6a6176612f6c616e672f537472696e673b7870740037736572766963653a6a6d783a726d693a2f2f2f6a6e64692f726d693a2f2f6372796f737461742d706f643a35313432332f6a6d78726d6978) already exists.] [insert into Target (alias,annotations,connectUrl,discoveryNode,jvmId,labels,id) values (?,?,?,?,?,?,?)]

tthvo · 2024-04-26T22:57:45Z

Does the smoketest.bash -V switch help for testing the behaviour on restart? This prevents volumes from being destroyed on shutdown, so it should mean that on the next restart, the Postgres container reads from the existing database from the previous run. This somewhat emulates the case where the Cryostat container alone would restart within a k8s Pod.

Ahh I was about to ask! Thanks I will try that!

Looks good to me!! It was very helpful! I tried with these steps to test this PR:

# Initial setup
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV

podman pod create --replace --name cryostat-pod

podman run \
        --name jmxquarkus \
        --pod cryostat-pod \
        --label io.cryostat.discovery="true" \
        --label io.cryostat.jmxPort="51423" \
        --env QUARKUS_HTTP_PORT=10012 \
        --rm -d quay.io/roberttoyonaga/jmx:jmxquarkus@sha256:b067f29faa91312d20d43c55d194a2e076de7d0d094da3d43ee7d2b2b5a6f100

podman run \
        --name vertx-fib-demo-0 \
        --env HTTP_PORT=8079 \
        --env JMX_PORT=9089 \
        --env START_DELAY=60 \
        --pod cryostat-pod \
        --label io.cryostat.discovery="true" \
        --label io.cryostat.jmxHost="vertx-fib-demo-0" \
        --label io.cryostat.jmxPort="9089" \
        --rm -d quay.io/andrewazores/vertx-fib-demo:0.13.1


# Kill containers and restart
# The discovery tree should remain the same
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV

# Now, kill the containers again
# Remove the pod
podman pod rm cryostat-pod -f

# Start again
# The pod and its containers should be pruned
CRYOSTAT_DISCOVERY_JDP_ENABLED=false bash smoketest.bash -OV

Ahh, I am seeing an odd violation during startup:

ERROR: Thread Thread[executor-thread-2,5,main] threw an uncaught exception
compose-cryostat-1        | org.hibernate.exception.ConstraintViolationException: could not execute statement [ERROR: duplicate key value violates unique constraint "target_connecturl_key"
compose-cryostat-1        |   Detail: Key (connecturl)=(\xaced00057372000c6a6176612e6e65742e555249ac01782e439e49ab0300014c0006737472696e677400124c6a6176612f6c616e672f537472696e673b7870740037736572766963653a6a6d783a726d693a2f2f2f6a6e64692f726d693a2f2f6372796f737461742d706f643a35313432332f6a6d78726d6978) already exists.] [insert into Target (alias,annotations,connectUrl,discoveryNode,jvmId,labels,id) values (?,?,?,?,?,?,?)]

Okay this should be fixed now thanks to the synchronized keyword. Not sure what happens there but something related to threading during transaction. Otherwise, I just have the above questions: #423 (comment)

andrewazores · 2024-04-27T01:04:11Z

There are some conflicts in connectURL between podman and JDP during compose smoketest. For example, jfr-datasource, I think it happens on main too.

Yea, that was semi-intentional back when I first put all that together just so I could test that Cryostat/the database would properly reject the definitions due to the duplicate URLs. I think by now that's already well-proven, so we can change it to do something different - give the containers different hostnames to be referenced by, or just turn off one of the discovery mechanisms in the smoketest by default so they don't overlap/collide.

The approach I used here (same as KubeAPIDiscovery) is to run a single transaction all discovered targets. If one fails on-startup (i.e. duplicate key), the entire transaction will be aborted and the realm will remain empty. Is this desirable? Or should one transaction per target (wbu performance)?

It's a good question. I think from a user perspective it would be much more preferable that it works a bit more slowly but is able to discover as many of the targets as possible, vs being very fast but if it runs into any errors then leaving me with no discovered targets at all.

Anyway, if we do determine there are performance issues due to database accesses and the associated latency, Quarkus/Hibernate/JPA has a lot of powerful caching capabilities:

https://quarkus.io/guides/hibernate-orm#caching

so it seems like some judicious application of the @Cacheable annotation would probably help solve that, if we need it to be solved.

andrewazores · 2024-04-27T01:14:33Z

Okay this should be fixed now thanks to the synchronized keyword. Not sure what happens there but something related to threading during transaction. Otherwise, I just have the above questions: #423 (comment)

This means that if there are many worker threads invoked by the informer, all those threads will get blocked waiting for the lock to perform their database transaction, which will be relatively slow because they're waiting for database I/O.

I suggest another approach: have the callback (invoked by worker threads?) just add the event to a blocking queue for processing, and have a dedicated thread that takes items from the queue as they become available and processes them in a transaction. This way there is only one thread that is spending a lot of its time waiting for database I/O before moving on to process the next event and waiting for the database again, and the other threads are free to do other work.

It would be a similar pattern to how I wrote the WebSocket MessagingServer:

"event" (notification to be sent): https://github.com/cryostatio/cryostat3/blob/12b97bf7746a72a965dc168c23c4fae69ec19953/src/main/java/io/cryostat/ws/MessagingServer.java#L144

task queue processor: https://github.com/cryostatio/cryostat3/blob/12b97bf7746a72a965dc168c23c4fae69ec19953/src/main/java/io/cryostat/ws/MessagingServer.java#L95

It is probably also possible to achieve this using the EventBus alone without an explicit ExecutorService, with something like this pseudocode:

class ContainerDiscovery {

onStart() {
  informer.listen(evt -> bus.publish(ContainerDiscovery.class.getName(), evt);
}

@ConsumeEvent(blocking = true, ordered = true) // maybe doesn't need to be ordered if it's done this way?
@Transactional
void onEvent(ContainerEvent evt) {
  ... // process the event, add/remove from database
}

tthvo · 2024-04-30T06:00:37Z

Sounds good to me! Just need to look into it a little deeper so I am marking this draft for now. I suppose we also want to apply these changes to kubeAPI also?

andrewazores · 2024-04-30T13:28:13Z

I think that makes sense.

src/main/java/io/cryostat/discovery/ContainerDiscovery.java

tthvo · 2024-05-02T04:57:14Z

/build_test

github-actions · 2024-05-02T04:57:35Z

Workflow started at 5/2/2024, 12:57:34 AM. View Actions Run.

github-actions · 2024-05-02T05:00:42Z

No OpenAPI schema changes detected.

github-actions · 2024-05-02T05:00:43Z

No GraphQL schema changes detected.

github-actions · 2024-05-02T05:03:44Z

CI build and push: All tests pass ✅ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8918817387

tthvo · 2024-05-02T05:10:12Z

@ConsumeEvent(blocking = true, ordered = true) // maybe doesn't need to be ordered if it's done this way?
@Transactional
void onEvent(ContainerEvent evt) {
  ... // process the event, add/remove from database
}```

Unfortunately, for this case, we still need the order = true as it ensures pod node is persisted once. Otherwise, multiple discovery FOUND event messages will persist the pod node multiple times. That causes the same issue as #420. I guess its performance now is similar to synchronized keyword?

ie. if you don't specify value = "pkg.Class" in the @ConsumeEvent, it defaults to the class name. Then you can just publish to the class name.

Phew, took me a while to realize that the @ConsumeEvent must be specified on the sub-class so I have to do a workaround to allow sub-class to specify its own address. This also helps with separating podman and docker discovery event.

I think this should be working now? The changes are such that:

Cryostat will add and prune targets accordingly based on persisted and observed targets.
Handler that persists/prunes the target is invoked in a separate transaction, responding to a message received from the event bus. Thus, satisfying the requirement here:

It's a good question. I think from a user perspective it would be much more preferable that it works a bit more slowly but is able to discover as many of the targets as possible, vs being very fast but if it runs into any errors then leaving me with no discovered targets at all.

But, seems like when using bash smoketest.bash -OV, there are initially no LOST message emitted from ContainerDiscovery, but the target list appears empty at first. Not first what other mechanisms are pruning targets at first...

tthvo · 2024-05-02T05:12:14Z

Another thing I notice with this eventBus approach is that the same message might be emitted multiple times. I think its because eventBus.publish is called within a transaction so if current one is retried, the same message will be sent again. Tho, this PR also handles that :D

in subclass bean

tthvo · 2024-05-03T16:25:45Z

/build_test

github-actions · 2024-05-03T16:26:09Z

Workflow started at 5/3/2024, 12:26:09 PM. View Actions Run.

github-actions · 2024-05-03T16:29:38Z

No OpenAPI schema changes detected.

github-actions · 2024-05-03T16:29:41Z

No GraphQL schema changes detected.

github-actions · 2024-05-03T16:32:43Z

CI build and push: All tests pass ✅ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8941969954

src/main/java/io/cryostat/discovery/ContainerDiscovery.java

andrewazores · 2024-05-03T18:37:03Z

/build_test

github-actions · 2024-05-03T18:37:28Z

Workflow started at 5/3/2024, 2:37:27 PM. View Actions Run.

github-actions · 2024-05-03T18:40:48Z

No GraphQL schema changes detected.

github-actions · 2024-05-03T18:40:50Z

No OpenAPI schema changes detected.

github-actions · 2024-05-03T18:44:23Z

CI build and push: All tests pass ✅ (JDK17)
https://github.com/cryostatio/cryostat3/actions/runs/8943413897

tthvo added fix safe-to-test labels Apr 26, 2024

tthvo force-pushed the target-diff branch 2 times, most recently from b44e7d0 to 5c94bb8 Compare April 26, 2024 21:47

tthvo marked this pull request as ready for review April 26, 2024 21:50

tthvo requested a review from a team April 26, 2024 21:51

andrewazores mentioned this pull request Apr 27, 2024

chore(ws): simplify messaging server #424

Merged

7 tasks

tthvo marked this pull request as draft April 30, 2024 05:59

tthvo force-pushed the target-diff branch from 5e3790e to d220dee Compare April 30, 2024 21:21

andrewazores reviewed May 1, 2024

View reviewed changes

src/main/java/io/cryostat/discovery/ContainerDiscovery.java Outdated Show resolved Hide resolved

tthvo force-pushed the target-diff branch 4 times, most recently from 1f58984 to 6d7594e Compare May 2, 2024 00:59

tthvo added 14 commits May 3, 2024 09:22

fix(discovery): only handle message if enabled

e439b0f

fixup(discovery): no need for explicit address

5ac2e11

fix(discovery): message should be sent now - consumeEvent must be used

71292ac

in subclass bean

fixup(discovery): separate addresses for podman and docker discovery

87e8d4a

chore(discovery): remove unused fieldsc

701c7a1

fixup(discovery): targets should now be correctly persisted

f44c9d8

fixup(discovery): targets should be correctly persisted

12c5db5

chore(discovery): define common method in parent class

b26f9f9

fix(discovery): should return persisted target if any

d1086b4

fix(discovery): handle constraint violation for duplicate keys

50d6ae4

chore(discovery): simplify methods

2b5f990

fix(compose): do not clean up database if volumes are kept

e6bdc75

fix(discovery): require new transactions for discovery handler

8ce2731

fix(discovery): ensure target has the latest snapshot

d8009ad

tthvo force-pushed the target-diff branch from 68ae2fb to d8009ad Compare May 3, 2024 16:22

andrewazores reviewed May 3, 2024

View reviewed changes

src/main/java/io/cryostat/discovery/ContainerDiscovery.java Outdated Show resolved Hide resolved

chore(discovery): combine startup event listeners

c52b759

andrewazores approved these changes May 3, 2024

View reviewed changes

andrewazores merged commit 626604e into cryostatio:main May 3, 2024
8 checks passed

tthvo deleted the target-diff branch May 3, 2024 18:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(discovery): observed containers should be checked with persisted nodes #423

fix(discovery): observed containers should be checked with persisted nodes #423

tthvo commented Apr 26, 2024 •

edited

Loading

andrewazores commented Apr 26, 2024

tthvo commented Apr 26, 2024 •

edited

Loading

tthvo commented Apr 26, 2024

tthvo commented Apr 26, 2024 •

edited

Loading

tthvo commented Apr 26, 2024

andrewazores commented Apr 27, 2024

andrewazores commented Apr 27, 2024 •

edited

Loading

tthvo commented Apr 30, 2024

andrewazores commented Apr 30, 2024

tthvo commented May 2, 2024

github-actions bot commented May 2, 2024

github-actions bot commented May 2, 2024

github-actions bot commented May 2, 2024

github-actions bot commented May 2, 2024

tthvo commented May 2, 2024 •

edited

Loading

tthvo commented May 2, 2024

tthvo commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

andrewazores commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

fix(discovery): observed containers should be checked with persisted nodes #423

fix(discovery): observed containers should be checked with persisted nodes #423

Conversation

tthvo commented Apr 26, 2024 • edited Loading

Welcome to Cryostat3! 👋

Before contributing, make sure you have:

Description of the change:

Motivation for the change:

How to manually test

andrewazores commented Apr 26, 2024

tthvo commented Apr 26, 2024 • edited Loading

tthvo commented Apr 26, 2024

tthvo commented Apr 26, 2024 • edited Loading

tthvo commented Apr 26, 2024

andrewazores commented Apr 27, 2024

andrewazores commented Apr 27, 2024 • edited Loading

tthvo commented Apr 30, 2024

andrewazores commented Apr 30, 2024

tthvo commented May 2, 2024

github-actions bot commented May 2, 2024

github-actions bot commented May 2, 2024

github-actions bot commented May 2, 2024

github-actions bot commented May 2, 2024

tthvo commented May 2, 2024 • edited Loading

tthvo commented May 2, 2024

tthvo commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

andrewazores commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

github-actions bot commented May 3, 2024

tthvo commented Apr 26, 2024 •

edited

Loading

tthvo commented Apr 26, 2024 •

edited

Loading

tthvo commented Apr 26, 2024 •

edited

Loading

andrewazores commented Apr 27, 2024 •

edited

Loading

tthvo commented May 2, 2024 •

edited

Loading