Add caching, tracing, logging, and batching to the plugin #413

Merged
merged 3 commits into from
Jun 15, 2024

Conversation

craigtmoore
Contributor

@craigtmoore craigtmoore commented Jun 11, 2024

Adds caching using the Caffeine cache, logging with java.util.logging, tracing using OpenTelemetry, and batched inserts into the database. Some builds on our Jenkins instance have over 29,000 test results, so we added these changes to optimize performance.

Fixes #412

  • Update DatabaseTestResultStorage:
    • add caseResultsCache
    • add packageResultsCache
    • update publish() method to use batch updates to improve performance
    • Add @WithSpan annotation to each method
    • add logger
  • Update DatabaseTestResultStorageTest:
    • add basic unit-test with a mocked database
    • move duplicate code segments to constants and methods
    • print tables with proper column widths (and truncate values that are too large)
  • Update pom.xml:
    • add OpenTelemetry dependencies (for tracing)
    • add Caffeine dependency (for caching)
    • add dependency on kotlin-stdlib-jdk8 to fix dependency resolution issues
    • tie the hpi goal to the compile phase
  • update docker-compose.yaml:
    • add commented out 'mysql' database config
    • add jaeger server
    • add jenkins server with otel agent embedded
  • docker-compose-with-zipkin.yaml: same as docker-compose.yaml, but using zipkin instead of jaeger
  • Add Dockerfile: creates an instance of the Jenkins docker image that includes the opentelemetry agent jar file, used to generate theweatherman/jenkins:lts-jdk17-otel image.
  • Add setup_jenkins.sh: bash script for installing the plugin and configuring Jenkins to store junit results in the database (works with the docker-compose.yaml)
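
The batch updates in publish() can be pictured with a minimal JDBC sketch. The table, columns, `CaseRow` type, and batch size below are assumptions made up for illustration, not the plugin's actual schema or code; only `addBatch()` and `executeBatch()` are standard JDBC calls.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class BatchInsertSketch {
    /** Hypothetical row type; the real plugin stores more columns per case result. */
    static final class CaseRow {
        final String suite;
        final String testName;
        final double durationSeconds;

        CaseRow(String suite, String testName, double durationSeconds) {
            this.suite = suite;
            this.testName = testName;
            this.durationSeconds = durationSeconds;
        }
    }

    // Hypothetical table and column names, chosen only for this illustration.
    static final String INSERT_SQL =
            "INSERT INTO case_results (suite, test_name, duration) VALUES (?, ?, ?)";

    /**
     * Queues each row with addBatch() and flushes every batchSize rows,
     * instead of executing one INSERT statement per row.
     * Returns the number of executeBatch() calls that were made.
     */
    static int insertInBatches(Connection connection, List<CaseRow> rows, int batchSize)
            throws SQLException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int flushes = 0;
        try (PreparedStatement statement = connection.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            for (CaseRow row : rows) {
                statement.setString(1, row.suite);
                statement.setString(2, row.testName);
                statement.setDouble(3, row.durationSeconds);
                statement.addBatch();
                if (++pending == batchSize) {
                    statement.executeBatch();
                    flushes++;
                    pending = 0;
                }
            }
            // Flush the final, partially filled batch.
            if (pending > 0) {
                statement.executeBatch();
                flushes++;
            }
        }
        return flushes;
    }
}
```

How much a batch saves depends on the JDBC driver and its settings, so the gain is worth measuring rather than assuming.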

Testing done

I reused the existing tests to verify that I've introduced no regressions, and I also added a new unit test, getCaseResults_mockDatabase(), which verifies that the data is loaded correctly.

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • [NA] Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

@timja timja added the enhancement New feature or request label Jun 11, 2024
@timja timja changed the title #412 Add caching, tracing, logging, and batching to the plugin Add caching, tracing, logging, and batching to the plugin Jun 11, 2024
Member

@timja timja left a comment

How extensively have you tested this? Have you deployed it to your instance with the large number of results?

My main concern is that the optimised database calls for reading only e.g. the fail or skipped count are now being done in Java, so the full set of results needs to be read into memory and filtered client side.

It's possible that most of the methods get called when this is used anyway, so it may not impact much, or possibly there could be more caches.
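
To make the trade-off concrete, here is a minimal sketch of what client-side filtering looks like. The `CaseResult` and `Status` types are hypothetical stand-ins, not the plugin's classes: every count scans the full in-memory list, where a dedicated database query would return a single number.

```java
import java.util.List;

final class ClientSideCountsSketch {
    enum Status { PASSED, FAILED, SKIPPED }

    /** Hypothetical, trimmed-down case result. */
    static final class CaseResult {
        final String name;
        final Status status;

        CaseResult(String name, Status status) {
            this.name = name;
            this.status = status;
        }
    }

    /** Counts matching cases by scanning the whole list held in memory. */
    static long count(List<CaseResult> allCases, Status wanted) {
        return allCases.stream().filter(c -> c.status == wanted).count();
    }
}
```

The list costs memory proportional to the number of test results, but once it is loaded, the fail, skip, and pass counts can all be answered without further queries.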

I assume you found a cache was needed because retrieval was slow? Do you have any info on this?

The batch changes make full sense

@craigtmoore
Contributor Author

As far as testing goes, the unit tests at our company are quite large, with over 29,000 test results in a single build. We're looking into using a plugin like this, but the performance was quite poor when we first tried it (which is why I looked at improving the performance). At first I deployed it using mysql instead of postgres because I could not connect the database to Jenkins.

Once I was able to deploy it and connect it to a dockerized mysql database, I ran the following pipeline:

pipeline {
    agent any
    stages {
        stage('Extract test results') {
            steps {
                sh 'unzip -o /tmp/archive.zip -d build'
            }
        }
    }
    post {
        always {
            junit 'build/**/TEST-*.xml'
        }
    }
}

Basically, I downloaded the zip file with the large number of test results from our production Jenkins server and then copied it to the Jenkins instance using:

docker cp archive.zip junit-sql-storage-plugin-jenkins-1:/tmp/archive.zip

Then I ran the pipeline, and it took a long time to store the test results (nearly 3 minutes using a mysql database). So I added OpenTelemetry to the plugin so I could see where the bottlenecks were. I decided to add batching to the publish method to reduce the time it takes to store the values. This certainly helped, but the other problem was that loading the test results took a really long time. I thought the queries might be too slow, but it turned out that the junit plugin was calling getAllPackageResults over 1,500 times (each call can take 100s of milliseconds), so I added caching, which reduced the processing time to 10s of microseconds per method call. I decided to take it one step further and cache the value returned by the retrieveCaseResults() method (which I renamed getCaseResults()). This also really helped, because the case results query takes ~20ms to run, and that takes a long time when there are over 29,000 test results. Also, that query was being called by many of the metadata methods, like:

  • getFailedTests
  • getSkippedTests
  • getPassedTests

By caching the caseResults, we avoided running the query repeatedly and reduced the processing time to 10s of microseconds.

So to answer your comment: yes, I did a lot of testing to verify that my changes improved the performance. The telemetry that I added was also very useful in figuring out the bottlenecks.
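
The effect of the caches described above can be sketched with a small memoizing wrapper. The PR uses Caffeine; this stand-in uses `ConcurrentHashMap` so that it runs with only the JDK, and unlike Caffeine it has no size bound or expiry. The class name, key format, and load counter are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class ResultCacheSketch<K, V> {
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final AtomicInteger loads = new AtomicInteger();

    /** The loader stands in for the slow database query. */
    ResultCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Runs the loader only on the first request for a key; later calls reuse the stored value. */
    V get(K key) {
        return entries.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    /** Drops a cached value, for example after new results are published for that key. */
    void invalidate(K key) {
        entries.remove(key);
    }

    /** Number of times the loader actually ran. */
    int loadCount() {
        return loads.get();
    }
}
```

With a cache like this, 1,500 calls for the same build trigger one load instead of 1,500 queries, which is the pattern the telemetry exposed.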

Member

@timja timja left a comment

Getting close

@craigtmoore
Contributor Author

I think I've resolved all of your comments; please let me know if there is anything else. I really appreciate all of the feedback. I'm going to do a bit more testing, especially with the junit attachments plugin.

craigtmoore and others added 3 commits June 14, 2024 12:45
- Update DatabaseTestResultStorage:
  * add caseResultsCache
  * add packageResultsCache
  * update publish() method to use batch updates to improve performance
  * Add manual trace creation (annotation based tracing is only supported using an open telemetry agent)
    * add withSpan(), createSpan(), addSqlAttributes(), addPackageAttributes(), and getTracer() helper methods for creating manual traces to the code
  * add logger
- Update DatabaseTestResultStorageTest:
  * add basic unit-test with a mocked database
  * move duplicate code segments to constants and methods
  * print tables with proper column widths (and truncate values that are too large)
  * use LoggerRule for configuring the log level of the DatabaseTestResultStorage's logger
- Update pom.xml:
  * add dependency on io.jenkins.plugins:opentelemetry plugin
  * update dependencyManagement: fix kotlin and errorprone dependency versions, which caused the build to fail due to multiple versions being pulled in by transitive dependencies
  * add io.jenkins.plugins:caffine dependency (for caching)
  * add dependencyManagement for `kotlin-std-lib-jkd*` to fix dependency resolution issues
- update docker-compose.yaml:
  * add jaeger server
  * add jenkins server: skip setup wizard and add otel exporter (to jaeger)
- Add Dockerfile:
  * creates an instance of the Jenkins docker image (used as the 'jenkins' image in the docker-compose.yaml file)
  * install plugins using jenkins-plugin-cli: configuration-as-code, database-postgresql, and opentelemetry
  * install plugin junit-sql-plugin from the target folder and configure jenkins using jenkins-config.yaml file
- Add deploy.sh: bash script for compiling the plugin, building the 'jenkins' docker image, and deploying the docker-compose.yaml file
- Add jenkins-config.yaml:
  * add settings for postgresql plugin
  * configure the openTelemetry plugin
  * configure the junit plugin to store results in the database
- Update README.md:
  * Add examples on how to build plugin
  * Add example on how to deploy docker compose swarm
  * Add example on how to install the compiled plugin
  * Add example on how to configure jenkins using configuration-as-code plugin
  * Add note on Accessing jaeger
  * Add note on how to access the db inside of docker
Member

@timja timja left a comment

This looks really good. I've tested it out and fixed up the deploy script to work with build, along with a couple of other minor things.

I've found one issue, though, with the OpenTelemetry integration: I would expect the spans to be linked to the build, but they aren't:

image

I'm wondering if the junit step needs a StepHandler?

https://github.com/jenkinsci/opentelemetry-plugin/blob/03559ec4adf9ed6843e7fb0635a5e4cc9f961fb2/src/main/java/io/jenkins/plugins/opentelemetry/job/step/DurableTaskHandler.java

see also:
jenkinsci/opentelemetry-plugin#850

I couldn't see any TRACEPARENT env values being set

cc @cyrille-leclerc or @kuisathaverat if you can provide any advice

statement.setNull(12, Types.VARCHAR);
@Override
public void publish(TestResult result, TaskListener listener) throws IOException {
var publishSpan = createSpan("DatabaseTestResultStorage.RemotePublisherImpl.publish");
Member

As far as I can tell, this only works for agents on the controller, not remote agents

Member

@timja timja Jun 15, 2024

Setting -Dotel.java.global-autoconfigure.enabled=true on the agent seems to make it work

@timja
Member

timja commented Jun 15, 2024

I don't think it's StepHandler. I think if the variables are propagated through createRemotePublisher and this code is copied, then it would work:
https://github.com/jenkinsci/opentelemetry-plugin/blob/82e3e1b2574a68864dcd05b43b95d0aeb7f41249/src/main/java/io/jenkins/plugins/opentelemetry/job/OtelEnvironmentContributorService.java#L41-L52
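
As a rough illustration of what propagating the trace variables could involve, the sketch below parses and re-emits a W3C `traceparent` value, the format behind the TRACEPARENT variable. It assumes the value is captured on the controller as a plain string and handed to the remote publisher; the class is hypothetical and is not the code from the opentelemetry-plugin or from #414.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceParentSketch {
    // Version "00", 32 hex chars of trace id, 16 hex chars of parent span id, 2 hex chars of flags.
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String spanId;
    final boolean sampled;

    private TraceParentSketch(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Parses a version 00 traceparent value, rejecting malformed or all-zero ids. */
    static TraceParentSketch parse(String value) {
        Matcher matcher = FORMAT.matcher(value);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Not a version 00 traceparent: " + value);
        }
        String traceId = matcher.group(1);
        String spanId = matcher.group(2);
        if (isAllZero(traceId) || isAllZero(spanId)) {
            throw new IllegalArgumentException("All-zero ids are invalid: " + value);
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(matcher.group(3), 16) & 0x01) != 0;
        return new TraceParentSketch(traceId, spanId, sampled);
    }

    /** Re-emits the value; only the sampled flag is preserved in this sketch. */
    String format() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    private static boolean isAllZero(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }
}
```

Because the value is an ordinary string, it can travel with whatever serializable state is sent to the agent, where a child span could then be attached to the same trace.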

@timja timja merged commit e33b238 into jenkinsci:master Jun 15, 2024
14 checks passed
@timja
Member

timja commented Jun 15, 2024

Continued at #414

@jonesbusy
Contributor

Shouldn't the OpenTelemetry plugin be optional? What about Jenkins instances using SQL storage but not connected to an OpenTelemetry collector?

From what I remember from a few months ago, having the OpenTelemetry plugin installed displays some dead links on the job/build page when the plugin is not configured

@timja
Member

timja commented Jun 16, 2024

Possibly, though I think it would be better to remove those dead links if the plugin isn't configured at all.

Plugins should be able to integrate without it adding a bunch of things to the UI

@cyrille-leclerc
Contributor

Thanks for your interest in the opentelemetry plugin.
I think we could make it optional for simpler integration of otel in the other Jenkins plugins, I have a few checks to do.
I'll get back to you ASAP

@timja
Member

timja commented Jun 16, 2024

just raised a PR jenkinsci/opentelemetry-plugin#868

@kuisathaverat

I couldn't see any TRACEPARENT env values being set

You have to enable the export of the environment variables that define the TRACEPARENT, but I'm not sure how this plugin gets the OpenTelemetry context; it probably has to retrieve it from the job, not from an environment variable. We store the context in actions in the job.

@timja
Member

timja commented Jun 16, 2024

I was able to get it in #414

@cyrille-leclerc
Contributor

@craigtmoore @timja the way you integrate OpenTelemetry is great. GlobalOpenTelemetry.get() is the way to go.
I'm looking at some improvements to this pattern GlobalOpenTelemetry.get(). I'll get back to you shortly.

I saw @timja's PR, I'll review it ASAP. I have to understand the challenge you faced :-)

@timja
Member

timja commented Jun 17, 2024

Anything running on the controller would be quite straightforward, I think, but running on agents is a bit more complicated; it's not the nicest / cleanest implementation in #414

@craigtmoore craigtmoore deleted the add-caching-tracing-logging-batching branch July 26, 2024 14:18
Development

Successfully merging this pull request may close these issues.

Add tracing, logging, and caching to the plugin
5 participants