Memory leak #185

Open
benjwadams opened this issue Aug 2, 2024 · 16 comments
@benjwadams

Describe the bug
Memory leak under ERDDAP

To Reproduce
Steps to reproduce the behavior:
Uncertain, but this behavior occurs on some production ERDDAP servers with many aggregations.
Expected behavior
Reasonable memory usage that doesn't grow without bound.

Server

  • ERDDAP™ server: ERDDAP 2.23, based on Axiom's Docker-ERDDAP image
  • OS: Linux

Additional context
Memory grows without bound in certain ERDDAP configurations, eventually exhausting memory.

@ChrisJohnNOAA
Contributor

One thing to note, Java generally uses all available memory before initiating garbage collection. This behavior can sometimes be interpreted as a leak, though it is just how the JVM behaves. Does memory usage reach a peak and then drop? If your server has no queries for a period does memory usage drop to a low baseline? Does the server crash with an out of memory error?
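
A quick way to tell the two apart, assuming shell access to the server and the JDK's jcmd tool on the PATH (a generic sketch, not an ERDDAP-specific procedure; the pgrep pattern is a guess at how the process is named):

PID=$(pgrep -f catalina)   # hypothetical way to find the Tomcat/ERDDAP JVM
jcmd "$PID" GC.heap_info   # heap occupancy before
jcmd "$PID" GC.run         # request a full GC
jcmd "$PID" GC.heap_info   # heap occupancy after

If used heap falls back to a low baseline after the forced GC, you are most likely seeing normal JVM behavior rather than a leak.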

Can you provide additional information about your setup? For example, what datasets (or at least dataset types) are you using? Are there queries or features whose usage seems to trigger the issue? Does it seem to happen when the server is under high load? Do you have a lot of datasets that refresh, and are you using the new sharedWatchService?

Can you upgrade the software? ERDDAP's latest version is 2.24. I'd also recommend using the most recent LTS Java version (21), and ensuring the Tomcat version is up to date.
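
To confirm which versions are actually in play, the standard checks look like this (version.sh ships with Tomcat; the CATALINA_HOME path is illustrative):

java -version                      # should report a recent LTS, e.g. 21
"$CATALINA_HOME"/bin/version.sh    # reports the Tomcat version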

@srstsavage
Contributor

Currently docker-erddap doesn't have an image for ERDDAP 2.24. I can work on publishing that with Java 21 and updated Tomcat to facilitate testing here. Will ping when that's ready.

@rmendels
Collaborator

rmendels commented Aug 4, 2024

@srstsavage Looking up how to set up your Docker image, the documentation shows:

docker run \
  -p 8080:8080 \
  --env ERDDAP_MIN_MEMORY=4G --env ERDDAP_MAX_MEMORY=8G \
  ... \
  axiom/docker-erddap

ERDDAP_MIN_MEMORY and ERDDAP_MAX_MEMORY are not default settings in Tomcat, so it must be something you do internally in the Docker image. What do these map to inside the image, and generally what does the setenv.sh look like, along with any other scripts controlling the startup of Tomcat and Java (things like which garbage collector is being used, any restrictions on metaspace use, etc.)? It may just be a coincidence, but most of the reports of this have been from sites running the Docker image. We have had a related problem, but it has to do with swapping and running out of swap space, mainly because we need more memory. The Java memory model introduced in Java 16 changes a lot of how things are to be set. One of the key differences is that a lot of things that used to be in the heap are no longer put there, so total Java memory usage is much larger than the max heap size (on our heavily used system the heap is set at 10GB and rarely is more than 5GB-7GB used, but total Java memory usage stays around 21GB, so just monitoring heap size isn't sufficient). Basically you don't want any swapping (a long story).
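
One way to see that non-heap usage directly is the JDK's Native Memory Tracking. A generic sketch; it assumes the JVM was started with the tracking flag, which nothing in this thread confirms:

# Requires starting the JVM with: -XX:NativeMemoryTracking=summary
PID=$(pgrep -f catalina)               # hypothetical way to find the JVM pid
jcmd "$PID" VM.native_memory summary   # breaks out heap, metaspace, threads, GC, etc.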

Also, if memory serves, in this particular instance -Xms and -Xmx are set to different values. For a server-type setting it is much preferred that these be the same value; you can duckduckgo this and find discussions of why.
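
As a hedged illustration only (the actual docker-erddap setenv.sh is linked in the next comment; the 8G figure is arbitrary), equal values would look like:

# Illustrative setenv.sh fragment; equal -Xms/-Xmx avoids heap resizing pauses.
export CATALINA_OPTS="$CATALINA_OPTS -Xms8G -Xmx8G"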

@srstsavage
Contributor

srstsavage commented Aug 4, 2024

Docker image for ERDDAP 2.24 is available at axiom/docker-erddap:2.24-jdk21-openjdk

GitHub: tag, release

@rmendels At your suggestion I updated the README to favor ERDDAP_MEMORY which sets Xms and Xmx to the same value (#85). There may be some use cases where users would want to set those differently, but I agree that most users will probably want them set to the same value to avoid heap adjustments.
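
For example, a minimal invocation using the single variable (a sketch of the README's recommended pattern with the new image tag; the 8G value is illustrative):

docker run -p 8080:8080 \
  --env ERDDAP_MEMORY=8G \
  axiom/docker-erddap:2.24-jdk21-openjdk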

setenv.sh is available in the GitHub repository.

Do we have any concept of how many users are using Docker vs manually setting up a Tomcat instance? I have no clue, but I've only heard of Docker deployments for many years now (other than the CoastWatch mothership of course).

@rmendels
Collaborator

rmendels commented Aug 4, 2024

@srstsavage @benjwadams Great, thanks for the image! This might require some work, but it would help us if, once you have the particular ERDDAP instance mentioned above running, you could monitor the usage of the following:

  1. heap space use
  2. metaspace use
  3. total java memory use (and the total memory available)
  4. swap space use (and total swap space available)
  5. number of threads running under Java (I find the number given by, say, VisualVM is not as good as that given by btop)

A detailed time series isn't as important as the likely maximum values of each and some idea of how much they fluctuate. Particularly for total memory use, you need to have ERDDAP completely loaded and running for a while to get a feel for the total Java memory and the number of threads; one way to sample these is sketched below.
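
A generic shell sketch for sampling most of these, assuming JDK tools on the PATH (btop or VisualVM work too, as noted in the list above; the pgrep pattern is a guess):

PID=$(pgrep -f catalina)        # hypothetical way to find the JVM pid
jstat -gc "$PID" 10s            # heap and metaspace usage, sampled every 10 s
jcmd "$PID" VM.metaspace        # detailed metaspace breakdown (JDK 10+)
ps -o rss= -p "$PID"            # total resident memory of the process, in KB
free -m && swapon --show        # system memory and swap usage
ls /proc/"$PID"/task | wc -l    # thread count, similar to what btop reports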

One more thing that would be useful is to get an idea of what types of requests are being made during peak usage. For instance, if the heaviest use is when NDBC accesses the data, what are the URLs requested? These types of things may help us narrow down whether there is a leak and where it may be, or whether there are Java settings that can alleviate the problems.

@benjwadams
Author

cc @pavaniankam92

@benjwadams
Author

We're still working on instrumentation of this issue, but I did put an .hprof dump into the Eclipse Memory Analyzer Tool from an ERDDAP that had been running for four days or so:

[Screenshot from 2024-10-18 12-37-29: Eclipse Memory Analyzer Tool view of the .hprof dump]

Memory usage currently reported by top is well in excess of this:

4062619 rocky 20 0 31.0g 23.8g 0 S 1.0 38.3 9472:41 java
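
(For reference, a heap dump like the one analyzed above can be captured with the JDK's jmap; this is a generic sketch, not necessarily how this dump was produced:)

# pid 4062619 taken from the top output above
jmap -dump:live,format=b,file=erddap.hprof 4062619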

@ChrisJohnNOAA
Contributor

> We're still working on instrumentation of this issue, but I did put an .hprof dump into the Eclipse Memory Analyzer Tool from an ERDDAP that had been running for four days or so:
>
> [Screenshot from 2024-10-18 12-37-29: Eclipse Memory Analyzer Tool view of the .hprof dump]
>
> Memory usage currently reported by top is well in excess of this:
>
> 4062619 rocky 20 0 31.0g 23.8g 0 S 1.0 38.3 9472:41 java

Erddap.java holds a number of ConcurrentHashMaps which have data loaded into them shortly after the server starts up, and which hold it as long as the server is running; some profilers might flag this as a potential leak because that data will be very old. Knowing which map(s) are large for you, and whether they are growing (particularly, growing after the initial load of datasets), would be extremely helpful, since I have not reproduced the problem you are having.
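
One low-overhead way to watch for that growth without taking repeated full dumps is to compare class histograms over time (a generic sketch; the pgrep pattern is a guess):

PID=$(pgrep -f catalina)             # hypothetical way to find the JVM pid
jmap -histo:live "$PID" | head -20   # top classes by instance count and bytes
                                     # note: the :live option forces a full GC first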

@pavaniankam92

Configured JVM metrics for the ERDDAP application.

Please find the metrics for the past 12 hours; screenshots are attached here for your reference. We will monitor the JVM metrics and see which parameters spike during peak load.

[Screenshots: HeapMemory, HeapMemory graph, Metaspace, Threads data]

Let me know if you have any questions.

@benjwadams
Author

benjwadams commented Nov 7, 2024

Having some serious ERDDAP issues today where I/O is getting bogged down and server load average is extremely high.

I did get an interesting stack trace in the logs, however, along with a log message that explicitly mentions possible memory leak conditions:

07-Nov-2024 17:53:09.247 WARNING [Thread-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [erddap] is still processing a request that has yet to finish. This is very likely to create a memory leak. You can control the time allowed for requests to finish by using the unloadDelay attribute of the standard Context implementation. Stack trace of request processing thread:[
 java.base@17.0.9/java.io.RandomAccessFile.readBytes(Native Method)
 java.base@17.0.9/java.io.RandomAccessFile.read(RandomAccessFile.java:405)
 ucar.unidata.io.RandomAccessFile.read_(RandomAccessFile.java:738)
 ucar.unidata.io.RandomAccessFile.readBuffer(RandomAccessFile.java:503)
 ucar.unidata.io.RandomAccessFile.seek(RandomAccessFile.java:491)
 ucar.nc2.iosp.hdf5.BTree2$LeafNode.<init>(BTree2.java:181)
 ucar.nc2.iosp.hdf5.BTree2$InternalNode.recurse(BTree2.java:168)
 ucar.nc2.iosp.hdf5.BTree2.<init>(BTree2.java:88)
 ucar.nc2.internal.iosp.hdf5.H5objects.readGroupNew(H5objects.java:2112)
 ucar.nc2.internal.iosp.hdf5.H5objects.access$700(H5objects.java:27)
 ucar.nc2.internal.iosp.hdf5.H5objects$H5Group.<init>(H5objects.java:197)
 ucar.nc2.internal.iosp.hdf5.H5objects$H5Group.<init>(H5objects.java:155)
 ucar.nc2.internal.iosp.hdf5.H5objects.readRootSymbolTable(H5objects.java:59)
 ucar.nc2.internal.iosp.hdf5.H5headerNew.readSuperBlock1(H5headerNew.java:357)
 ucar.nc2.internal.iosp.hdf5.H5headerNew.read(H5headerNew.java:253)
 ucar.nc2.internal.iosp.hdf5.H5iospNew.build(H5iospNew.java:122)
 ucar.nc2.NetcdfFiles.build(NetcdfFiles.java:811)
 ucar.nc2.NetcdfFiles.open(NetcdfFiles.java:750)
 ucar.nc2.NetcdfFiles.open(NetcdfFiles.java:276)
 ucar.nc2.NetcdfFiles.open(NetcdfFiles.java:243)
 ucar.nc2.NetcdfFiles.open(NetcdfFiles.java:216)
 gov.noaa.pfel.coastwatch.griddata.NcHelper.openFile(NcHelper.java:849)
 gov.noaa.pfel.coastwatch.pointdata.Table.readNDNc(Table.java:6866)
 gov.noaa.pfel.erddap.dataset.EDDTableFromNcFiles.lowGetSourceDataFromFile(EDDTableFromNcFiles.java:212)
 gov.noaa.pfel.erddap.dataset.EDDTableFromFiles.getSourceDataFromFile(EDDTableFromFiles.java:3271)
 gov.noaa.pfel.erddap.dataset.EDDTableFromFilesCallable.call(EDDTableFromFilesCallable.java:105)
 gov.noaa.pfel.erddap.dataset.EDDTableFromFilesCallable.call(EDDTableFromFilesCallable.java:31)
 gov.noaa.pfel.erddap.util.ThreadedWorkManager.addTask(ThreadedWorkManager.java:35)
 gov.noaa.pfel.erddap.dataset.EDDTableFromFiles.getDataForDapQuery(EDDTableFromFiles.java:3925)
 gov.noaa.pfel.erddap.dataset.EDDTable.getTwawmForDapQuery(EDDTable.java:1744)
 gov.noaa.pfel.erddap.dataset.EDDTable.respondToDapQuery(EDDTable.java:3271)
 gov.noaa.pfel.erddap.Erddap.doDap(Erddap.java:4750)
 gov.noaa.pfel.erddap.Erddap.doGet(Erddap.java:629)
 jakarta.servlet.http.HttpServlet.service(HttpServlet.java:564)
 jakarta.servlet.http.HttpServlet.service(HttpServlet.java:658)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:205)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
 org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:174)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
 org.apache.catalina.filters.CorsFilter.handleNonCORS(CorsFilter.java:331)
 org.apache.catalina.filters.CorsFilter.doFilter(CorsFilter.java:158)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:174)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
 org.apache.catalina.filters.HttpHeaderSecurityFilter.doFilter(HttpHeaderSecurityFilter.java:129)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:174)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:167)
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:90)
 org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:482)
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:115)
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:93)
 org.apache.catalina.valves.RemoteIpValve.invoke(RemoteIpValve.java:735)
 org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:673)
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:340)
 org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:391)
 org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)
 org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:896)
 org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1744)
 org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
 org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
 org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
 org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
 java.base@17.0.9/java.lang.Thread.run(Thread.java:840)]

@ChrisJohnNOAA
Contributor

@benjwadams Based on that stack trace, it looks like there may be expensive queries against NetCDF-based dataset(s). Can you check the ERDDAP logs for what the expensive queries might be? If a specific dataset is causing a lot of expensive reads, I could try to look into what might be slow if you share the dataset (and some sample files).
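
(One way to surface those, assuming the "(>10s!)" marker ERDDAP writes next to slow requests, as seen in the log excerpt later in this thread:)

grep -F '(>10s!)' log.txt | head     # requests that took longer than 10 s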

@benjwadams
Author

I've seen a couple, but I'm not sure if they were slow on their own or because load averages were already high in the first place.
I think it may be due to a sort taking place on some queries -- how does ERDDAP index data for operations such as sorting?

@benjwadams
Author

Highly abridged output:

grep '#2062' log.txt{.previous,}
log.txt.previous:{{{{#2062 2024-11-07T18:57:39+00:00 [https] (unknownIPAddress) GET /erddap/tabledap/ce_312-20200705T1232.json?longitude,latitude,qartod_location_test_flag,time&orderBy(%22time%22)
log.txt:#2062 Error: ClientAbortException
log.txt:#2062 FAILURE. TIME=342080ms  (>10s!)
log.txt:*** sendError for request #2062 caught ERROR=ClientAbortException

Pertinent datasets.xml entry excerpt:

                        <!-- defaultDataQuery uses datasetID -->
                        <!--
                        <defaultDataQuery>&amp;trajectory=ce_312-20200705T1232</defaultDataQuery>
                        <defaultGraphQuery>longitude,latitude,time&amp;.draw=markers&amp;.marker=2|5&.color=0xFFFFFF&.colorBar=|||||</defaultGraphQuery>
                        -->
                        <reloadEveryNMinutes>720</reloadEveryNMinutes>
                        <updateEveryNMillis>-1</updateEveryNMillis>
                        <!-- use datasetID as the directory name -->
                        <fileDir>/data/data/priv_erddap/OOI-CE/ce_312-20200705T1232</fileDir>
                        <recursive>false</recursive>
                        <fileNameRegex>.*\.nc</fileNameRegex>
                        <metadataFrom>last</metadataFrom>
                        <sortedColumnSourceName>time</sortedColumnSourceName>
                        <sortFilesBySourceNames>trajectory time</sortFilesBySourceNames>
                        <fileTableInMemory>false</fileTableInMemory>
                        <accessibleViaFiles>true</accessibleViaFiles>

@rmendels
Collaborator

rmendels commented Nov 7, 2024 via email

@benjwadams
Author

Haven't been seeing extreme requests from external sources.

I think it might be the subset variables. I tried without qartod_location_test_flag and it ran very quickly. With it, the query took about 7 1/2 minutes.
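
A simple way to make that comparison reproducible (sketch only; the host is hypothetical, and the query is the one from the log excerpt above):

time curl -s -o /dev/null 'https://example.org/erddap/tabledap/ce_312-20200705T1232.json?longitude,latitude,time&orderBy(%22time%22)'
time curl -s -o /dev/null 'https://example.org/erddap/tabledap/ce_312-20200705T1232.json?longitude,latitude,qartod_location_test_flag,time&orderBy(%22time%22)'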

This might be somewhat out of left field relative to the memory problems in this GitHub issue, but it does seem to eat up quite a bit of I/O. Perhaps it's worth opening a separate issue for this particular behavior.

Of course, spawning a bunch of ERDDAP threads will chew up memory and CPU too while they're waiting for results.

@ChrisJohnNOAA
Contributor

@benjwadams Please do make a separate issue about the subset variables. It would be helpful if you could provide a full dataset definition, data files, and example queries on that dataset that demonstrate the issue.
