
Improve performance / resource usage for big codebases #10

Closed
Trass3r opened this issue Sep 19, 2019 · 6 comments
Labels
enhancement New feature or request

Comments

@Trass3r
Contributor

Trass3r commented Sep 19, 2019

The analyzer can't read large JSON files due to this code:

diff --git a/src/main.cpp b/src/main.cpp
index 380b26f..fdfafbc 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -26,7 +26,7 @@ static std::string ReadFileToString(const std::string& path)
     if (!f)
         return "";
     fseek(f, 0, SEEK_END);
-    size_t fsize = ftell(f);
+    size_t fsize = _ftelli64(f);
     fseek(f, 0, SEEK_SET);
     std::string str;
     str.resize(fsize);

Maybe it should just use memory-mapped files.
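For illustration, one portable way to avoid the 32-bit ftell truncation (a sketch, not the project's actual fix) is to query the size via std::filesystem::file_size (C++17) instead of seeking and telling:

```cpp
// Sketch: read a whole file into a string without 32-bit ftell
// truncation, using std::filesystem::file_size (C++17).
// Assumes the file fits in memory; errors collapse to returning "".
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>

static std::string ReadFileToString(const std::string& path)
{
    std::error_code ec;
    const std::uintmax_t fsize = std::filesystem::file_size(path, ec);
    if (ec)
        return "";
    std::ifstream f(path, std::ios::binary);
    if (!f)
        return "";
    std::string str(static_cast<size_t>(fsize), '\0');
    f.read(str.data(), static_cast<std::streamsize>(str.size()));
    return str;
}
```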

@Trass3r
Contributor Author

Trass3r commented Sep 19, 2019

Also, it uses a lot of memory on such large codebases: around 27 GB for clang.

@aras-p
Owner

aras-p commented Sep 21, 2019

Yeah, just changed it to the 64-bit variants of ftell (which, it turns out, differ on each OS), thanks.

I have some thoughts about improving performance and reducing memory usage for very large codebases, but didn't get to do that yet.

@aras-p aras-p changed the title large file support Improve performance / resource usage for big codebases Sep 28, 2019
@aras-p aras-p added the enhancement New feature or request label Sep 28, 2019
@JVApen

JVApen commented Sep 30, 2019

I've encountered similar issues when trying this. To get it partially working, I had to rewrite parts of the code. My changes aren't ready to share, as I mainly hacked them in, and I'm not sure I even want to clean them up.

Some issues I found:

  • The program takes everything into memory. Instead of reading all files and keeping their contents in memory, I read them once for the filtering and stored the file name instead of the content. On generation, I simply read each file a second time and immediately write it to a std::fstream.
  • I've replaced the std::map of the JsonFileFinder with a std::vector<std::pair<std::string, std::string>>; I guess a std::vector<std::string> could do?
  • Although the previous changes made it possible to generate the combined file, it still almost ran out of memory on a powerful machine. I wonder whether it wouldn't make more sense to simply write it while reading, so you don't need to keep state.

The analyze phase currently takes far too much memory to be usable (a combined file of 40 GB causes a memory usage of 220 GB). From what I can see, the choice of JSON parser isn't suitable for this kind of magnitude. As it's a DOM parser, it first needs the complete file in memory and then translates it into a tree (which takes even more space) before any processing can be done. If you really want to support big files, I think you'll need a SAX parser.

@aras-p
Owner

aras-p commented Sep 30, 2019

I was thinking that the majority of the space in the combined JSON file (or even in a single JSON file) is redundant strings, e.g. the full path to where exactly <vector> was, over and over again.

My plan is, at some point, to make the "smash all JSONs into one huge JSON" step a bit more intelligent. It could de-duplicate strings and just store their IDs, with an ID->string table elsewhere. Maybe then it would not be as huge.
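The de-duplication idea described above is essentially string interning. As a sketch of the concept (not the eventual implementation), each distinct string is stored once and referred to by a small integer ID thereafter:

```cpp
// Sketch of string interning: Intern returns the same small ID for
// equal strings, and Lookup maps an ID back to its string. Repeated
// paths then cost one integer each instead of a full copy.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class StringTable
{
public:
    uint32_t Intern(const std::string& s)
    {
        auto it = m_Ids.find(s);
        if (it != m_Ids.end())
            return it->second; // already seen: reuse its ID
        uint32_t id = static_cast<uint32_t>(m_Strings.size());
        m_Strings.push_back(s);
        m_Ids.emplace(s, id);
        return id;
    }
    const std::string& Lookup(uint32_t id) const { return m_Strings[id]; }

private:
    std::unordered_map<std::string, uint32_t> m_Ids; // string -> ID
    std::vector<std::string> m_Strings;              // ID -> string
};
```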

@aras-p
Owner

aras-p commented Mar 2, 2020

@Trass3r @JVApen I did a bunch of changes (memory handling, threading, ...) that made the tool 2x faster, use 10x less memory, and produce a 4x smaller data file in my tests; see #37 -- plan to merge it to the master branch soon.

@aras-p
Owner

aras-p commented Mar 2, 2020

Merged the above to master; it should behave better than previously. If there are still issues on your codebases, please reopen!

@aras-p aras-p closed this as completed Mar 2, 2020