
Improve performance / resource usage for big codebases #10

Closed
Trass3r opened this issue Sep 19, 2019 · 6 comments
Labels
enhancement New feature or request

Comments

@Trass3r
Contributor

Trass3r commented Sep 19, 2019

The analyzer can't read large JSON files due to this code:

diff --git a/src/main.cpp b/src/main.cpp
index 380b26f..fdfafbc 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -26,7 +26,7 @@ static std::string ReadFileToString(const std::string& path)
     if (!f)
         return "";
     fseek(f, 0, SEEK_END);
-    size_t fsize = ftell(f);
+    size_t fsize = _ftelli64(f);
     fseek(f, 0, SEEK_SET);
     std::string str;
     str.resize(fsize);

Maybe it should just use memory-mapped files.
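For illustration, one portable way to avoid the 32-bit ftell truncation (a sketch, not the project's actual fix) is to query the size via std::filesystem::file_size (C++17) instead of seeking and telling:

```cpp
// Sketch: read a whole file into a string without 32-bit ftell
// truncation, using std::filesystem::file_size (C++17).
// Assumes the file fits in memory; errors collapse to returning "".
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>

static std::string ReadFileToString(const std::string& path)
{
    std::error_code ec;
    const std::uintmax_t fsize = std::filesystem::file_size(path, ec);
    if (ec)
        return "";
    std::ifstream f(path, std::ios::binary);
    if (!f)
        return "";
    std::string str(static_cast<size_t>(fsize), '\0');
    f.read(str.data(), static_cast<std::streamsize>(str.size()));
    return str;
}
```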

@Trass3r
Contributor Author

Trass3r commented Sep 19, 2019

Also, it uses a lot of memory on such large codebases: around 27 GB for clang.

@aras-p
Owner

aras-p commented Sep 21, 2019

Yeah, just changed it to the 64-bit variants of ftell (which, it turns out, differ on each OS), thanks.

I have some thoughts about improving performance and reducing memory usage for very large codebases, but didn't get to do that yet.

@aras-p aras-p changed the title large file support Improve performance / resource usage for big codebases Sep 28, 2019
@aras-p aras-p added the enhancement New feature or request label Sep 28, 2019
@JVApen

JVApen commented Sep 30, 2019

I've encountered similar issues when trying this. To get it partially working, I had to rewrite parts of the code. My changes aren't ready to share, as I mainly hacked them in, and I'm not sure I even want to clean them up.

Some issues I found:

  • The program takes everything into memory. Instead of reading all files and keeping their contents in memory, I read them once for the filtering and stored the file name instead of the content. On generation, I simply read each file a second time and immediately write it to a std::fstream.
  • I've replaced the std::map of the JsonFileFinder with a std::vector<std::pair<std::string, std::string>>; I guess a std::vector<std::string> could do?
  • Although the previous changes made it possible to generate the combined file, it still almost ran out of memory on a powerful machine. I wonder whether it wouldn't make more sense to simply write it while reading, so you don't need to keep state.

The analyze phase currently takes far too much memory to be usable (a combined file of 40 GB causes a memory usage of 220 GB). From what I can see, the choice of JSON parser isn't suitable for this kind of magnitude. As it's a DOM parser, it first needs the complete file in memory and then translates it into a tree (which takes even more space) before any processing can be done. If you really want to support big files, I think you'll need a SAX parser.

@aras-p
Owner

aras-p commented Sep 30, 2019

I was thinking that the majority of the space in the combined JSON file (or even in a single JSON file) is redundant strings, e.g. the full path to where exactly <vector> was, over and over again.

My plan is, at some point, to make the "smash all JSONs into one huge JSON" step a bit more intelligent. It could de-duplicate strings and just store their IDs, with an ID->string table elsewhere. Maybe then it would not be as huge.
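The de-duplication idea described above is essentially string interning. As a sketch of the concept (not the eventual implementation), each distinct string is stored once and referred to by a small integer ID thereafter:

```cpp
// Sketch of string interning: Intern returns the same small ID for
// equal strings, and Lookup maps an ID back to its string. Repeated
// paths then cost one integer each instead of a full copy.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class StringTable
{
public:
    uint32_t Intern(const std::string& s)
    {
        auto it = m_Ids.find(s);
        if (it != m_Ids.end())
            return it->second; // already seen: reuse its ID
        uint32_t id = static_cast<uint32_t>(m_Strings.size());
        m_Strings.push_back(s);
        m_Ids.emplace(s, id);
        return id;
    }
    const std::string& Lookup(uint32_t id) const { return m_Strings[id]; }

private:
    std::unordered_map<std::string, uint32_t> m_Ids; // string -> ID
    std::vector<std::string> m_Strings;              // ID -> string
};
```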

@aras-p
Owner

aras-p commented Mar 2, 2020

@Trass3r @JVApen I did a bunch of changes (memory handling, threading, ...) that made the tool 2x faster, use 10x less memory, and produce a 4x smaller data file in my tests; see #37 -- plan to merge it to the master branch soon.

@aras-p
Owner

aras-p commented Mar 2, 2020

Merged the above to master; it should behave better than previously. If there are still issues on your codebases, please reopen!

@aras-p aras-p closed this as completed Mar 2, 2020