Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

include performance improvements for FastTrack2 (memory + run-time) #84

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

taranbis
Copy link
Collaborator

@taranbis taranbis commented Jan 9, 2021

• reduced memory consumption on the VarState class
• improved parallelism on the VarState hash map by removing the Fasttrack lock and introduced locking at sub map level
• introduced 3 memory address removal policies + 1 default to cap memory consumption
• modified removal functions to also remove elements from the hash map used for storing call stack information. This includes replacing make_shared() function in Fasttrack::create_thread()
• changed unit tests to be more general, to work via the general interface using log counters 
• included the implementation of the cache efficient prefix tree (not currently used, only experimental, but maybe will useful later after some further optimizations

@taranbis taranbis added the performance performance could be better label Jan 9, 2021
@taranbis
Copy link
Collaborator Author

taranbis commented Jan 9, 2021

Hi,

I solved the conflicts as well, as mainly all them were because a context was added to detector initialization.

@fmoessbauer
Copy link
Member

Hi @taranbis , thanks for this contribution.
That is a lot of code to review, so it will take some time for us to review.

May I ask you to try to cleanup the commit history a bit in the meantime.
There seems to be a lot of "working-commit" commits.
Maybe you could group that up into features and try one commit per feature.

As you wrote, the effectiveness of some optimizations depends strongly on the application under test.
Currently we do not have a clean interface to pass arguments into the detector (we have to register all options on DRace currently).

@mhmdkanj Could you take a look into that and check what is possible with clipp to do so?
I'd prefer a method on the detector where it registers all options along with a docstring.
Also, we could think of a "detector-options" group, which is first constructed when the detector is known (might be a bit tricky because selecting the detector itself is already an option).
Anyways, maybe we should discuss this in a separate issue.

@taranbis
Copy link
Collaborator Author

Hi @fmoessbauer ,

I've arranged the commit history. will look over it again, as there were many commits. The project compiles and runs flawlessly in any of these commits.

As a TODO for me is still to check if there is a run-time drawback in using a stack of thread numbers instead of linearly increasing thread numbers. If there's not I will change the first commit to work with a stack of thread numbers, otherwise we will discuss here.

@mhmdkanj
Copy link
Collaborator

Hi @taranbis ,

Please check the latest master as from now on you could pass detector-specific options using the CLI, and if you're using the GUI, there as well. In brief, you could do that after specifying the detector flag with -d fasttrack -option1 -option2 - you could see that in the DRace README for the usage.

In case you had questions/concerns/bugs with this usage please feel free to contact me!

Copy link
Collaborator

@mhmdkanj mhmdkanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @taranbis

First of all, great effort on your additions and changes!

I've left some comments throughout the files:

  • The most critical one being a compilation error in fasttrack.h (at least on my side), which needs a double-checking. Please refer to the corresponding comment for more info.
  • Most of the remaining comments pertain to style changes in order to be consistent with all other files - should be fast to change.
  • A couple of these comments pertain to questions regarding the intended use (in the case where several classes are put together in one header file) and whether this could be further beautified.

Upon making further changes and committing, please make sure of the following:

  • The clang style formatting script is executed on the changed file
  • The PR is rebased with the new master

After resolving these comments, we will further review the changes as to their implementation and usage.

Best Regards,
Mohamad

standalone/detectors/fasttrack/include/fasttrack.h Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/include/fasttrack.h Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/include/fasttrack.h Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/include/fasttrack.h Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/include/MemoryPool.h Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/include/vectorclock.h Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/src/threadstate.cpp Outdated Show resolved Hide resolved
standalone/ft_benchmark/ft_benchmark.cpp Outdated Show resolved Hide resolved
@taranbis
Copy link
Collaborator Author

Hi @mhmdkanj,

I will look within the next days over your comments. For the moment I just want to talk about the compilation error in fasttrack.h. I changed a file in the phmap submodule, namely, phmap.h to be able to drop an entire sub hash map, as I couldn't work with just the public interface. This is why the function is missing.

The changes cannot be commited from my side. They could be placed as a patch to the phmap implementation.

Best greetings,
Mihai

@mhmdkanj
Copy link
Collaborator

Hi @mhmdkanj,

I will look within the next days over your comments. For the moment I just want to talk about the compilation error in fasttrack.h. I changed a file in the phmap submodule, namely, phmap.h to be able to drop an entire sub hash map, as I couldn't work with just the public interface. This is why the function is missing.

The changes cannot be commited from my side. They could be placed as a patch to the phmap implementation.

Best greetings,
Mihai

@taranbis In the course of solving this issue, try to avoid the need to fork the phmap repo in order to patch it - but rather come up with an implementation on the DRace side, or if greg7mdp can patch it by opening up an issue there, etc... This can be up for discussion as well.

@taranbis
Copy link
Collaborator Author

Hi @mhmdkanj ,

well this feature is used to remove data from tracking and I am working with private members of the phmap class as I am dropping an entire sub hash map of memory addresses. @fmoessbauer knows about this. I don't have a solution to keep the feature will not modifying at least something in the fast parallel hash map implementation.

  1. to patch the hash map implementation.
  2. it could also work if the array of sub hash map in the phmap implementation is declared public, as it is currently private. This is why I had to work create the functions within that class. If it is public, I could move all implementations to DRace side However, it would not make much of a sense to declare the array of sub hash maps as public.
  3. just drop the feature, but I think it is faster than random or lowest clock removals. I could come up with some measurements if dropping this feature comes into discussion.

of course, I am open to suggestions and the choice is yours in the end.

All the best,
Mihai

@fmoessbauer
Copy link
Member

Hi @taranbis , I did already discuss this internally with @mhmdkanj and also back then was aware of the issue.
Nonetheless, we want to avoid having to maintain a fork of the phmap. By that, we have two options:

  1. upstream the necessary feature. As we have a use-case for that, we could try to get into contact with greg7mdp and discuss options for integration. Maybe just declaring some members protected would help, so that we can inherit from the map.
  2. Use another strategy that does not rely on the internals of the phmap. But according to your results, this was the most promising strategy, was it?

Anyways, thanks for this well-written contribution and for your interest in speeding up DRace ;)

@taranbis
Copy link
Collaborator Author

taranbis commented Feb 2, 2021

Hi @fmoessbauer , Hi @mhmdkanj,

I didn't get the chance to look over all your proposed changes. However, what I have done is talk to greg7mdp and he added the necessary API to be able to drop a sub hash map, so we can maintain all proposed features:

    void clear(std::size_t submap_index) {
        Inner& inner = sets_[submap_index];
        typename Lockable::UniqueLock m(inner);
        inner.set_.clear();
    }

Within the last commit I added the new API functionality and I also changed some other things that failed due to a changed namespace name. Now it should compile fine, so you can test in the meantime while I modify also the other things

Best regards!

@fmoessbauer
Copy link
Member

Hi @taranbis,

However, what I have done is talk to greg7mdp and he added the necessary API to be able to drop a sub hash map, so we can maintain all proposed features:

Excellent work! Now we can just use the upstream version and benefit from (possible) improvements there, instead of maintaining our own fork which would bitrot at some point in time.

Thanks also for taking the time to fix the other mentioned points. Is there still something conceptual, where we can help you? Or anything that needs a discussion?

All the best, Felix

@taranbis
Copy link
Collaborator Author

Hi,

I am sorry for the long due reply, but I have been really busy. First of all, I have walked though all the proposed changes and did all the required changes. However, working though the project again I realized that something could be improved, namely replacing pointers in prefix trees with shared pointers and also counting references in order to be able to remove them when needed.

However, this might take some time for me to investigate, but I will happily do it to see that memory is ok deallocated and that there is a mechanism for future removals.

All the best,
Mihai

@fmoessbauer
Copy link
Member

Hi Mihai, thanks for taking care of integrating the required changes.
Just take your time on the shared-ptr stuff.

I think we can integrate that in the meantime. Then, just create a new PR for that.

@mhmdkanj : Could you please give the new version a try?

Best, Felix

Copy link
Collaborator

@mhmdkanj mhmdkanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @taranbis

Thank you lots for incorporating these changes! The required code style is thus matched.

Also, it compiles successfully now & I tried executing the integration tests locally with fasttrack, whereby all had succeeded (except for the DotnetClrMonitor one for some reason; we can check that out later @fmoessbauer, most probably the source is elsewhere).

Please take a look at the couple of comments I posted after the new changes. Don't worry though; they are mostly code-style issues due to the new changes and/or things I forgot to review previously. So, they should be fast to incorporate. The most pressing one is rebasing the branch onto the new master so that logging of fasttrack details could be toggled on/off by the user using fasttrack flags rather than being toggled on by default.

Take your time with adjusting these small details, and afterwards we could move to start integrating them.
If you have any inquiries feel free to contact me.

Best Regards,
Mohamad


void read_from_block(std::vector<std::pair<uintptr_t*, uintptr_t*>>* blocks) {
try {
std::uniform_int_distribution<int> dist(0, blocks->size() - 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 41 generates a warning that conversion from 'size_t' to '_Ty', possible loss of data; if this is intended, then it would be safe to ignore. Similarly in line 61.

standalone/ft_benchmark/ft_benchmark.cpp Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/include/SelectAllocator.h Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/include/StackTreeNode.h Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/include/PoolAllocator.h Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/include/PrefixTreeDepot.h Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/include/fasttrack.h Outdated Show resolved Hide resolved
standalone/detectors/fasttrack/include/fasttrack.h Outdated Show resolved Hide resolved
@mhmdkanj
Copy link
Collaborator

Hi @fmoessbauer , I've mentioned you in three revision comments whereby your input might be needed, whenever you are able. Thanks in advance!

taranbis and others added 8 commits June 23, 2021 13:26
- removed pointer for read-shared state from VarState class and replaced
  it with a hash map mapping memory addresses to the respective vector
  clocks in case of a read-shared state
- reduce read and write epoch of the VarState class from 64 bits
  (32 bits thread ID + 32 clock value) to 32 bits (11 bits thread num +
  21 bits clock value)
- remove size member of the VarState class

Result: sizeof(VarState) = 8

- increased concurrency by using internal locks of the parallel hash map
  used to store VarStates
- switched std::list to std::deque for callstack information for
  improved performance
- FastTrack tests now work via the general interface (easier to extend)
- Functions return_stack_trace() and set_read_write() moved to
  ThreadState as by doing so it provides better decoupling with the
  callstack storage implementation, as storing the call stack will
  just have to implement 4 functions regardless of the implementation.
  These functions are accesses from Fasttrack class or from ThreadState
- this is not an improvment feature, but a necessary one, especially
  useful in the case of long-running applications
- 4 types of removal policies, 1 default removing retired memory
  addresses and 3 that can be selected: random removal, hash map pruning
  and removal by lowest clock
- VarStates can be removed even if they are in a read-shared state
- benchmark can be modified to test different performance aspects of
  DRace, such as more memory accessed and less threads or more threads
  and less memory accessed
- Implementations for storing call stack information will just have to
  implement 4 functions: make_trace(), get_current_element(),
  insert_function_element() and remove_function_element();
- PrefixTree_StackDepot is the first version of adapting a prefix tree
  to be used for storing call stack information. It uses hash maps
  instead of constant size arrays;
- PrefixTreeDepot.h represents the cache efficient optimized version for
  the prefix tree implementation.
@mhmdkanj
Copy link
Collaborator

mhmdkanj commented Jun 24, 2021

Hi @fmoessbauer ,
Current status:

    • My revisions above have already been merged
    • All Integration tests pass
    • 9/42 unit tests fail, all of which pertain to the newly written fasttrack tests; I can send you the log when/if needed

@taranbis
Copy link
Collaborator Author

Hi @mhmdkanj,

since the newly written fasttrack is responsible for that, I will have a look. I will get until Sunday back to you (sorry the last few months have been crazy). you can send the logs to my private email address.

@mhmdkanj
Copy link
Collaborator

mhmdkanj commented Jun 24, 2021

Hi @mhmdkanj,

since the newly written fasttrack is responsible for that, I will have a look. I will get until Sunday back to you (sorry the last few months have been crazy). you can send the logs to my private email address.

Hi @taranbis ,
I've sent you the text file to your yahoo mail.

Best,
Mohamad

@taranbis
Copy link
Collaborator Author

Hi @mhmdkanj ,

I looked over the tests. So, the reason the tests fail is because they use the logging structured In the last commit: refactor: set default logging to false in fasttrack (6f7d3da), you removed the final_output variable and made log_flag false.

The tests used the logging structure as it was an easy way not to depend on the internals of the fasttrack algorithm and pass info to fasttrack like normally. If you undo your commit, the tests should work fine.

I am already thinking of ways to rewrite this and one such way would be the callback. However, it doesn't hold all the info needed to differentiate the cases.

Until then, the integration tests pass if set the log flag to true. I did also a final_output flag to not print it when it was not necessary.

All the best,
Mihai

@taranbis
Copy link
Collaborator Author

Oh and one more thing. The callback not only does not hold all the required info, it is also only triggered when there's a race. You cannot distinguish the other use cases such as read-exclusive, read-shared-same-epoch etc.

@taranbis
Copy link
Collaborator Author

Hi @mhmdkanj,

Nevermind. I solved it. I added a wrapper on top of fasttrack to work with the testing framework. Now logging is like in your commit (false by default) and integration tests work. See my commit refactor: add class test wrapper (f381862).

Tell me if there is anything else.

P.S.: try not to use force pushes :D.

All the best,
Mihai

@mhmdkanj
Copy link
Collaborator

mhmdkanj commented Jun 29, 2021

Hi @mhmdkanj,

Nevermind. I solved it. I added a wrapper on top of fasttrack to work with the testing framework. Now logging is like in your commit (false by default) and integration tests work. See my commit refactor: add class test wrapper (f381862).

Tell me if there is anything else.

P.S.: try not to use force pushes :D.

All the best,
Mihai

Hi @taranbis ,
Yup this looks good. However:

97% tests passed, 2 tests failed out of 59

Total Test time (real) =  21.98 sec

The following tests FAILED:
         58 - Interface/DetectorTest.Parallelism/"fasttrack.standalone" (Failed)
         59 - Interface/DetectorTest.Parallelism/"tsan" (Failed)

Do these two unit tests pass on your end?

@taranbis
Copy link
Collaborator Author

taranbis commented Jun 29, 2021

Hi @mhmdkanj ,

No, but this is not fasttrack related. it also fails for tsan. This is a test written by @fmoessbauer to check that concurrency works well on the detector side. I might be able to have a look into those, but i guess @fmoessbauer knows better. I will check nonetheless.

@mhmdkanj
Copy link
Collaborator

@taranbis True, it's probably not a direct consequence but I was wondering if you were experiencing the same things.

@mhmdkanj
Copy link
Collaborator

Hi @fmoessbauer ,
So on my machine, the following tests cause the missing of a 100% pass rate:

Unit Tests

The following tests FAILED:
         45 - Interface/DetectorTest.ForkInitialize/"tsan" (Failed)
         47 - Interface/DetectorTest.Barrier/"tsan" (Failed)
         49 - Interface/DetectorTest.ResetRange/"tsan" (Failed)
         51 - Interface/DetectorTest.RaceInspection/"tsan" (Failed)
         58 - Interface/DetectorTest.Parallelism/"fasttrack.standalone" (Failed)
         59 - Interface/DetectorTest.Parallelism/"tsan" (Failed)

Most of the time only the last two fail; however, occasionally the others fail as well.

Integration Tests

[  FAILED  ] Integration/DR.HowToSolution/0, where GetParam() = "fasttrack"
[  FAILED  ] Integration/DR.NoIoRaces/0, where GetParam() = "fasttrack"

Usually either one or both of these fail.

@fmoessbauer
Copy link
Member

Hi @taranbis , @mhmdkanj ,

thanks for taking care of the integration. Regarding the failed tests:

  • Interface/DetectorTest.*/"tsan" the TSAN tests fail in case TSAN is not able to allocate memory at the specified location. That happens from time to time. Then, all TSAN tests fail. There's not much we can do about it.
  • Interface/DetectorTest.Parallelism/* I will have a look into that one. Maybe there is an issue in the test itself.
  • Integration/DR.HowToSolution/, Integration/DR.NoIoRaces/0 looks like we get races on IO. Maybe symbol names changed, so they are not correctly intercepted. I will have a look.

Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance performance could be better
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants