
[WIP] Clarify default number of threads. #4975

Closed
wants to merge 1 commit into from

Conversation

trivialfis
Member

While looking into #4843, I found that the configuration of nthreads is neither unified nor clear. This PR clarifies the behaviour of nthreads and adds a simple function that is used throughout XGBoost, with tests.

@trivialfis trivialfis requested a review from hcho3 October 23, 2019 07:48
@trivialfis
Member Author

@hcho3 Now we need to face the inevitable: #4619

Member

@RAMitchell RAMitchell left a comment

This is a major change in default behaviour. Since I have been involved with xgboost it has always used all threads by default for training.

Your choice of default is probably a good one. Particularly on systems with hyperthreading this can dramatically improve performance. How confident are we that this applies broadly? Some users out there may update their xgboost and get a major performance regression.

Maybe find some examples of what other libraries do in this situation, so we have some kind of consistent behaviour. I don't think I have seen another library automatically choose half the number of threads. Also, if we change it, it needs to be advertised as a breaking change (or at least a significant change in expected behaviour).

@hcho3
Collaborator

hcho3 commented Oct 25, 2019

@RAMitchell We have an example in dmlc/tvm: https://github.com/dmlc/tvm/blob/cffb4fba03ea582417e2630bd163bca773756af6/src/runtime/threading_backend.cc#L226-L230
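For reference, the tvm heuristic linked above boils down to something like the following. This is an illustrative Python sketch, not XGBoost's actual code; `default_nthread` is a made-up name:

```python
import os

def default_nthread(requested=0):
    """Pick a thread count when the user did not set one.

    Hypothetical sketch: mirror the tvm-style heuristic of defaulting to
    half the logical CPUs to avoid hyper-thread oversubscription.
    A positive user-requested value always wins.
    """
    if requested > 0:
        return requested
    logical = os.cpu_count() or 1
    return max(1, logical // 2)
```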

@trivialfis trivialfis changed the title Clarify default number of threads. [WIP] Clarify default number of threads. Oct 25, 2019
@trivialfis
Member Author

@RAMitchell @hcho3 I see the significance of this change. Marked as WIP for further discussion.

@RAMitchell
Member

I'm thinking let's take your approach and document it clearly. For the majority of cases it is correct and the best we can do, given that there is no way to reliably detect hyper-threads from C++.

@trivialfis
Member Author

trivialfis commented Oct 26, 2019

@RAMitchell In C++ land, no; on Linux, yes.
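For the record, a Linux-only sketch of what such detection could look like, reading the sysfs topology files. The function names here are my own, not from this PR:

```python
from pathlib import Path

def parse_siblings(s):
    """Count CPUs in a sysfs-style list such as "0,4" or "0-3" or "0-1,4-5"."""
    count = 0
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            count += int(hi) - int(lo) + 1
        else:
            count += 1
    return count

def physical_cores():
    """Count physical cores by grouping logical CPUs that share a core.

    Linux-only: each cpuN/topology/thread_siblings_list names all
    hyper-thread siblings of that CPU, so distinct lists = physical cores.
    Returns None where sysfs is unavailable.
    """
    base = Path("/sys/devices/system/cpu")
    seen = set()
    for lst in base.glob("cpu[0-9]*/topology/thread_siblings_list"):
        seen.add(lst.read_text().strip())
    return len(seen) or None
```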

@trivialfis
Member Author

trivialfis commented Oct 26, 2019

@RAMitchell Also, right now (before this PR) we use whatever OMP setting exists outside of the XGBoost context, so users may have a global thread configuration; this PR will change that behaviour. Now that I think about it, it's evil to "be smart" and add configuration. Quoting from Python:

import this
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.

So more input is welcome. @trams @hcho3 ;-)

@RAMitchell
Member

It would be good to leave some hints in the docs about the evils of hyper-threading and suggest manually configuring nthread for optimal performance.

@trams
Contributor

trams commented Oct 26, 2019

I like the general direction of this pull request. Your code change makes it very clear that by default we use NUM_OF_LOGICAL_CPU / 2, and I think a comment should be added explaining that this is done because of hyper-threading (it is not obvious).

Now here is my feedback

  1. We sometimes run xgboost in a hosted notebook, so the kernel is actually running inside a Mesos container restricted to 2 or 4 CPUs. I always set n_thread manually, so I do not know what the default value would be. Either way, I am not sure that ignoring hyper-threading is a particularly good idea in containers. I am not sure whether the Linux CFS and its share limits (the mechanism Kubernetes and Mesos use) are aware of hyper-threading. In the cloud we also usually have multi-processor machines, which makes things even more complex.
  2. I suggest adding some logging: if n_thread is not set, pick a reasonable default and print it to stdout or stderr so a user can see the chosen value.
  3. I suggest creating a Python function get_default_nthread to expose this value to the user (to give them the ability to troubleshoot), like wrf-python did: https://wrf-python.readthedocs.io/en/latest/user_api/generated/wrf.omp_get_num_procs.html

Meanwhile I will try to launch training in this hosted solution and see how it works in 0.90
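As a side note on the container question: Python can at least report the CPUs the process is allowed to run on, though this reflects cpusets/affinity masks, not CFS quotas. A small illustrative sketch, not part of this PR:

```python
import os

def available_cpus():
    """CPUs the current process may run on.

    sched_getaffinity respects cpusets and taskset, but NOT CFS
    bandwidth limits, so a quota-limited container can still report
    every logical core of the host machine.
    """
    try:
        return len(os.sched_getaffinity(0))  # Linux only
    except AttributeError:
        return os.cpu_count() or 1
```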

@trams
Contributor

trams commented Oct 27, 2019

Also I forgot to share two interesting links

  1. https://www.openmp.org/spec-html/5.0/openmpsu114.html — it seems this function returns the number of processors (logical?), so it won't return the container size. I do not know how to fetch the container size :(
  2. http://mesos.apache.org/documentation/latest/isolators/cgroups-cpu/#effects-on-application-when-using-cfs-bandwidth-limiting — CFS bandwidth limiting has interesting side effects.
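For what it's worth, the CFS limit those links describe can be read from the cgroup (v1) filesystem directly; cgroup v2 uses a single cpu.max file instead. A hedged sketch (the default paths and the function name are illustrative):

```python
import math

def cgroup_cpu_quota(quota_path="/sys/fs/cgroup/cpu/cpu.cfs_quota_us",
                     period_path="/sys/fs/cgroup/cpu/cpu.cfs_period_us"):
    """Effective CPU count implied by CFS bandwidth limits, or None.

    quota/period gives the fraction of CPU time the cgroup may use per
    period; a quota of -1 means "no limit". Returns None when no limit
    applies or the cgroup files are absent (e.g. outside a container).
    """
    try:
        with open(quota_path) as f:
            quota = int(f.read())
        with open(period_path) as f:
            period = int(f.read())
    except OSError:
        return None
    if quota <= 0:  # -1 means unlimited
        return None
    return max(1, math.ceil(quota / period))
```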

@trivialfis
Member Author

trivialfis commented Oct 27, 2019

@trams Thanks for your suggestions and pointers. They are really helpful!

> Either way, I am not sure that ignoring hyper-threading is a particularly good idea in containers.

I think none of us has a good idea of how to make the default value right across different platforms. That's why I quoted the Python Easter egg; I'm thinking we should just stick with whatever OMP defaults to and let users choose explicitly.

> If n_thread is not set, pick a reasonable default and output it to stdout or stderr so a user can see the chosen value

I'm trying to avoid warnings; in many issues people choose to ignore warnings by setting verbosity = 0, which suppresses many truly important warnings, like the one about setting the updater parameter.

> I suggest creating a Python function get_default_nthread to expose this value to the user

Aside from this PR, I want to create a global config store for XGBoost, like xgb.config.gpu_id = 0, with some syntactic sugar, e.g. in Python:

with xgb.config(gpu_id=0, nthread=16) as xgb_config:
    dtrain = xgb.DMatrix(X_train, label=y_train)
    xgb.train({'tree_method': 'gpu_hist', ...}, dtrain)

The configuration may be available per booster and globally, but right now it's not at the top of the to-do list.
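To make the idea concrete, here is a minimal sketch of how such a context-manager config store could behave. This is purely hypothetical, not the actual xgboost API; the class and names are invented:

```python
import contextlib

class _Config:
    """Hypothetical global config store sketch (not the real xgboost API)."""

    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    @contextlib.contextmanager
    def __call__(self, **kwargs):
        # Apply the overrides for the duration of the `with` block,
        # then restore the previous global state on exit.
        saved = dict(self._values)
        self._values.update(kwargs)
        try:
            yield self
        finally:
            self._values = saved

config = _Config()
```

Inside `with config(gpu_id=0, nthread=16):` any code consulting `config.get("nthread")` would see 16; outside the block the previous value is restored.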

> https://www.openmp.org/spec-html/5.0/openmpsu114.html it seems the function returns the number of processors (logical?), so it won't return the container size.

I tried this and it returns the number of logical threads (i.e. counting hyper-threads).

> CFS Bandwidth limiting has interesting side effects.

There are other constraints, like memory bandwidth. Building the histogram is very sensitive to it, so having more threads with limited memory bandwidth is also harmful.

@trams
Contributor

trams commented Nov 20, 2019

> @trams Thanks for your suggestions and pointers. They are really helpful!
>
>> Either way, I am not sure that ignoring hyper-threading is a particularly good idea in containers.
>
> I think none of us has a good idea of how to make the default value right across different platforms. That's why I quoted the Python Easter egg; I'm thinking we should just stick with whatever OMP defaults to and let users choose explicitly.

Yes, I like this principle. Explicit is better than implicit.

>> If n_thread is not set, pick a reasonable default and output it to stdout or stderr so a user can see the chosen value
>
> I'm trying to avoid warnings; in many issues people choose to ignore warnings by setting verbosity = 0, which suppresses many truly important warnings, like the one about setting the updater parameter.

Fair enough, but I still strongly suggest:

  1. Output it to the debug log at least.
  2. Expose it in xgboost-spark (or even fail if it is not explicitly set there) as a Spark Accumulator or Spark Config, so one can see it in the UI.

>> I suggest creating a Python function get_default_nthread to expose this value to the user
>
> Aside from this PR, I want to create a global config store for XGBoost, like xgb.config.gpu_id = 0, with some syntactic sugar, e.g. in Python:
>
>     with xgb.config(gpu_id=0, nthread=16) as xgb_config:
>         dtrain = xgb.DMatrix(X_train, label=y_train)
>         xgb.train({'tree_method': 'gpu_hist', ...}, dtrain)
>
> The configuration may be available per booster and globally, but right now it's not at the top of the to-do list.

>> https://www.openmp.org/spec-html/5.0/openmpsu114.html it seems the function returns the number of processors (logical?), so it won't return the container size.
>
> I tried this and it returns the number of logical threads (i.e. counting hyper-threads).
I think I failed to communicate my concern. It is not whether the function returns the number of physical processors or logical ones (hyper-threads); it is how the function behaves inside a Mesos container (or any other container).

Let me give you some context.

In my company we have a hosted notebook solution that launches notebooks in the cloud. Behind the scenes it launches a kernel inside a Mesos container (with 8 CPUs by default). From a user's perspective, the number of logical cores available is the size of the container (so it would be 8), but as far as I can tell from the Mesos documentation, the CPU limit is implemented using CFS bandwidth limiting, which results in OMP thinking it has access to all the logical cores of the machine (72 in our case) instead of the number actually "available" in the container (8 in our case). This leads to increased overhead from handling all these extra threads for nothing.

This Mesos container thing has already bitten me a few times. That's why I strongly suggest either having some logging or exposing the default in the Python library, so a user can look it up and check whether it is sane.

>> CFS Bandwidth limiting has interesting side effects.
>
> There are other constraints, like memory bandwidth. Building the histogram is very sensitive to it, so having more threads with limited memory bandwidth is also harmful.

@@ -142,6 +145,15 @@ class Range {
};

int AllVisibleGPUs();

inline int OmpDefaultThreads(int32_t threads) {
Contributor

I suggest making it a non-inline function and exposing it in the C API and the Python API.

4 participants