
[WIP] Clarify default number of threads. #4975

Closed
wants to merge 1 commit into from

Conversation

trivialfis
Member

While looking into #4843, I found that the configuration of nthreads is neither unified nor clear. This PR clarifies the behaviour of nthreads and adds a simple function that is used throughout XGBoost, with tests.

@trivialfis trivialfis requested a review from hcho3 October 23, 2019 07:48
@trivialfis
Member Author

@hcho3 Now we need to face the inevitable: #4619

Member

@RAMitchell RAMitchell left a comment

This is a major change in default behaviour. Since I have been involved with xgboost it has always used all threads by default for training.

Your choice of default is probably a good one. Particularly on systems with hyperthreading this can dramatically improve performance. How confident are we that this applies broadly? Some users out there may update their xgboost and get a major performance regression.

Maybe find some examples of what other libraries do in this situation, so we have some kind of consistent behaviour. I don't think I have seen another library automatically choose half the number of threads. Also, if we change it, it needs to be advertised as a breaking change (or at least a significant change in expected behaviour).

@hcho3
Collaborator

hcho3 commented Oct 25, 2019

@RAMitchell We have an example in dmlc/tvm: https://github.com/dmlc/tvm/blob/cffb4fba03ea582417e2630bd163bca773756af6/src/runtime/threading_backend.cc#L226-L230
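For reference, the tvm heuristic linked above boils down to something like the following. This is an illustrative Python sketch, not XGBoost's actual code; `default_nthread` is a made-up name:

```python
import os

def default_nthread(requested=0):
    """Pick a thread count when the user did not set one.

    Hypothetical sketch: mirror the tvm-style heuristic of defaulting to
    half the logical CPUs to avoid hyper-thread oversubscription.
    A positive user-requested value always wins.
    """
    if requested > 0:
        return requested
    logical = os.cpu_count() or 1
    return max(1, logical // 2)
```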

@trivialfis trivialfis changed the title Clarify default number of threads. [WIP] Clarify default number of threads. Oct 25, 2019
@trivialfis
Member Author

@RAMitchell @hcho3 I see the significance of this change. Marked as WIP for further discussion.

@RAMitchell
Member

I'm thinking let's take your approach and document it clearly. For the majority of cases it is correct and the best we can do, given that there is no way to reliably detect hyper-threads from C++.

@trivialfis
Member Author

trivialfis commented Oct 26, 2019

@RAMitchell In C++ land, no; on Linux, yes.
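For the record, a Linux-only sketch of what such detection could look like, reading the sysfs topology files. The function names here are my own, not from this PR:

```python
from pathlib import Path

def parse_siblings(s):
    """Count CPUs in a sysfs-style list such as "0,4" or "0-3" or "0-1,4-5"."""
    count = 0
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            count += int(hi) - int(lo) + 1
        else:
            count += 1
    return count

def physical_cores():
    """Count physical cores by grouping logical CPUs that share a core.

    Linux-only: each cpuN/topology/thread_siblings_list names all
    hyper-thread siblings of that CPU, so distinct lists = physical cores.
    Returns None where sysfs is unavailable.
    """
    base = Path("/sys/devices/system/cpu")
    seen = set()
    for lst in base.glob("cpu[0-9]*/topology/thread_siblings_list"):
        seen.add(lst.read_text().strip())
    return len(seen) or None
```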

@trivialfis
Member Author

trivialfis commented Oct 26, 2019

@RAMitchell Also, right now (before this PR) we use whatever OMP setting exists outside of the XGBoost context, so users may have a global thread configuration; this PR will change that behaviour. Now that I think about it, it's evil to "be smart" and add configuration. Quoting from Python:

import this
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.

So more input is welcome. @trams @hcho3 ;-)

@RAMitchell
Member

It would be good to leave some hints in the docs about the evils of hyper-threading and suggest manually configuring nthread for optimal performance.

@trams
Contributor

trams commented Oct 26, 2019

I like the general direction of this pull request. Your code change makes it very clear that by default we use NUM_OF_LOGICAL_CPU / 2, and I think a comment should be added explaining that this is done because of hyper-threading (it is not obvious).

Now here is my feedback

  1. We sometimes run xgboost in a hosted notebook, so the kernel is actually running inside a Mesos container restricted to 2 or 4 CPUs. I always set n_thread manually, so I do not know what the default value would be. Either way, I am not sure that ignoring hyper-threading is a particularly good idea in containers. I am not sure whether the Linux CFS and its share limits (the mechanism Kubernetes and Mesos use) are aware of hyper-threading. In the cloud we also usually have multi-processor machines, which makes things even more complex.
  2. I suggest adding some logging: if n_thread is not set, pick a reasonable default and print it to stdout or stderr so a user can see the chosen value.
  3. I suggest creating a Python function get_default_nthread to expose this value to the user (to give them the ability to troubleshoot), like wrf-python did: https://wrf-python.readthedocs.io/en/latest/user_api/generated/wrf.omp_get_num_procs.html

Meanwhile I will try to launch training in this hosted solution and see how it works in 0.90
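As a side note on the container question: Python can at least report the CPUs the process is allowed to run on, though this reflects cpusets/affinity masks, not CFS quotas. A small illustrative sketch, not part of this PR:

```python
import os

def available_cpus():
    """CPUs the current process may run on.

    sched_getaffinity respects cpusets and taskset, but NOT CFS
    bandwidth limits, so a quota-limited container can still report
    every logical core of the host machine.
    """
    try:
        return len(os.sched_getaffinity(0))  # Linux only
    except AttributeError:
        return os.cpu_count() or 1
```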

@trams
Contributor

trams commented Oct 27, 2019

Also I forgot to share two interesting links

  1. https://www.openmp.org/spec-html/5.0/openmpsu114.html — it seems this function returns the number of processors (logical?), so it won't return the container size. I do not know how to fetch the container size :(
  2. http://mesos.apache.org/documentation/latest/isolators/cgroups-cpu/#effects-on-application-when-using-cfs-bandwidth-limiting — CFS bandwidth limiting has interesting side effects.
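For what it's worth, the CFS limit those links describe can be read from the cgroup (v1) filesystem directly; cgroup v2 uses a single cpu.max file instead. A hedged sketch (the default paths and the function name are illustrative):

```python
import math

def cgroup_cpu_quota(quota_path="/sys/fs/cgroup/cpu/cpu.cfs_quota_us",
                     period_path="/sys/fs/cgroup/cpu/cpu.cfs_period_us"):
    """Effective CPU count implied by CFS bandwidth limits, or None.

    quota/period gives the fraction of CPU time the cgroup may use per
    period; a quota of -1 means "no limit". Returns None when no limit
    applies or the cgroup files are absent (e.g. outside a container).
    """
    try:
        with open(quota_path) as f:
            quota = int(f.read())
        with open(period_path) as f:
            period = int(f.read())
    except OSError:
        return None
    if quota <= 0:  # -1 means unlimited
        return None
    return max(1, math.ceil(quota / period))
```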

@trivialfis
Member Author

trivialfis commented Oct 27, 2019

@trams Thanks for your suggestions and pointers. They are really helpful!

> Either way, I am not sure that ignoring hyper-threading is a particularly good idea in containers.

I think none of us has a good idea of how to make the default value right across different platforms. That's why I quoted the Python Easter egg; I'm thinking we should just stick with whatever OMP defaults to and let users choose explicitly.

> If n_thread is not set, pick a reasonable default and output it to stdout or stderr so a user can see the chosen value

I'm trying to avoid warnings; in many issues people choose to ignore warnings by setting verbosity = 0, which suppresses many truly important warnings, like the one about setting the updater parameter.

> I suggest creating a Python function get_default_nthread to expose this value to the user

Aside from this PR, I want to create a global config store for XGBoost, like xgb.config.gpu_id = 0, with some syntactic sugar, e.g. in Python:

with xgb.config(gpu_id=0, nthread=16) as xgb_config:
    dtrain = xgb.DMatrix(X_train, label=y_train)
    xgb.train({'tree_method': 'gpu_hist', ...}, dtrain)

The configuration may be available per booster and globally, but right now it's not at the top of the to-do list.
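To make the idea concrete, here is a minimal sketch of how such a context-manager config store could behave. This is purely hypothetical, not the actual xgboost API; the class and names are invented:

```python
import contextlib

class _Config:
    """Hypothetical global config store sketch (not the real xgboost API)."""

    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    @contextlib.contextmanager
    def __call__(self, **kwargs):
        # Apply the overrides for the duration of the `with` block,
        # then restore the previous global state on exit.
        saved = dict(self._values)
        self._values.update(kwargs)
        try:
            yield self
        finally:
            self._values = saved

config = _Config()
```

Inside `with config(gpu_id=0, nthread=16):` any code consulting `config.get("nthread")` would see 16; outside the block the previous value is restored.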

> https://www.openmp.org/spec-html/5.0/openmpsu114.html it seems the function returns the number of processors (logical?), so it won't return the container size.

I tried this and it returns the number of logical threads (i.e. counting hyper-threads).

> CFS Bandwidth limiting has interesting side effects.

There are other constraints, like memory bandwidth. Building the histogram is very sensitive to it, so having more threads with limited memory bandwidth is also harmful.

@trams
Contributor

trams commented Nov 20, 2019

> @trams Thanks for your suggestions and pointers. They are really helpful!
>
>> Either way, I am not sure that ignoring hyper-threading is a particularly good idea in containers.
>
> I think none of us has a good idea of how to make the default value right across different platforms. That's why I quoted the Python Easter egg; I'm thinking we should just stick with whatever OMP defaults to and let users choose explicitly.

Yes, I like this principle. Explicit is better than implicit.

>> If n_thread is not set, pick a reasonable default and output it to stdout or stderr so a user can see the chosen value
>
> I'm trying to avoid warnings; in many issues people choose to ignore warnings by setting verbosity = 0, which suppresses many truly important warnings, like the one about setting the updater parameter.

Fair enough, but I still strongly suggest:

  1. Output it to the debug log at least.
  2. Expose it in xgboost-spark (or even fail if it is not explicitly set there) as a Spark Accumulator or Spark Config, so one can see it in the UI.

>> I suggest creating a Python function get_default_nthread to expose this value to the user
>
> Aside from this PR, I want to create a global config store for XGBoost, like xgb.config.gpu_id = 0, with some syntactic sugar, e.g. in Python:
>
>     with xgb.config(gpu_id=0, nthread=16) as xgb_config:
>         dtrain = xgb.DMatrix(X_train, label=y_train)
>         xgb.train({'tree_method': 'gpu_hist', ...}, dtrain)
>
> The configuration may be available per booster and globally, but right now it's not at the top of the to-do list.

>> https://www.openmp.org/spec-html/5.0/openmpsu114.html it seems the function returns the number of processors (logical?), so it won't return the container size.
>
> I tried this and it returns the number of logical threads (i.e. counting hyper-threads).
I think I failed to communicate my concern. It is not whether the function returns the number of physical processors or logical ones (hyper-threads); it is how the function behaves inside a Mesos container (or any other container).

Let me give you some context.

In my company we have a hosted notebook solution that launches notebooks in the cloud. Behind the scenes it launches a kernel inside a Mesos container (with 8 CPUs by default). From a user's perspective, the number of logical cores available is the size of the container (so it would be 8), but as far as I can tell from the Mesos documentation, the CPU limit is implemented using CFS bandwidth limiting, which results in OMP thinking it has access to all the logical cores of the machine (72 in our case) instead of the number actually "available" in the container (8 in our case). This leads to increased overhead from handling all these extra threads for nothing.

This Mesos container thing has already bitten me a few times. That's why I strongly suggest either having some logging or exposing the default in the Python library, so a user can look it up and check whether it is sane.

>> CFS Bandwidth limiting has interesting side effects.
>
> There are other constraints, like memory bandwidth. Building the histogram is very sensitive to it, so having more threads with limited memory bandwidth is also harmful.

@@ -142,6 +145,15 @@ class Range {
};

int AllVisibleGPUs();

inline int OmpDefaultThreads(int32_t threads) {
Contributor

I suggest making it a non-inline function and exposing it in the C API and the Python API.

4 participants